Observability (Re)defined

June 29, 2021

• Humio Staff • Next-Gen SIEM & Log Management

This blog was originally published Dec. 30, 2019 on humio.com. Humio is a CrowdStrike Company.

Observability. It’s a buzzword that has been making the rounds in the IT Ops, DevOps, and Security Ops space for at least a few years. If your job involves understanding or maintaining the functionality and performance of applications and systems, you’ve heard the term used to describe knowing what’s happening with the components of the system. Observability is an important concept for those responsible for the health of any complex IT infrastructure, and it’s worth making sure we agree on what it means and how to achieve it.

Let’s Start at the Start

I love writing, and one of the things I enjoy is delving into the meaning and origin of words. So I think it’s useful to start with a basic definition. In a general sense, to

observe

something is to watch it carefully with the hope of arriving at a judgment. Something is

observable if it can be seen or watched — like the observable universe. In physics, an observable is a physical property that can be measured directly, like temperature or position. From

observe

and

observable

comes the term

observability

(think “observe-ability”), the

ability

to observe. It describes a property of something, so something is either observable or it isn’t.

1960: Observability and Control Theory

In the technical world, the term

observability

has its roots in Control Theory, a field of mathematics for dynamic engineering and mechanical systems. A large system can be optimized if the right controls are in place, and those controls can work only if they know how the system is performing. In a system, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

In 1960,

Rudolf E. Kálmán

introduced the term to describe how well a system can be measured by its outputs. In mechanical systems, sensors and detectors measure the output to inform controls that are in place.

2013: The Twitter Observability Team Describes Its Mission

In September 2013, engineers at Twitter wrote a blog post called

Observability at Twitter. It’s one of the first times the term observability was used in the context of IT systems. It’s especially noteworthy that it is used at Twitter, one of the platforms with the most real-time event data in the world. "Engineers at Twitter need to determine the performance characteristics of their services, the impact on upstream and downstream services, and get notified when services are not operating as expected," the post states. "It is the Observability team’s mission to analyze such problems with our unified platform for collecting, storing, and presenting metrics. Creating a system to handle this job at Twitter scale is really difficult." The post goes on to explain how they "capture, store, query, visualize and automate this entire process."

2016: The (Four) Pillars of Observability at Twitter

A few years later,

Anthony Asta

from the Observability Engineering team at Twitter created a blog post called

Observability at Twitter: technical overview, part I, where he outlined four pillars of their team’s charter. This is one of the early instances of Observability Pillars being described by data types. Note that Alerting and Visualizations were included, even though they are outputs of data, rather than data types themselves. The post states: "The Observability Engineering team at Twitter provides full-stack libraries and multiple services to our internal engineering teams to monitor service health, alert on issues, support root cause investigation by providing distributed systems call traces, and support diagnosis by creating a searchable index of aggregated application/system logs." The post proceeds to outline the four pillars of the team's charter:

Monitoring
Alerting/visualization
Distributed systems tracing infrastructure
Log aggregation/analytics

2017: The Three Pillars of Observability

In February 2017,

Peter Bourgon

attended the 2017 Distributed Tracing Summit. He was part of a discussion about defining and scoping how tracing helps provide observability. In a blog post called

Metrics, tracing, and logging, he describes how he thought they could probably map out the domain of instrumentation, or observability, as a sort of Venn diagram. He had a think over a coffee break and came up with this.

Image source: https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

2018: The Year of Observability

In 2018, observability and the three pillars of logging, metrics, and tracing became part of the mainstream conversation. In March 2018, QCon added an “Observability” track to the conference, and leaders in the industry began evangelizing the importance of observability. Every company involved in this track described how their company’s products and services provide observability. In June, 2018,

Geeta Schmidt, CEO of Humio, added her thoughts in a Medium article called

Data-Driven Observability & Logs. She points out that simply having tools in place for log management, metrics, and tracing isn’t enough to get value from them. She insists that a cultural shift is needed that values facts and feedback, is data-driven during debugging, and uses this mindset to iterate, improve, and solve problems. In July 2018,

Cindy Sridharan published a definitive book for O’Reilly called Distributed Systems Observability. The book outlines the three pillars of observability, and details which tools to use and when. By the second half of 2018, nearly everyone in the infrastructure systems space claims that if they don’t provide comprehensive observability with their product or service, they at least provide a critical component. As you can see, Humio was included. As 2018 came to a close, cracks began to appear in the model of the three pillars of observability. On September 25, 2018,

Charity Majors, CTO of Honeycomb, warns (in her inimitable way) that describing observability as three pillars limits the discussion. She shared

a series of Tweets where she boldly exclaims “THERE ARE NO THREE PILLARS OF OBSERVABILITY,” adding "and the fact that everybody keeps blindly repeating this mantra (and cargo culting these primitives) is probably why our observability tooling is 10 years behind the rest of our software tool chain." She further states: "it’s the EVENT that is the execution path of your code through the system. It is the correct lens through which to understand your systems from the inside out.” Many in the industry took heed, and began to think of observability apart from the three pillars.

2019: Passionate Debate Ensues

As companies purchased and deployed tools to collect data from each of the three pillars, they found they had access to a lot more system data, but they still didn’t achieve 100% observability of their infrastructure. It turns out that the tools that promised observability couldn’t handle the volume of data coming from terabytes of unstructured data. Engineers had to limit the data they would keep to stay within limits capped by expensive licenses. They found the tools they were using were too slow, due to indexing latency and other issues that made real-time observability impossible. And the new systems were too complicated, with difficult deployment, impractical and expensive maintenance, and interfaces that were inconsistent, hard to learn, and hard to use. Across the industry, tech leaders joined the conversation to redefine observability. In February 2019,

Ben Sigelman, CEO and co-founder of LightStep, fired a shot across the bow with a blog post called

Three Pillars With Zero Answers: A New Scorecard for Observability. He describes how high-cardinality metrics overwhelm systems. He points out that logs become too expensive if you want to ingest, retain, and store the massive volumes of data coming from distributed systems. And traces become impractical if you don’t determine the right samples. Even if all that can be overcome, he points out that

none of that directly addresses a particular pain point, use case, or business need. In August 2019,

Mads Hartmann, Site Reliability Engineer at Glitch, did a deep dive into observability over several months, and published his thoughts in Journey into Observability: Reading material. His takeaway is that you can’t achieve observability simply from the telemetry (the process of recording and transmitting the readings of instruments) your system produces. He proposes an alternative view of observability:

...Observability is all about being able to ask questions of your system and get answers based on the existing telemetry it produces; if you have to re-configure or modify the service to get answers to your questions you haven’t achieved observability yet.

Charity Majors agrees with the direction Mads is heading. In August 2019, she published an article in The New Stack: Observability — A 3-Year Retrospective. In it she leads the challenge to appropriate a new definition for observability:

If we do not appropriate “observability” to denote the differences between known unknowns and unknown-unknowns, between passive monitoring and exploratory debugging, it is not clear what other terms are available to us (and unclear that the same fate will not befall them).

She goes on to suggest that realizing the promise of observability comes from being able to inspect the internal state of your systems and be able to ask any question to understand it.

By using events and passing along the full context, conversely I can ask any question of my systems and inspect its internal state, therefore I can understand any state my system has gotten itself into — even if I have never seen it before, never conceived of it before! I can understand anything that is happening inside my system, any state it may be in — without having to ship new code to handle the state. This is key. This is observability.

2020: Observability (Re)defined

The industry is aligning to the latest thinking, that observability isn’t a feature you can install or a service you can subscribe to. It’s not simply the deployment and collection of three types of data. Observability is something you either have, or you don’t. It only is achieved when you have all the data to answer any question, whether predictable or not. We agree with those that think of it this way:

Observability is when you’re able to understand the internal state of a system from the data it provides, and you can explore that data to answer any question about what happened and why.

I have no doubt that the definition will evolve further.

How Humio Helps Make Your System Observable

At Humio, we believe in the value of data-driven logging and the benefits companies can derive from this in their observability stack. Because the Humio platform can collect any unstructured event data — including logs, metrics, and traces — it makes any system that provides event data observable. With simple pricing and unique product, we are on a mission to bring this value to engineering teams who’ve been struggling until now.

About the Humio Platform

Humio delivers real-time monitoring and investigation capabilities for developers and operations teams. Humio is:

SCALABLE:

Proprietary time-series database engine efficiently handles multi-TB/day log volumes (one single instance of Humio can ingest more than 1 TB of log volume/day)
FAST:

Real-time streaming capabilities built-in for live query, dashboard capabilities, and alerts.
FLEXIBLE:

Open ingest API and support for open source data shippers and common data formats
EASY TO USE:

Familiar query structure and intuitive web interface that developers and operations teams love to use
AFFORDABLE:

Humio offers the lowest

Total Cost of Ownership

of any major competitor, and has an industry-leading Unlimited Plan.

Key Features

Humio is available in both On-Premises and Cloud versions
Streaming queries and dashboards stored in an in-memory state machine for instant data access
Linear scaling and blazing-fast search speed
Ad-hoc and historical search with a simple search language
Live dashboards and shareable dashboard widgets
Support for multi-tenant infrastructures
Open API for data ingest and support for open-source log shippers (Logstash and Filebeat) and common log and data formats (Fluentd, Syslog, NGINX, JSON, Kafka, Zeek, and more)
Alerting integrations with Slack, PagerDuty, VictorOps, OpsGenie, email, and webhooks.

The Humio Unlimited Plan

Humio’s Unlimited Plan gives organizations access to all the data they need to prepare for the unknown. With unlimited log aggregation, real-time search and exploration, and the ability to retain compressed data for longer, Humio minimizes recovery time, and makes it easy to get to the root cause of unexpected issues. Find out more from our website’s

Why Unlimited? page.

Get started by downloading a free 30-day trial version (no credit card required), or requesting a demo.

Privacy
Request Info
Blog
Contact Us
1.888.512.8906
Accessibility

Observability (Re)defined

Let’s Start at the Start

1960: Observability and Control Theory

2013: The Twitter Observability Team Describes Its Mission

2016: The (Four) Pillars of Observability at Twitter

2017: The Three Pillars of Observability

2018: The Year of Observability

2019: Passionate Debate Ensues

2020: Observability (Re)defined

How Humio Helps Make Your System Observable

About the Humio Platform

Key Features

The Humio Unlimited Plan

CrowdStrike Falcon Platform

Ready to protect your business?

Subscribe

See CrowdStrike Falcon in action

Observability (Re)defined

Let’s Start at the Start

1960: Observability and Control Theory

2013: The Twitter Observability Team Describes Its Mission

2016: The (Four) Pillars of Observability at Twitter

2017: The Three Pillars of Observability

2018: The Year of Observability

2019: Passionate Debate Ensues

2020: Observability (Re)defined

How Humio Helps Make Your System Observable

About the Humio Platform

Key Features

The Humio Unlimited Plan

Related Content

Falcon Platform IOAs Arrive in Falcon Next-Gen SIEM to Identify New Threats

5 High-Impact Use Cases for Falcon Onum

Detecting CVE-2026-20929: Kerberos Authentication Relay via CNAME Abuse

CrowdStrike Falcon Platform

Ready to protect your business?

Subscribe

See CrowdStrike Falcon in action