Streaming Live Data is the Heart of Observability

This blog was originally published Feb. 14, 2020 on humio.com. Humio is a CrowdStrike Company.

As a security engineer working on the core team at Humio, I focus on making Humio a powerful part of the security stack for our customers. I have a passion for cybersecurity, and like a lot of our customers, I love breaking stuff in the most unexpected ways so I can fix it.

For security, much of the power of Humio comes from its ability to ingest live streaming data and make it immediately available for updating alerts and visualizations and for running queries. This is the backbone for achieving complete observability and making an environment more secure.

Humio Enables Streaming Observability

Streaming log data is so important because it enables a paradigm shift in IT, DevOps, and Security thinking. In the past, alerts have been triggered in terms of time intervals: an hour, two hours, four hours, a week. Tools have improved, so it’s now possible to view your data as it flows through your system. Humio does that for us.

It’s all about being able to tell a story with your data. There’s so much happening that if you can’t make educated guesses, you have to assume it’s working, and that’s not always a good assumption.

From a security perspective, you want to be alerted about things as soon as they start, before they cause damage. So “as early as possible” is important. Now that you can have data and alerts working in sync, things can get addressed in real time.

Problems Arise When Now Isn’t Right Now

With other logging solutions you “run this search now,” and even though it seems like it includes current data, it’s really only looking at historic data. You lose granularity, because a complex search may run for, say, four hours. Those searches don’t include new streaming data, so your alerts are up to four hours out of date. If your eCommerce site goes down, your customers become your notification system, which is not a good look for business and can do real damage to the customer experience. You don’t need to learn about an attack four hours after it began; you need to know when it starts.

Log delays are another problem for security. If a mobile device has been offline, its logs may not be indexed for up to several days. If you search two hours back based on the timestamp in the event, you can easily miss those new logs with old timestamps. If an attacker knows this and sets their system time to three hours in the past, you could miss what they’re doing. This isn’t a problem when looking at data as a stream, because the data passes through the Humio state machine as it arrives, and search results, visualizations, and alerts are updated.
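To make the backdated-timestamp problem concrete, here is a minimal Python sketch. It is not Humio's implementation; the `Event` class, both functions, and the sample data are hypothetical and exist only to illustrate the idea. It compares a lookback search keyed on the event's own timestamp with a view keyed on when the event actually arrived.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_time: float   # timestamp claimed by the device, in seconds
    ingest_time: float  # when the log platform actually received it
    message: str

def search_by_event_time(events, now, lookback):
    """Batch-style search: keep events whose own timestamp is in the window."""
    return [e for e in events if now - lookback <= e.event_time <= now]

def seen_on_stream(events, last_checked, now):
    """Stream-style view: keep everything that arrived since the last check."""
    return [e for e in events if last_checked < e.ingest_time <= now]

HOUR = 3600.0
now = 100 * HOUR  # arbitrary "current time" for the example

events = [
    Event(event_time=now - 0.5 * HOUR, ingest_time=now - 0.5 * HOUR,
          message="normal login"),
    # Arrived one minute ago, but stamped three hours in the past.
    Event(event_time=now - 3 * HOUR, ingest_time=now - 60,
          message="backdated login"),
]

windowed = search_by_event_time(events, now, lookback=2 * HOUR)
streamed = seen_on_stream(events, last_checked=now - HOUR, now=now)

print([e.message for e in windowed])  # the backdated event is missing
print([e.message for e in streamed])  # both events are present
```

The two-hour lookback never sees the backdated event, while the arrival-ordered view picks it up as soon as it is ingested.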

The longer your solution stacks data, the more you have to go through when you perform a search. Looking back a year is around 365 terabytes if you’re logging one TB per day, which is a conservative estimate for a large organization. Most log management solutions grind to a halt if they update a visualization or alert based on a search over months of historical data. Updating from streaming data works because if you aggregate as it comes in, you just have to match the throughput of the load. As a developer, I don’t care about a report that’s a month old, because I have no idea what caused a spike a month ago. You want to address and remedy the situation right now, when you’re in the situation.
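The difference between aggregating on arrival and rescanning history can be sketched in a few lines of Python. This is an illustration under assumed names (`RunningErrorCount`, `rescan`, and the sample events are all hypothetical), not a description of how Humio stores or queries data.

```python
from collections import Counter

class RunningErrorCount:
    """Keeps a per-host error count up to date as events arrive."""

    def __init__(self):
        self.counts = Counter()
        self.events_processed = 0

    def update(self, event):
        # Constant work per event, no matter how much history exists.
        self.events_processed += 1
        if event["level"] == "ERROR":
            self.counts[event["host"]] += 1

def rescan(history):
    """Batch approach: recompute the counts over everything retained."""
    counts = Counter()
    for event in history:
        if event["level"] == "ERROR":
            counts[event["host"]] += 1
    return counts

incoming = [
    {"host": "web-1", "level": "INFO"},
    {"host": "web-1", "level": "ERROR"},
    {"host": "web-2", "level": "ERROR"},
    {"host": "web-1", "level": "ERROR"},
]

history = []
live = RunningErrorCount()
for event in incoming:
    history.append(event)
    live.update(event)

# Both approaches agree on the answer; only the cost differs.
print(live.counts == rescan(history))
```

The running aggregate only has to keep pace with the incoming load, while the cost of `rescan` grows with every day of retained data.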

If you only get information every 30 minutes, you’ll take it because it’s better than nothing. If you can get it in real time, you’ll pick that, because it’s more valuable to your business.

Log Observation Can Prepare You Even More Than Threat Simulation

Real-time observability and response are critical because you can’t directly control everything anymore; everything is in the cloud and off your local radar. It’s impossible to reproduce your environment to run simulations because environments are so dynamic. You can’t reproduce the impact that 100,000 users have on your system, so you can’t accurately simulate that.

How Humio Uses “Live” Data to Make Your Environment Observable

Humio searches streaming and retained data concurrently with sub-second results because it updates the state of queries as data streams in. You can act on things in real time, in real life, at the speed at which they happen.

Let’s look at a simple dashboard where we’re able to see that something is causing latency in the system.

With Humio’s notifiers set up, you receive a notice triggered by threshold events, so you don’t have to be physically staring at the thing. But looking at this board, I can see that the spike means something is causing a slowdown.
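As a rough illustration of a threshold-triggered notice, the Python sketch below fires a callback each time a metric crosses above a limit. It does not use Humio's notifier API; `check_threshold`, the 250 ms limit, and the latency samples are all made up for the example.

```python
def check_threshold(samples, threshold, notify):
    """Call notify once each time the value crosses above the threshold."""
    above = False
    for timestamp, value in samples:
        if value > threshold and not above:
            notify(timestamp, value)
        above = value > threshold

sent = []  # stands in for an email, chat message, or pager call

# Hypothetical (second, latency in ms) samples with two separate spikes.
latency_ms = [(0, 40), (1, 45), (2, 310), (3, 290), (4, 50), (5, 400)]

check_threshold(latency_ms, threshold=250,
                notify=lambda t, v: sent.append((t, v)))

print(sent)  # one notice per spike, not one per sample
```

Notifying on the crossing rather than on every sample above the limit keeps a sustained spike from flooding the channel.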

I can look at another chart and see the system healing itself.

The queue length is going back down, and the system is correcting itself, all in a few seconds. If I were to see another spike, I’d know almost immediately to keep watching, because something was going on.

With the color coding, I can also see which system is experiencing the issue — I can see it’s a restart in the cluster because the whole thing acts in unison.

With these dashboard visualizations, I can see an error, I can see how the system is dealing with it, and I can see how it’s recovering.

Another way to look at this same event is to line up the Queue Length with the Event Latency.

You can see the little space where there’s no data, which is where it went down, and then we see the system catch up again. The whole story is there in the dashboards.
