Everything You Think You Know About (Storing and Searching) Logs Is Wrong
This blog was originally published Aug. 25, 2020 on humio.com. Humio is a CrowdStrike Company.
Humio’s technology was built out of a need to rethink how log data was collected, stored, and searched. As the requirements for data ingest and management are increasing, traditional logging technologies and the assumptions on which they were built no longer match the reality of what organisations have to manage today.
This article explores some of those assumptions, the changes in technology that impact them, and why Humio’s purpose-built approach is a better option for customers to get value with real-time search and lower costs.
3 assumptions about log data
There are three main assumptions that just don’t hold true today (and we like things that come in threes because it makes for neat sections in a blog).
1. Indexes are for search, therefore searches need indexes – False
Traditional thinking about how to do search at scale comes down to one concept: indexing the data. Indexing traditionally involves scanning the documents in question, extracting and ranking the terms, etc., etc. For many years, the ubiquitous technology for this has been Apache Lucene. This is the underlying technology in the search engines of many tools, and in more recent years has been “industrialized” into a really flexible technology thanks to the work of Elastic with the Elasticsearch tools.
But it’s not the best choice for logs (or more specifically streaming human-readable machine data). The assumption that indexes are best for all search scenarios is wrong.
This is no reflection on the technology itself; it’s designed for randomised search and it does that very well. Elastic gets a pass: they didn’t set out to build a log aggregation and search tool.
The other vendors that did set out to build such a tool and took an index-based approach may also get a pass, because indexing was the prevailing technology at the time.
2. Compression and decompression are slow – Not anymore
Data can be compressed to make storage more efficient, but the perception remains that compressing and decompressing data will slow things down significantly. But compressing data can actually make search faster. There are two pieces to that discussion.
Firstly, if you design and optimise your system around compression, it makes reading, writing, storing, and moving data faster. Humio does exactly that, and you can read about some of this thinking in a Humio blog post: How fast can you grep?. Compression is assumed to be slow because so many users have experienced it in systems where it was introduced as an afterthought, a kludge to help solve the storage requirements of indexed data.
Secondly, compression algorithms are still making progress and being optimised. There are arguments that the latest techniques are reaching theoretical limits of performance, but let’s not declare that everything that will be invented has been.
Humio makes use of the Zstandard family of compression algorithms, and they are FAST. More about that in a bit.
3. Datasets become less manageable with size/age, or are put in the freezer – Datasets are not vegetables!
We often talk to prospective customers who have a requirement for Hot/Warm/Cold storage, and in the context of uncompressed, indexed data, this can make sense. People are used to the concept that storage is expensive, and that the storage “tier” is something the application needs to be aware of (e.g., hot data on local disk, warm data on SAN, etc).
Two things have changed significantly here: storage is no longer as expensive as people are used to it being, and a whole new class of storage, Object Storage, has become available to application developers and users alike.
How does Humio break these conventions?
We’re not going to give you all the details for what Humio does in these areas, but we can certainly discuss the general ways in which Humio reexamined these assumptions, and some of the results of doing so.
Indexes are not the solution
Indexing streaming data for the purposes of search is expensive, slow, and doesn’t result in a faster system for the kinds of use cases customers have for Humio. The interesting thing is that even the leading vendors of other data analytics platforms know this. They have had to work around this very problem to achieve acceptable solutions with things like “live tail” and “live searches”, etc. These index-based tools have to work around their own indexing latency to get the performance needed to claim “live” data … that should have been a big hint that maybe indexing wasn’t needed at all!
By moving away from the use of indexes (Ed: Humio still does actually index event timestamps, but we get the point), Humio does not have to do any of the processing and index maintenance that goes along with it. This means that:
- When data arrives at Humio it is ready for search almost immediately. We’re talking 100-300 ms between event arrival and that same event being returned in a search result (manual search or a live search that is already running, or an alert, or a dashboard update).
- Humio does not have to maintain indexes, merge them with new indexes, track which indexes exist, fix corruption in indexes, none of that. For those technologies that do rely on indexes, the indexes themselves become very large. Assuming the index is used to make the entire event searchable, indexing can make the data up to 300% larger than it was in its raw form.
- With Humio, all queries are against the same datastore; there’s no split processing between historical and live data. Now consider where indexing is used for “search” and some sort of live streaming query is used to power “live” views of the data: tools that take this approach will often show users a spike in a live dashboard, but the user cannot search those events in detail or even view them in the live view.
Find out more about Humio’s index-free architecture in a blog post: How Humio’s index-free log management searches 1 PB in under a second.
Humio uses optimal compression algorithms to ensure minimal storage space is required (did I mention we don’t build indexes?); often achieving 15:1 compression against the original raw data, and in some cases exceeding 30:1 compression.
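To put those ratios next to the index overhead mentioned earlier, here is a rough back-of-the-envelope sketch. The 1 TB/day volume is a made-up figure for illustration, and "300% larger" is read literally as four times the raw size, which is an interpretation rather than something the numbers above spell out.

```python
# Hypothetical volume for illustration only: 1 TB of raw log data per day.
RAW_TB_PER_DAY = 1.0


def compressed_size(raw_tb: float, ratio: float) -> float:
    """Size on disk when raw data is compressed at ratio:1."""
    return raw_tb / ratio


def indexed_size(raw_tb: float, percent_larger: float) -> float:
    """Size on disk when indexing makes the data percent_larger % bigger than raw."""
    return raw_tb * (1 + percent_larger / 100)


typical = compressed_size(RAW_TB_PER_DAY, 15)   # the "often achieved" 15:1 ratio
best = compressed_size(RAW_TB_PER_DAY, 30)      # the "some cases" 30:1 ratio
indexed = indexed_size(RAW_TB_PER_DAY, 300)     # upper bound quoted for indexing

print(f"compressed at 15:1:   {typical:.3f} TB/day")
print(f"compressed at 30:1:   {best:.3f} TB/day")
print(f"indexed, 300% larger: {indexed:.1f} TB/day")
print(f"indexed vs 15:1:      {indexed / typical:.0f}x more storage")
```

These are the extremes quoted in this post, not measurements; real ratios depend heavily on the data.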
These compression algorithms allow for extremely fast decompression of the data. Humio analyses and organises incoming data so it can make use of techniques like compression dictionaries, meaning it can achieve high compression ratios even on the optimally-sized segment files in storage (i.e., we don’t have to build and access monolithic blocks of data to achieve high compression ratios).
For more background on the kinds of techniques Humio uses, this article from Facebook Engineering is a good read: Smaller and faster data compression with Zstandard.
Find out more about Humio compression on the product page: Humio: Keep 5-15x more data, for longer.
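To see why a compression dictionary helps with small blocks, here is a minimal sketch. It uses the preset-dictionary feature of Python's standard `zlib` module as a stand-in, because Zstandard bindings are not assumed to be installed; it is not Humio's implementation, and the log lines are invented for illustration.

```python
import zlib

# Hypothetical sample lines. A real system would build its dictionary from
# samples of the data it has recently ingested.
SAMPLE_LINES = [
    b"2020-08-25T12:00:00Z INFO service=checkout status=200 path=/api/v1/cart\n",
    b"2020-08-25T12:00:01Z WARN service=checkout status=503 path=/api/v1/pay\n",
    b"2020-08-25T12:00:02Z INFO service=search status=200 path=/api/v1/query\n",
]
# A preset dictionary is simply bytes the compressor is allowed to refer back to.
DICTIONARY = b"".join(SAMPLE_LINES)


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress one small block, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=6, zdict=zdict)
    else:
        comp = zlib.compressobj(level=6)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Reverse compress(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


small_block = b"2020-08-25T12:00:03Z INFO service=checkout status=200 path=/api/v1/pay\n"
plain = compress(small_block)
primed = compress(small_block, DICTIONARY)

print(f"raw: {len(small_block)} bytes")
print(f"compressed without dictionary: {len(plain)} bytes")
print(f"compressed with dictionary:    {len(primed)} bytes")
```

A single short event has little internal repetition, so compressing it alone gains almost nothing. With a dictionary the compressor can point at patterns it has already seen elsewhere, which is what lets small, independently readable blocks still compress well.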
The final piece of the puzzle here is getting access to the right data when a user issues a query. Humio can’t go scanning all the raw event content no matter how fast it might be. This is where the storage pattern that Humio utilises comes into the picture, and the heuristics for a node in the cluster to get access to the data and scan it.
Firstly, segment files are built around optimally-sized groups of data (some secret sauce is added here to make that happen effectively and transparently to the user). These segment files also have accompanying bloom filters built, which means Humio can quickly and effectively identify only the relevant segments for any given query.
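As a rough illustration of how a per-segment bloom filter lets a query skip irrelevant segments, here is a minimal sketch. The `BloomFilter` and `Segment` classes, the whitespace tokenisation, and the filter sizing are all assumptions made for the example; they are not Humio's actual data structures.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterator, List


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, token: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


@dataclass
class Segment:
    """Hypothetical segment: a time range, its events, and a bloom filter."""
    name: str
    start: int  # inclusive, e.g. epoch seconds
    end: int    # inclusive
    events: List[str]
    bloom: BloomFilter = field(default_factory=BloomFilter)

    def __post_init__(self) -> None:
        for event in self.events:
            for token in event.split():
                self.bloom.add(token)


def candidate_segments(segments: List[Segment], token: str,
                       start: int, end: int) -> List[Segment]:
    """Keep segments that overlap the time range and may contain the token."""
    return [
        s for s in segments
        if s.start <= end and s.end >= start and s.bloom.might_contain(token)
    ]
```

Only the segments returned by `candidate_segments` would then need to be decompressed and scanned. A false positive costs one wasted scan; a segment that really contains the token is never skipped.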
The segments work really well on local or network-attached storage, and their size and nature make them an excellent fit for Object Storage.
What does a query pipeline typically look like?
- A query is issued against a Humio cluster. Humio identifies which segment files are relevant, based on the time range and scope of the query.
- The nodes that handle the query then fetch the relevant segment files for their part of the query job:
  - First, check on the local storage/cache for the segment.
  - Secondly, check the other nodes in the cluster for the segment.
  - Finally, fetch the segment from the object storage.
- Complete the scan and return the results to the query coordinator.
Fun fact: Because the object storage can be so efficient, you can tell Humio to always fetch missing segments from the object storage rather than the other nodes in the cluster as that’s sometimes the fastest way to do things.
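The fetch order described above, including the option to go straight to object storage, can be sketched roughly as follows. `SegmentFetcher`, its dictionary-backed stores, and the `prefer_object_storage` flag are hypothetical names for illustration and do not correspond to real Humio configuration or APIs.

```python
from typing import Dict, List, Tuple


class SegmentFetcher:
    """Looks for a segment locally, then on peers, then in object storage."""

    def __init__(self, local: Dict[str, bytes], peers: List[Dict[str, bytes]],
                 object_store: Dict[str, bytes],
                 prefer_object_storage: bool = False) -> None:
        self.local = local                  # this node's disk/cache
        self.peers = peers                  # other nodes in the cluster
        self.object_store = object_store    # bucket storage
        self.prefer_object_storage = prefer_object_storage

    def fetch(self, segment_id: str) -> Tuple[bytes, str]:
        """Return the segment bytes and the name of the source that served them."""
        # 1. Local storage/cache always wins.
        if segment_id in self.local:
            return self.local[segment_id], "local"

        # 2. Other nodes, unless configured to skip straight to object storage.
        if not self.prefer_object_storage:
            for peer in self.peers:
                if segment_id in peer:
                    return self._cache(segment_id, peer[segment_id]), "peer"

        # 3. Object storage as the final source.
        if segment_id in self.object_store:
            return self._cache(segment_id, self.object_store[segment_id]), "object_storage"

        raise KeyError(f"segment {segment_id!r} not found in any store")

    def _cache(self, segment_id: str, data: bytes) -> bytes:
        # Keep a local copy so the next query on this node is served locally.
        self.local[segment_id] = data
        return data


fetcher = SegmentFetcher(
    local={"seg-1": b"aaa"},
    peers=[{"seg-2": b"bbb"}],
    object_store={"seg-2": b"bbb", "seg-3": b"ccc"},
)
for seg in ("seg-1", "seg-2", "seg-3"):
    data, source = fetcher.fetch(seg)
    print(seg, "served from", source)
```

Caching the fetched segment locally is a design choice of this sketch, so that repeated queries over the same time range do not pay the network cost twice.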
For more information on the Humio architecture, see this blog post that summarizes a presentation given by Humio CTO Kresten Krab Thorup: How Humio leverages Kafka and brute-force search to get blazing-fast search results.
Humio has reconsidered the problem of ingesting and searching log data. Through a new approach and new technologies that are available, it has built a solution that scales efficiently and performs better than the systems that have come before it, often by more than an order of magnitude in terms of speed, storage, and total cost of ownership.