Sharding Kafka for Increased Scale and Reliability

July 19, 2021

| Eric Schow | Engineering & Tech

How our engineering team overcame scaling limitations and improved reliability in our high-throughput, asynchronous data processing pipeline Apache Kafka is a high-throughput, low-latency distributed messaging system with support for multiple, de-coupled producers and consumers and arbitrary replay of events on a per-consumer basis up to a configured retention period. Originally developed at LinkedIn, it was open-sourced in 2011 and is now considered by many to be the gold standard for building real-time data pipelines and streaming apps. As a rapidly growing organization that processes in excess of 1 trillion events per day, Kafka is a natural choice for managing event flow in our highly asynchronous microservices architecture. Like many companies, our organization relied on one very large Kafka cluster to process incoming events. But even Kafka has its limits — having built one of the largest cloud-based platforms in the world, CrowdStrike would know. Since our introduction of Kafka into our platform, our event throughput has grown by two orders of magnitude to a peak of over fifteen million events per second on our largest cluster. We had expanded our largest cluster to over 600 brokers, with 500 GB of storage per partition on more than 800 partitions on our largest topic.

At this volume, we had reached the scaling limits for individual topics on a single Kafka cluster.

Further horizontal scaling of individual clusters had become both problematic and impractical. As the number of brokers grows, the number of connections per broker required to manage replication becomes unwieldy and impacts performance. In addition, the operational cost of adding and removing brokers increases as the number of brokers grows and also with the size of each partition. So how do we find a solution that addresses the core challenges of scalability and reliability and also gets around tangential issues such as cost and efficiency? In this post, we explore how our team manages scaling limitations in Kafka without compromising reliability. We will also discuss some key considerations that organizations must review prior to implementing a similar solution.

Our Solution: The Benefits of a Sharded Approach

As noted above, our team needed to develop a solution that addressed two main requirements:

Overcome the scaling limitations of a single Kafka cluster
Maintain or improve the reliability of our high-throughput, highly available, asynchronous data processing pipeline

Our solution was to split our large existing cluster into multiple smaller clusters. For the purposes of this post, we refer to the smaller clusters as shards, though we acknowledge that they are not shards in the traditional sense. These are independent Kafka clusters, each with an identical set of topics, that we are conceptually sharding when we read and write data. The addition and removal of shards is managed via configuration. Our standard producer and consumer were updated to support one of three shard configurations:

Active (the shard is ready for reads and writes)
Read-only (producers ignore read-only shards)
Inactive (ignored by both producers and consumers)

In this way, new shards can be added seamlessly as throughput or as storage thresholds are reached, and can also be easily removed for maintenance with minimal operational overhead. Sharding afforded our team three main benefits:

Infinite scaling. By sharding our cluster, our team can enable nearly infinite horizontal scaling of Kafka production and consumption. The performance cost of attaching consumers to additional shards has been negligible, and the monetary cost of each additional shard actually decreases due to reduced overhead requirements (see “Cost” consideration in the next section).
Ease of maintenance. Once we shard the main cluster, the resulting shards have enough headroom to collectively absorb the storage and throughput of any one of the individual shards. This means that our team could easily complete maintenance tasks with no downtime by shifting the traffic from one shard to the others. Because of this capability, our team can avoid working on live shards, which is more time-consuming and has the potential to impact production by increasing latency and decreasing throughput.
Fault reliability. Because our solution incorporates multiple shards that could collectively absorb one of the other shards, we now have a quick failover in the event that one shard becomes unavailable for any reason. For example, if multiple hardware failures rendered a shard inoperable, we can take that shard out of production via a configuration change and seamlessly route the current traffic into the remaining shards. In the meantime, our organization can continue to operate as usual.

Key Considerations for a Sharded Solution

For organizations that are facing a similar scalability issue and may be considering a similar solution, we offer the following considerations:

Coordinating data. When shifting traffic from one shard to another, it is important to coordinate activation of the producer and the consumers to ensure you are not dropping data. As noted above, our solution classifies shards in one of three states: active, read-only and inactive. If one shard were to immediately go from active to inactive, recent data could be orphaned on the inactive shard. Our solution utilizes the read-only state to ensure that consumers can drain data before turning off reads.
Cost. Since this solution requires multiple shards with ample collective headroom to absorb another shard, there is a cost implication. The organization must run with enough capacity to support an additional storage and throughput factor of 1/(N -1), where N is the number of shards. For example, a four-shard setup would require running each shard at no more than ⅔ capacity to temporarily absorb ⅓ of the traffic of any inactive shard. Thus, as additional shards are added, the required overhead per shard decreases.
Data segregation. It is tempting to consider using shards to segregate data on some criteria, such as customer, source or other arbitrary qualifier. However, that goes against the goals of this project, as the data would no longer be mixable. Instead, we use different topics to segregate data. To support failover as described here, the data on one shard should be indistinguishable from the data on any other shard.

Like most organizations, our engineering team expects our data stream to continue to grow daily. By sharding our Kafka clusters as described above, we expect to be able to achieve nearly limitless horizontal scaling — all without compromising reliability or introducing any undue complexity and cost within the engineering organization.

Developing a creative and effective solution takes a team effort. We thank all members of our engineering team who contributed to this project, including: Daniel Smith, Brian Gallew, Duy Nguyen, Mezzi Sotoodeh and Roger Clermont. Have questions or comments? We’d love to hear how your organization is managing scaling limitations and the rationale for your solution. Tag @CrowdStrike on social media and share your thoughts.