If you are a frequent visitor to sites like Hacker News or Lobsters, or follow new software development technologies, you may have read about a messaging framework developed at LinkedIn called Kafka. Kafka is a popular distributed message queue that serves as the backbone of many applications and services throughout the world. I’ve been trying to learn as much as I can about Kafka for an upcoming project at work, so I thought I would write a post describing the basics of how it works and how it is used, in the hope that it helps others who are just getting started with Kafka, or who would simply like to learn more about it.
Note: Kafka is by no means Ruby specific. It is actually written in Java and Scala but has clients in many languages for interoperability. The second part of this series will cover specific Ruby clients to allow you to use Kafka in your Ruby applications.
What... is it?
As described in the Kafka docs, Kafka is a high-throughput distributed messaging system rethought as a commit log. Each commit, or message in Kafka’s terminology, is immutable and represents a point in history that cannot change. Incoming messages are written to what is called the log. More on that later... but first let’s talk about why and when you would want to use Kafka in your applications.
As mentioned earlier, Kafka is a distributed messaging queue. Distributed messaging queues are mostly used to decouple producers and processors of data and have become quite popular in the microservice world where each service should have a single responsibility. In the past, when monolithic architectures were all the rage, there wasn’t as great of a need for messaging systems since most message passing was simply a call to another method which passed data in the application’s memory.
Now, with the trend toward more distributed architectures, data still needs to flow between parts of the application; however, it can no longer be transmitted in memory from one component to another, as each component (or service) may live on a different machine in a different part of the world. This is where messaging systems like Kafka come in. They act as the transportation system for this data (or messages) and also usually include several features that help make all of this distributed communication easier to manage.
The Kafka docs go on to describe many of Kafka’s use cases, which hopefully show you the versatility of Kafka.
In short, Kafka can be used for:
- Messaging - useful to decouple the processing of data from the production of that data.
- Website Activity Tracking - the original use case for Kafka while being developed at LinkedIn.
- Metrics - aggregating metric data like what is produced from software such as StatsD.
- Log Aggregation - since Kafka is described as a distributed commit log, it makes sense that it can also be used to store application log data.
- Stream Processing - data processing where data must be consumed, aggregated, and/or transformed to provide value.
- Event Sourcing - an interesting style of system design described by Martin Fowler in which an application’s state is made up of a collection of events.
Before we go too much further in depth into how Kafka works, I would first like to define some key terminology that makes up the Kafka architecture.
Producers are rather simple: they are the processes that publish messages to the Kafka broker, the cluster of servers that makes up the Kafka system.
These messages can be serialized by the producer in a wide array of formats; common choices include plain strings, JSON, Avro, and Protocol Buffers.
When sending messages to the Kafka broker, producers must specify the topic to which to send these messages.
Here’s an example producer that wraps the Poseidon Kafka client that we will cover more in part 2:
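Below is a minimal sketch of such a wrapper. The broker address, topic name, and the `ClickProducer` class itself are placeholders for illustration; Poseidon’s actual API (`Poseidon::Producer` and `Poseidon::MessageToSend`) gets a fuller treatment in part 2.

```ruby
require 'json'

# A small wrapper that publishes click events to a 'clicks' topic.
class ClickProducer
  TOPIC = 'clicks'.freeze

  # Serialize a click event into the JSON payload we send to Kafka.
  def self.encode(event)
    JSON.generate(event)
  end

  def initialize(brokers = ['localhost:9092'])
    require 'poseidon' # gem install poseidon
    @producer = Poseidon::Producer.new(brokers, 'click_producer')
  end

  # Publish a single click event to the 'clicks' topic.
  def publish(event)
    message = Poseidon::MessageToSend.new(TOPIC, self.class.encode(event))
    @producer.send_messages([message])
  end
end
```

With a broker running locally, usage would look like `ClickProducer.new.publish('user_id' => 0, 'page' => '/home')`.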
Topics are one of the main abstractions when working with Kafka. Topics represent the feeds of messages inside Kafka. Think of topics as a broadcast channel where messages are sent to be consumed (more on this later). Topics have names in order to differentiate themselves from one another.
For example, inside a Kafka system designed to keep track of web analytics data, there may be a topic named ‘clicks’ where all of the user-generated click messages are sent. There can be any number of topics in a Kafka system; however, as we’ll see shortly, there is another level at which messages are distributed: partitions.
For each topic, Kafka maintains one or more partitioned logs, or partitions. Each partition maintains an ordered sequence of messages that can only be appended to. Note the importance of ordering, as this is one of the key benefits of using a system like Kafka: messages produced to a given partition are guaranteed to maintain the order in which they were received. This can be very useful for our ‘clicks’ topic if, for example, we choose to partition based on user_id. Kafka hashes the message key, which guarantees that user 0 will always map to partition A, and user 1 to partition B. This means that if we want to consume all of user 0’s data, we just need to consume partition A and are guaranteed that the messages are in the order in which they were received.
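The key-to-partition mapping can be sketched in a few lines. Real clients ship their own partitioners; CRC32 is used here only because it is a stable, process-independent hash, and the partition count of 3 is an assumption for illustration:

```ruby
require 'zlib'

NUM_PARTITIONS = 3

# Hash the message key (a user_id here) modulo the partition count, so the
# same user always lands on the same partition, preserving per-user ordering.
def partition_for(user_id)
  Zlib.crc32(user_id.to_s) % NUM_PARTITIONS
end
```

Because the hash is deterministic, `partition_for(0)` returns the same partition on every call, which is exactly the guarantee that makes per-key ordering possible.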
Each message is assigned an offset in the partition. The three partitions below each have a different number of messages as shown by their offsets:
Partitions are also important in that they allow Kafka topics to scale across multiple machines. This is a requirement because in many cases the amount of data produced to a topic may not be able to fit on a single machine. Kafka flushes all messages immediately to disk once they are received, and per the Kafka docs: “The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time”, which means this data could grow very large for a particularly chatty topic.
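That “configurable period of time” is set with broker-level (or per-topic) settings. As a sketch, in the broker’s `server.properties` the standard retention knobs look like this (the values shown are just examples):

```properties
# Keep messages for 7 days...
log.retention.hours=168
# ...or cap each partition's log at 1 GB, whichever limit is hit first.
log.retention.bytes=1073741824
```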
In this example, our ‘clicks’ topic is actually split up into three partitions on three different machines. If you think about this for a moment, this design is actually very powerful: since each partition of a topic may reside on a different machine, our consumers can scale linearly as well. This also means that the number of partitions caps the maximum parallelism of consumers, as you cannot have more consumers than partitions.
Consumers are the processes that subscribe to topics and process messages sent. More specifically consumers consume messages from topic partitions. Because consumers keep track of which partitions they are consuming from, they are also responsible for keeping track of the offset of the last message consumed within the log. This gives the consumer a lot of flexibility in that the consumer can consume messages in near real-time by always consuming from the latest offset, or it can replay previously consumed messages by setting the offset to an earlier one.
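A consumer of a single partition can be sketched with Poseidon’s `PartitionConsumer`, which we will cover properly in part 2. The host, port, topic, and partition number below are placeholders; the offset argument is where the flexibility described above shows up:

```ruby
require 'json'

# Read one partition of the 'clicks' topic, printing each message's offset.
def consume_clicks(from = :earliest_offset)
  require 'poseidon' # gem install poseidon
  consumer = Poseidon::PartitionConsumer.new(
    'click_consumer',  # client id
    'localhost', 9092, # broker
    'clicks',          # topic
    0,                 # partition
    from               # :earliest_offset replays the log; :latest_offset tails it
  )
  loop do
    consumer.fetch.each do |message|
      # The consumer tracks its own position in the log via these offsets.
      puts "offset #{message.offset}: #{JSON.parse(message.value)}"
    end
  end
end
```

Passing `:latest_offset` gives near-real-time consumption, while `:earliest_offset` (or a specific numeric offset) replays previously consumed messages.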
Both consumers and producers can be written in any language that has a Kafka client written for it. This means that your producers could be written in Java, while your consumers are written in Ruby or Go (or vice-versa).
There is still much more to cover when it comes to an intro to Kafka. In the second part of this post, we will cover some of the major benefits of Kafka, such as its durability and fault tolerance. We will also go over how consumers are organized into consumer groups and how this allows Kafka to operate as either a queue or a pub-sub system. Finally, I will go over the most popular Kafka clients for Ruby and talk about how Kafka might be able to fit into your next application.
Thanks for taking the time to read this first part. As always, if you have any questions or comments feel free to reach out in the comments, on Twitter or via email.