Apache Kafka provides developers with a uniquely powerful, open source and versatile distributed streaming platform – but it also has some rather complex nuances to understand when trying to store and retrieve data in your preferred order.

Kafka captures streaming data by publishing records to a category or feed name called a topic. Kafka consumers can then subscribe to topics to retrieve that data. For each topic, the Kafka cluster creates and maintains a partitioned log. Kafka sends every message bearing the same key to the same partition, storing each message in the order it arrives. Each partition therefore functions as a structured commit log: a sequence of records that is both ordered and immutable.

As Kafka adds each record to a partition, it assigns a unique sequential ID called an offset.

[Image: Offset - unique sequential ID]

Because Kafka always delivers data to consumers in the order it was stored in the partition, retrieving data from a single partition in your preferred order is simple: just store it in that order in the first place. Things get significantly more complicated when a topic has more than one partition, however, because Kafka maintains no total order of records across partitions.

[Image: Anatomy of a topic]

If your application requires a total order over records (and the resulting limit of a single consumer process per consumer group is acceptable), a topic with just one partition might be your best solution. Most applications don’t need that level of control, however, and are better served by relying on per-partition ordering and keying to control the order of retrieved data.

Using Kafka Partitioning and Keying to Control Data Order

The following example demonstrates the results of storing and retrieving data in multiple partitions.

To begin, let’s make a topic called “my-topic” that has 10 partitions:
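If you’re following along in code, here’s a minimal sketch of creating the topic with the Kafka Java AdminClient. The broker address localhost:9092 and the replication factor of 1 are assumptions for a local test cluster:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // "my-topic" with 10 partitions and a replication factor of 1
            NewTopic topic = new NewTopic("my-topic", 10, (short) 1);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```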

[Image: Kafka topic with 10 partitions]

Now let’s create a producer that sends the numbers 1 to 10 to the topic in order:
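A sketch of that producer using the Java client. The records are sent without keys here; note that newer client versions use a “sticky” partitioner for unkeyed records, so the exact spread across partitions can vary:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class NumbersProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 1; i <= 10; i++) {
                // No key, so the partitioner spreads records across partitions
                producer.send(new ProducerRecord<>("my-topic", String.valueOf(i)));
            }
        } // close() flushes any buffered records
    }
}
```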

[Image: Kafka producer sending the numbers 1 to 10 to the topic in order]

The topic stores these records within its ten partitions. Next, we’ll have the consumer read the data back from the topic, starting at the beginning:
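A consumer sketch for reading it back. The group id my-group is arbitrary, and auto.offset.reset=earliest makes a new group start from the beginning of each partition:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NumbersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            for (int i = 0; i < 5; i++) { // poll a few times for the demo
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```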

[Image: Consumer reading the data back from the topic, starting at the beginning]

The data comes back out of order. What’s happening is that the Kafka consumer retrieves the data in round-robin fashion from all ten partitions:

[Image: Kafka consumer retrieving data in round-robin fashion]

Let’s switch to a new example, where we create a topic with a single partition:
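Reusing the props and imports from the AdminClient sketch above, only the partition count changes (the topic name my-single-topic is just an illustration):

```java
try (Admin admin = Admin.create(props)) {
    // One partition means the whole topic is a single, totally ordered log
    admin.createTopics(List.of(new NewTopic("my-single-topic", 1, (short) 1))).all().get();
}
```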

[Image: Kafka topic with a single partition]

And send the same data, 1 to 10:

[Image: Sending the data 1 to 10]

Now when we retrieve the data, it remains in the order that it was originally sent:

[Image: Data retrieved in the order it was sent]

This example demonstrates that Kafka does indeed guarantee data order within a partition.

Now let’s explore keying by adding keys to producer records. We’ll create four messages, each with one of four different keys (in our example: Costco, Walmart, Target, and Best Buy), and send them to a topic with two partitions:
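Here’s what that looks like in producer code; the topic name my-keyed-topic is an assumption, standing in for a topic created with two partitions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String store : List.of("Costco", "Walmart", "Target", "Best Buy")) {
                // The key determines the partition: same key, same partition
                producer.send(new ProducerRecord<>("my-keyed-topic", store, store + " message 1"));
            }
        }
    }
}
```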

[Image: Adding keys to producer records]

Each key is hashed to determine its partition; in this case, the four keys happen to distribute evenly across the two partitions:
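For keyed records, the Java client’s default partitioner picks the partition from a murmur2 hash of the serialized key, modulo the partition count. A sketch of that computation, using the hash utility that ships in the kafka-clients jar (whether four particular keys split evenly comes down to their hash values):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class WhichPartition {
    public static void main(String[] args) {
        int numPartitions = 2;
        for (String key : new String[] {"Costco", "Walmart", "Target", "Best Buy"}) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8); // StringSerializer uses UTF-8
            // Same formula the default partitioner applies to keyed records
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.println(key + " -> partition " + partition);
        }
    }
}
```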

[Image: Keys hashed and distributed evenly across the partitions]

Let’s see what happens when we send four more messages, using the same keys:

[Image: Four new messages using the same keys]

Kafka routes each new message to the same partition as the earlier messages that share its key:

[Image: Further keyed messages routed to the partitions by key]

Within each partition, the records remain stored in the order they were sent.

Now we’ll add more partitions to the topic, which can give a healthier balance of data across partitions:
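Partition counts can be raised (never lowered) on an existing topic through the AdminClient; the target count of 3 here is illustrative:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Grow the topic to 3 partitions in total
            admin.createPartitions(Map.of("my-keyed-topic", NewPartitions.increaseTo(3))).all().get();
        }
    }
}
```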

[Image: Additional Kafka partitions]

Triggering a rebalance event will then redistribute the records across the partitions (strictly speaking, Kafka doesn’t move records that were already written; the new key-to-partition mapping applies to records produced from this point on):

[Image: Records redistributed across the partitions after a rebalance]

As we can see, records with the Best Buy key now land in Partition 3, and the data is otherwise still neatly organized by key. We’ll demonstrate this further by adding four more messages:

[Image: Records balanced after adding more partitions]

New records are organized into partitions according to their keys. Given that we have four keys and the partitions are unbalanced, it’s logical to add another partition and trigger a rebalance:

[Image: Adding another partition and triggering a rebalance]

The data sets, partitioned by key, are now in a healthy balance.

How to Make Sure Kafka Always Sends Data in the Right Order

A number of circumstances can lead to Kafka data order issues, from broker and client failures to retried sends that duplicate or reorder records. Dealing with these issues requires a thorough understanding of how the Kafka producer functions.

Here’s an overview:

[Image: Overview of how the Kafka producer functions]
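As one concrete safeguard on the producer side, the Java client can be configured for idempotent delivery, which prevents retries from duplicating or reordering records within a partition. A minimal sketch of the relevant settings:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderSafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Idempotence preserves per-partition order even when sends are retried
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5"); // must be <= 5
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ...send keyed records as in the earlier examples
        }
    }
}
```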