Apache Kafka cluster: Key components and building your first cluster
What is an Apache Kafka cluster?
An Apache Kafka cluster is a distributed system for handling large volumes of real-time data streams. It consists of multiple Kafka brokers that work together to manage and distribute data across the cluster. Each broker in the cluster stores a portion of the data and participates in balancing the load to ensure high availability and reliability.
Kafka clusters are typically deployed to enable data ingestion, storage, and processing across various applications and systems. The main components of a Kafka cluster include producers, consumers, topics, and partitions.
Producers publish data to Kafka topics, which are logical channels for organizing data. Consumers subscribe to these topics and process the data accordingly. Topics are divided into partitions, which allow Kafka to distribute the data across multiple brokers, enabling parallel processing and efficient data management.
Features of Kafka clusters
Apache Kafka clusters typically offer the following capabilities:
Scalability
Kafka clusters are highly scalable, capable of handling vast amounts of data by adding more brokers to the cluster. This linear scalability ensures that the performance of the cluster can grow with the increasing demands of data ingestion and processing. Kafka achieves this by partitioning topics, allowing data to be distributed and processed in parallel across multiple brokers.
As data volume or the number of client applications increases, additional brokers can be added to distribute the load. This scalability also extends to consumer groups, which can scale out to read from different partitions simultaneously, further enhancing throughput and processing efficiency. By decoupling producers and consumers, Kafka allows each to scale independently.
Fault tolerance
Kafka clusters replicate data across multiple brokers to ensure that there is no single point of failure. Each partition of a topic can have multiple replicas, distributed across different brokers. If a broker fails, the remaining brokers in the cluster can continue to serve the data, ensuring high availability.
This replication mechanism provides reliability, making Kafka suitable for critical data streaming applications. Kafka uses a leader-follower model for replication. The leader handles all read and write requests for the partition, while followers replicate the data. If a leader broker fails, one of the followers automatically takes over as the new leader, minimizing downtime and data loss.
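To illustrate, the following Java sketch uses Kafka's AdminClient to print the leader and follower replicas for each partition of a topic; the broker address localhost:9092 and the topic name my-events are assumptions for this example.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import java.util.Collections;
import java.util.Properties;

public class DescribeReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Describe the topic and print leader/replica placement for each partition
            TopicDescription description = admin
                .describeTopics(Collections.singletonList("my-events"))
                .allTopicNames().get()
                .get("my-events");
            description.partitions().forEach(p ->
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                    p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}

Run against a healthy multi-broker cluster, this would show one leader per partition, with the remaining replicas listed in the in-sync replica (ISR) set.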
Durability
Durability in Kafka is achieved through its distributed log storage. Messages are persisted to disk as soon as they are produced, and they remain available until explicitly deleted. This ensures that data is not lost even in the event of broker failures. Kafka’s use of a commit log guarantees that data can be re-read and replayed, providing a reliable mechanism for long-term data retention and consistency.
Data durability is further enhanced by configurable retention policies, allowing users to specify how long data should be stored. Kafka also supports compaction, which retains only the latest value for each key within a topic, ensuring efficient storage management without sacrificing data integrity.
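As a sketch of how these settings are applied per topic, the Java AdminClient can create a topic with an explicit retention configuration. The topic name, partition count, replication factor, and broker address below are illustrative assumptions.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateRetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 6 partitions, replication factor 3, 7-day retention
            NewTopic topic = new NewTopic("user-activity", 6, (short) 3)
                .configs(Map.of(
                    "retention.ms", "604800000",   // delete segments older than 7 days
                    "cleanup.policy", "delete"));  // use "compact" to keep only the latest value per key
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}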
Real-time stream processing
Kafka enables the continuous ingestion and processing of data streams with minimal latency. It supports various stream processing frameworks, such as Apache Flink and Apache Spark, which can consume data from Kafka topics and perform complex transformations and computations in real time.
This capability makes Kafka suitable for applications that require immediate data insights and responsiveness, such as fraud detection, monitoring systems, and event-driven architectures. Kafka Streams, a stream processing library, allows developers to build data processing pipelines directly within the Kafka ecosystem. It simplifies the creation of stateful and stateless transformations, aggregations, and windowed operations.
Kafka cluster architecture and components
Here’s an overview of the main concepts in the Kafka cluster architecture.
Topics
Topics in Kafka serve as logical channels for organizing and categorizing data streams. Each topic represents a particular stream of data, such as logs, transactions, or events, and helps in separating different kinds of data within the Kafka ecosystem. Topics are further divided into partitions, which are the fundamental units of parallelism in Kafka.
Partitions enable Kafka to scale horizontally by distributing the data load across multiple brokers. This partitioning allows consumers to read data in parallel, improving throughput and performance. Each partition in a topic is ordered, and messages within a partition are assigned unique offsets, which act as sequential identifiers, ensuring that consumers can process messages in the correct order and resume processing in the event of a failure.
Kafka topics support various configurations, including replication factors and retention policies, which define how long data is stored before it is deleted. This allows users to tailor their Kafka topics to specific use cases.
Brokers
Brokers are responsible for handling all data storage and retrieval operations. Each broker in a Kafka cluster manages multiple partitions from various topics, distributing the data load evenly across the cluster. This distribution is crucial for achieving Kafka’s high scalability and fault tolerance.
Brokers communicate with producers to receive data and with consumers to deliver data. They also handle replication of partitions to ensure data durability and availability. Each partition has one broker acting as the leader, responsible for all read and write operations, while the other brokers act as followers, replicating the data from the leader.
Brokers maintain metadata about the cluster state, such as which partitions they lead or follow, and coordinate with KRaft (or ZooKeeper in legacy deployments) to manage this metadata. This helps maintain the cluster’s health and performance. Kafka brokers also support configurable settings for storage, such as log retention policies, compression, and quota management.
KRaft
KRaft (Kafka Raft) is Kafka’s built-in consensus protocol, which replaces ZooKeeper for managing metadata in the cluster. This transition improves Kafka’s scalability and simplifies its architecture by eliminating the need for an external system like ZooKeeper. KRaft handles leader elections, manages metadata, and ensures data consistency directly in Kafka brokers.
In a KRaft-based cluster, the metadata is stored in an internal Kafka topic, and the controller nodes communicate through the Raft consensus algorithm. Raft ensures that there is always an active controller, which is responsible for managing the metadata and coordinating with the follower controllers to maintain consistency.
This internal handling of metadata allows Kafka to scale more efficiently, as it reduces the operational complexity associated with managing a separate ZooKeeper cluster. Additionally, KRaft improves failover times by simplifying the process of leader election when a broker fails, improving the resilience of Kafka clusters.
Producers
Producers are clients that send data to Kafka topics. They publish records to the specified topic, where the data is partitioned and stored across the brokers. Producers can handle high throughput with low latency, making them suitable for real-time data ingestion applications. They support configuration options, such as specifying the partitioning logic, batching multiple records into a single request, and using compression to optimize performance.
Producers ensure data delivery reliability through acknowledgment settings. They can be configured to wait for acknowledgments from brokers before considering a record as successfully sent. This acknowledgment mechanism can be adjusted to wait for acknowledgments from the leader broker or from all replicas.
Producers also handle retries and error handling, ensuring that temporary network issues or broker failures do not result in data loss. By decoupling data production from data consumption, Kafka producers enable seamless integration with various data sources, such as log files, databases, and real-time event streams.
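A minimal Java producer sketch makes this concrete; the broker address, the topic my-events, and the record key and value are assumptions for illustration.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");              // wait for all in-sync replicas to acknowledge
        props.put("linger.ms", "5");           // allow a short batching delay for throughput
        props.put("compression.type", "lz4");  // compress batches on the wire

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("my-events", "order-42", "order created");  // hypothetical key and value
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();   // delivery failed after the producer's internal retries
                } else {
                    System.out.printf("stored at partition=%d offset=%d%n",
                        metadata.partition(), metadata.offset());
                }
            });
        }   // close() flushes any buffered records before exiting
    }
}

Setting acks to "1" instead of "all" would acknowledge after only the leader has written the record, trading some durability for lower latency.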
Consumers
Consumers are clients that read data from Kafka topics. They subscribe to one or more topics and process the data in real time. Consumers track their position in each partition using offsets, which they commit to Kafka to mark the messages they have processed. This allows consumers to resume processing from their last committed offset in the event of a failure.
Kafka supports consumer groups, allowing multiple consumers to share the load by processing different partitions of a topic. This provides scalability and fault tolerance, as each consumer in a group can operate independently. If a consumer fails, its partitions are automatically reassigned to other consumers in the group, ensuring that data processing continues without interruption.
Consumers can be configured to read data in real time or to read historical data for batch processing. Kafka provides client libraries for different programming languages, enabling developers to integrate Kafka consumers into a range of applications.
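Below is a minimal Java consumer sketch that joins a consumer group and commits offsets after processing; the group id, topic name, and broker address are assumptions for illustration.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "my-events-processors");      // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");            // commit offsets only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();   // mark these offsets as processed for this group
            }
        }
    }
}

Starting a second instance with the same group.id would trigger a rebalance, after which each instance reads from its own subset of the topic's partitions.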
Tips from the expert
Andrew Mills
Senior Solution Architect
Andrew Mills is an industry leader with extensive experience in open source data solutions and a proven track record in integrating and managing Apache Kafka and other event-driven architectures.
In my experience, here are tips that can help you better manage and optimize your Apache Kafka cluster:
- Optimize partition distribution: Ensure that partitions are evenly distributed across all brokers. Imbalances can lead to some brokers being overworked while others remain underutilized. Use tools like kafka-reassign-partitions.sh to redistribute partitions if needed.
- Tune producer and consumer settings: Fine-tune producer and consumer configurations like batch size, linger time, and fetch size based on your workload characteristics to optimize performance and reduce latency.
- Use schema registry: Integrate a schema registry to manage the schemas of the data being produced and consumed. This helps ensure data compatibility and provides a centralized location for schema evolution.
- Monitor disk usage closely: Kafka’s performance is heavily dependent on disk I/O. Use monitoring tools to keep an eye on disk usage and set up alerts for any spikes. Ensure that your disks are fast and capable of handling the write/read loads.
- Adjust retention policies: Carefully configure your data retention policies to balance between data availability and storage cost. Retention settings should align with the business requirements for data availability and the capacity of your storage system.
Tutorial: Setting up Kafka clusters
This guide provides a step-by-step process to get a basic, single-broker Kafka environment up and running. It is adapted from the official Kafka quickstart documentation.
Download Kafka
First, download and extract the latest Kafka version:
$ tar -xzf kafka_2.13-3.7.1.tgz
$ cd kafka_2.13-3.7.1
Launch the Kafka environment
Ensure the local environment has Java 8 or later installed. Kafka can be started with either ZooKeeper or KRaft; ZooKeeper is a legacy component, and KRaft is recommended for new deployments. KRaft-based Kafka can be run from the downloaded files or from a Docker image. To start Kafka with KRaft from the downloaded files:
- Generate a Universally Unique Identifier (UUID) for the cluster:
$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
- Format the log directories:
$ bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
- Start the Kafka server:
$ bin/kafka-server-start.sh config/kraft/server.properties
Create a topic to store events
To create a new topic that can store events, run the following in a new terminal session:
$ bin/kafka-topics.sh --create --topic my-events --bootstrap-server localhost:9092
To view details about the new topic:
$ bin/kafka-topics.sh --describe --topic my-events --bootstrap-server localhost:9092
Write events into a topic
Run the console producer client to write events into the topic:
$ bin/kafka-console-producer.sh --topic my-events --bootstrap-server localhost:9092
Enter the events:
This is the first example event
This is the second example event
To stop the producer client, use Ctrl-C.
Read the events
Run the console consumer client to read the events:
$ bin/kafka-console-consumer.sh --topic my-events --from-beginning --bootstrap-server localhost:9092
To stop the consumer client, use Ctrl-C.
Import and export data using Kafka Connect
Kafka Connect allows users to integrate existing systems with Kafka. Edit the config/connect-standalone.properties file to add the plugin path:
plugin.path=libs/connect-file-3.7.1.jar
To generate seed data, use:
echo -e "foo\nbar" > test.txt
On Windows, use:
echo foo> test.txt
echo bar>> test.txt
Start two connectors in standalone mode:
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
Verify the data pipeline by examining the contents of the output file:
more test.sink.txt
It is also possible to read the data in the topic using the console consumer:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
Process events with Kafka Streams
Kafka Streams can process data stored in Kafka. Here’s a basic example of filtering events based on a condition:
KStream<String, String> events = builder.stream("my-events");
KStream<String, String> filteredEvents = events
    .filter((key, value) -> value.contains("example"));
filteredEvents.to("filtered-events-topic", Produced.with(Serdes.String(), Serdes.String()));
This example filters the events in the my-events topic to include only those messages that contain the word example and then writes the filtered events to a new topic named filtered-events-topic.
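The snippet assumes a StreamsBuilder named builder and the usual Kafka Streams setup around it. A minimal harness to run the topology might look like the sketch below; the application id and broker address are assumptions.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class FilterEventsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-filter-app");    // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // ... define the filtering topology shown above on this builder ...

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));   // close cleanly on shutdown
    }
}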
Terminate the Kafka environment
Use Ctrl-C to stop the producer and consumer clients, then to stop the Kafka broker. If a ZooKeeper server was started instead of KRaft, stop it as well.
To delete any data from the local Kafka environment, run:
$ rm -rf /tmp/kafka-logs /tmp/zookeeper /tmp/kraft-combined-logs
Harnessing the power of Apache Kafka with Instaclustr managed platform
Instaclustr, a leading provider of managed open source data platforms, offers a powerful and comprehensive solution for organizations seeking to leverage the capabilities of Apache Kafka. With its managed platform for Apache Kafka, Instaclustr simplifies the deployment, management, and optimization of this popular distributed streaming platform, providing numerous advantages and benefits for businesses looking to build scalable and real-time data pipelines.
Instaclustr takes care of the infrastructure setup, configuration, and ongoing maintenance, allowing organizations to quickly get up and running with Apache Kafka without the complexities of managing the underlying infrastructure themselves. This streamlines the adoption process, reduces time-to-market, and enables organizations to focus on developing their data pipelines and applications.
Instaclustr’s platform is designed to handle large-scale data streaming workloads, allowing organizations to seamlessly scale their Kafka clusters as their data needs grow. With automated scaling capabilities, Instaclustr ensures that the Kafka infrastructure can handle increasing data volumes and spikes in traffic, providing a reliable and performant streaming platform. Additionally, Instaclustr’s platform is built with redundancy and fault tolerance in mind, enabling high availability and minimizing the risk of data loss or service disruptions.
Organizations can leverage the expertise of Instaclustr’s engineers, who have deep knowledge and experience with Kafka, to optimize their Kafka clusters for performance, reliability, and efficiency. Instaclustr provides proactive monitoring, troubleshooting, and performance tuning, ensuring that organizations can effectively utilize Kafka’s capabilities and identify and resolve any issues promptly.
Instaclustr follows industry best practices and implements robust security measures to protect sensitive data and ensure compliance with data privacy regulations. Features such as encryption at rest and in transit, authentication and authorization mechanisms, and network isolation help organizations safeguard their data and maintain a secure Kafka environment.