Apache Flink® and Apache Kafka® Streams are two names that continually pop up when talking about data streaming and stream processing, but at times it’s not exactly clear how these technologies are related, if at all.
This blog will help clarify how these technologies work, their pros and cons, and what use cases are the most appropriate for each.
What is Apache Flink?
Apache Flink is a stream processing framework designed to operate over huge amounts of data, distributing the load across multiple servers and processing it in parallel.
Flink is designed to run at massive scale, processing enormous amounts of real-time data while remaining fault tolerant and still guaranteeing “exactly once” event handling. It can be used for real-time event-driven applications, data analytics, rules engines, and machine learning, to name a few.
Apache Flink runs as a dedicated cluster comprising a JobManager and a number of TaskManagers, or workers. A Flink application is written in Java, Scala, Python, or SQL and then submitted to the JobManager, which schedules the work across the available TaskManagers.
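To make this concrete, here is a minimal sketch of a Flink DataStream job in Java. It assumes the `flink-streaming-java` dependency is on the classpath; the job name and data are illustrative only.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UppercaseJob {
    public static void main(String[] args) throws Exception {
        // The execution environment is the entry point for a Flink job.
        // When this program is submitted to a cluster, the JobManager
        // schedules the resulting pipeline across available TaskManagers.
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "kafka", "streams")
           .map(s -> s.toUpperCase())  // a simple parallel transformation
           .print();                   // write results to the task logs

        // Nothing runs until execute() is called; Flink builds the
        // dataflow graph lazily and submits it as one job.
        env.execute("uppercase-job");
    }
}
```

The same program can run locally in an IDE or be packaged as a JAR and submitted to a cluster; only the execution environment changes.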
Pros of Apache Flink
The biggest benefit of using Flink as your stream processing solution is the throughput and low latency that a correctly configured cluster can provide. Flink has been designed to perform large-scale, parallel stream processing and is able to scale to satisfy the most demanding requirements.
Tasks in Flink are fault tolerant and able to recover from failure without sacrificing accuracy.
Flink has been designed to operate in any modern environment. It has native integrations with systems such as Hadoop™ YARN, Apache Mesos™ and Kubernetes in addition to running as a standalone cluster.
Additionally, Flink has built a library of support for input sources and output sinks. This allows it to integrate into most stream processing architectures, including support for databases, filesystems and other event streaming technologies like Apache Kafka.
Cons of Flink
Flink clusters can be difficult to configure; setting one up requires dedicated infrastructure that needs to be maintained throughout the life of its application. Monitoring and support are another layer of complexity that needs to be managed by a dedicated team.
To get the most out of a Flink application, users generally need specialized knowledge of the technology. This can make adoption of Flink slower compared to some of its counterparts, as developing these skills takes time if they’re not readily available on your team.
What is Kafka Streams?
Kafka Streams is a stream processing client library used in conjunction with Apache Kafka, with the latter being used as the source and destination for the data.
A Kafka Streams application is written in Java and is deployed as a standalone application within your environment. Best of all, it requires no additional hardware outside of the Apache Kafka cluster.
The Kafka Streams API is implemented on top of the standard Kafka client. It leverages the concept of the Kafka consumer group, allowing Streams applications to scale horizontally with the available partitions in your Kafka topic.
Streams applications are fault tolerant, and the state of each running task is maintained. This allows a Streams application to recover from failure seamlessly.
By using the streams API, a Kafka user can write concise, fluent code that can perform calculations and transformations on the data stored in a Kafka cluster.
Pros of Kafka Streams
In environments where Apache Kafka is already in use, implementing a Streams application is a straightforward task. Streams is built into the Kafka client library, and any application that already consumes data from a Kafka cluster can also use the Streams API.
As mentioned, Apache Kafka Streams applications work within the same framework as regular Kafka applications, and do not require any additional data configuration or partitioning.
Cons of Kafka Streams
One key downside is that the Streams API is limited to Java; other JVM languages are supported only through the Java API.
When writing Kafka Streams applications, developers need to ensure they are sufficiently decoupled from other business logic. The CPU and scaling requirements for a stream processing application can be vastly different from those of a simple Kafka consumer, so careful isolation should be considered.
Key differences between Flink and Kafka Streams
Now that we’ve had a look at some of the key features of Apache Flink and Kafka Streams, let’s analyze how they are different in the following dimensions:
Architecture and Deployment
Architecture and deployment are where the two technologies differ most.
An Apache Flink cluster requires a dedicated set of hardware infrastructure, comprising JobManagers and workers. This pool of resources can be scaled as required but will require additional maintenance from the infrastructure team managing the deployment.
In contrast, the Kafka Streams API is built into the Apache Kafka client and can run in any environment that is already consuming from a Kafka cluster.
Kafka Streams does require an Apache Kafka cluster; if your company doesn’t already run one, standing it up carries many of the same infrastructure requirements as an Apache Flink cluster.
While Flink applications support Java, Scala, Python, and SQL, the Kafka Streams API is only available in Java.
Accessibility and Complexity
Apache Flink offers three different APIs, each tuned to a different level of conciseness and expressiveness, allowing users to be as complex as required. They are:
- ProcessFunctions: This low-level API allows users to build complex stream processing pipelines out of the basic building blocks within Flink and gives fine-grained control.
- DataStream: Contains several built-in operations for stream processing, like data windowing, which makes building stream processing applications straightforward and accessible.
- SQL & Table: Built specifically to enable data engineers and data scientists to operate within the Flink ecosystem without having to use programming languages like Java or Python. Users of this API can write fully SQL-compliant queries over streaming or batch data.
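As a brief sketch of the SQL & Table API, the following Java program registers a stream-backed table and runs a standard SQL aggregation over it. It assumes the `flink-table-api-java` and `flink-table-planner` dependencies; the table definition and the `datagen` test connector are illustrative placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ProductRevenueQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Define a table backed by a stream. In production the WITH
        // clause would point at a real source such as a Kafka topic;
        // 'datagen' simply generates random rows for demonstration.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  product_id STRING," +
            "  amount     DOUBLE" +
            ") WITH ('connector' = 'datagen')");

        // A continuously updating aggregation expressed in plain SQL,
        // with no stream processing code required.
        tEnv.executeSql(
            "SELECT product_id, SUM(amount) AS total_revenue " +
            "FROM orders GROUP BY product_id").print();
    }
}
```

In practice the SQL statements themselves are often the whole program, submitted through Flink’s SQL client rather than wrapped in Java at all.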
The Kafka Streams API is a fluent client library offered for Java as part of the Apache Kafka client library. It is comparable to the Flink DataStream API: the two share a fluent design and common operations over streaming data, such as map(), reduce(), and aggregate().
When to use Kafka Streams instead of Flink?
If you are an existing user of Apache Kafka and want to build fault-tolerant stream processing applications, then the Kafka Streams API is an obvious choice.
The Streams API is an easy-to-use, fit-for-purpose tool that can be used to build large-scale, real-time processing applications and will require no additional infrastructure.
However, if you are not already using Apache Kafka, then the choice becomes more nuanced.
If you aren’t using any message broker or data streaming platform, Apache Kafka is an excellent open source, free-to-operate solution that can run at any scale and is used across the industry. Combined with the Kafka Streams API, it is a low-cost entry to the world of stream processing applications. (And once Apache Kafka is being used in your organization, you will soon wonder how you lived so long without it.)
Finally, if you decide your Kafka Streams application is no longer the right solution, Apache Flink can be used in conjunction with an Apache Kafka cluster to source the data for your Flink application.
Final Thoughts
Both Apache Flink and the Kafka Streams API are fault-tolerant stream processing engines that can be leveraged to build highly performant solutions at scale.
The question of which is right for you comes down to how complex and mature your stream processing requirements are and your ability to onboard new technologies and infrastructure into your organization.
There is no clear “winner” or better option; it really all comes down to your specific use case. Whichever you select, you can rest assured that both are fully capable, open source solutions with active communities using and supporting them for the foreseeable future.