Apache Kafka Tiered Storage

Apache Kafka’s Tiered Storage is a data management strategy that divides storage into two tiers: local and remote. Data moves asynchronously from the local tier to the remote tier based on configured retention values, which allows large volumes of data to be stored efficiently at reduced cost.

Background Knowledge 

In a typical Apache Kafka cluster, each broker has some local storage attached to it, and the data in the topics is stored only in local storage on these brokers. There are a few issues with this approach. As the amount of data increases, the local storage needs to be increased. Without Tiered Storage, a Kafka cluster’s storage can only be scaled by adding more broker nodes or by replacing current nodes with higher capacity nodes, which is not a cost-effective way of increasing storage. Adding new nodes also requires copying a lot of data, which makes operations difficult and time consuming.

With Tiered Storage in Apache Kafka, storage is divided into two tiers: local storage, where the most recent data is kept, and remote storage, where historical data is archived. References to the remote data are stored in the broker, so Kafka can quickly retrieve the data from the remote tier when it is needed. The tiered storage approach offers many benefits, and a few noteworthy ones are:

  1. Cost Optimization: Data can be stored according to how it is used. For example, high performance, expensive storage can be used for latency sensitive applications, and lower cost, slower storage for less frequently accessed data. This helps reduce overall storage costs.
  2. Scalability: With Tiered Storage, a Kafka cluster can be scaled more efficiently because compute and storage scale independently. Growing data volumes can be accommodated by expanding the remote storage rather than by adding new broker nodes just for storage capacity. A new node added for additional computing power also takes less time to catch up, since the remote storage is shared across all nodes.
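To make the cost and scalability argument concrete, here is a rough sizing sketch in Python. It rests on simplifying assumptions that are not from the text above: every replica keeps only the local retention window on broker disk, the remote tier holds a single copy of the full retention window, and compression, index files and temporary overruns are ignored. The function names and all numbers are hypothetical examples.

```python
# Rough sizing sketch. All numbers are hypothetical examples, not measurements.

def local_only_gb(ingest_gb_per_day, retention_days, replication_factor):
    """Broker disk needed when every replica keeps the full retention window."""
    return ingest_gb_per_day * retention_days * replication_factor


def tiered_gb(ingest_gb_per_day, local_retention_days, total_retention_days,
              replication_factor):
    """Return (broker_disk_gb, remote_gb) for a Tiered Storage enabled topic.

    Assumes each replica keeps only the local retention window on disk and
    the remote tier holds one copy of the full retention window.
    """
    broker_disk = ingest_gb_per_day * local_retention_days * replication_factor
    remote = ingest_gb_per_day * total_retention_days
    return broker_disk, remote


# Example: 100 GB/day ingested, 30 days total retention, replication factor 3.
without_tiering = local_only_gb(100, 30, 3)
broker_disk, remote = tiered_gb(100, 1, 30, 3)
print(f"broker disk without tiering: {without_tiering} GB")
print(f"broker disk with 1 day local retention: {broker_disk} GB")
print(f"remote storage: {remote} GB")
```

Under these assumptions, broker disk drops from 9000 GB to 300 GB, with 3000 GB held in the cheaper remote tier. Real clusters will differ, so treat this only as a way to reason about the trade-off.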

How it works  

In the tiered storage approach, there are two tiers of storage: local and remote. Local is usually the faster, more expensive storage, and remote can be slower, more cost-effective storage. Enabling Tiered Storage on an Apache Kafka cluster does not change the way producers and consumers interact with the cluster; it only affects the data retention and data retrieval processes.

Data retention and segment management 

When producers write to a Tiered Storage enabled topic, the data is first stored in local storage as normal, on the local disks of the Kafka brokers, where it is organized into segments. A Tiered Storage enabled topic has additional retention settings that control the overall retention threshold and how long data stays in the local tier. Segment size also matters, because retention is applied to whole segments. Closed segments are copied asynchronously to remote storage, and the local retention settings determine when their local copies can be deleted. For each uploaded segment, the leader saves metadata about the remote object in an internal topic. This metadata is then used to build the remote references within the broker that keep track of the data.
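As an illustration of the retention settings, the sketch below builds the topic-level configuration for a Tiered Storage enabled topic as a plain Python dict. The keys (`remote.storage.enable`, `retention.ms`, `local.retention.ms`, `segment.bytes`) are Apache Kafka topic configs; the helper name, the values, and the sanity check are assumptions for illustration, and the settings exposed on a managed cluster may differ.

```python
# Topic-level settings for a tiered topic, expressed as a plain dict.
# The values are examples only. The helper assumes positive values and does
# not handle Kafka's special values such as -1 (unlimited retention).

DAY_MS = 24 * 60 * 60 * 1000


def tiered_topic_config(retention_ms, local_retention_ms, segment_bytes):
    """Build a topic config dict and sanity-check the retention values."""
    if local_retention_ms > retention_ms:
        raise ValueError("local.retention.ms should not exceed retention.ms")
    return {
        "remote.storage.enable": "true",
        # Total retention across the local and remote tiers.
        "retention.ms": str(retention_ms),
        # How long segments stay on the broker's local disks.
        "local.retention.ms": str(local_retention_ms),
        # Segment size; retention is applied to whole segments.
        "segment.bytes": str(segment_bytes),
    }


config = tiered_topic_config(
    retention_ms=30 * DAY_MS,
    local_retention_ms=1 * DAY_MS,
    segment_bytes=512 * 1024 * 1024,
)
for key, value in config.items():
    print(f"{key}={value}")
```

The resulting key/value pairs can be supplied wherever topic configuration is set, for example at topic creation time.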

In situations such as high data ingestion rates or temporary errors like network connectivity issues, uploads to remote storage can fall behind and local storage may temporarily exceed the configured local retention threshold, so additional data accumulates in the local tier. Kafka will not delete this data from local storage until it has been successfully uploaded to remote storage.
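The rule that local data is only removed after a successful upload can be sketched as a small model. This is not Kafka's implementation: the `Segment` class, the `locally_deletable` function and the size-based threshold are hypothetical names chosen to illustrate the behaviour described above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    uploaded: bool  # True once the copy to remote storage has succeeded


def locally_deletable(segments, local_retention_bytes):
    """Return base offsets of segments that may be removed from local disk.

    `segments` must be ordered oldest first and contain closed segments only.
    """
    total = sum(s.size_bytes for s in segments)
    deletable = []
    for seg in segments:
        if total <= local_retention_bytes:
            break  # back under the local retention threshold
        if not seg.uploaded:
            break  # never drop data that exists only in the local tier
        deletable.append(seg.base_offset)
        total -= seg.size_bytes
    return deletable


# Uploads have fallen behind: the two newest segments are not yet in remote storage.
segments = [
    Segment(base_offset=0, size_bytes=100, uploaded=True),
    Segment(base_offset=1000, size_bytes=100, uploaded=True),
    Segment(base_offset=2000, size_bytes=100, uploaded=False),
    Segment(base_offset=3000, size_bytes=100, uploaded=False),
]
print(locally_deletable(segments, local_retention_bytes=150))
```

In this example only the two uploaded segments are eligible for local deletion, which leaves 200 bytes on local disk even though the threshold is 150. The local tier stays over its threshold until the remaining uploads succeed.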

Data retrieval by consumer 

When a consumer fetches data from a Tiered Storage enabled topic, the broker serves it from the local disk if the data is available in local storage. If the requested data sits in remote storage, the broker streams it from there into its in-memory buffer (and on-disk cache), and then sends it back to the client.
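The routing decision can be pictured with a minimal sketch. It assumes the broker tracks three offsets per partition: the earliest offset still retained in either tier, the earliest offset still on local disk, and the end of the log. The function and parameter names are hypothetical and are not Kafka's internal API.

```python
def fetch_source(fetch_offset, log_start_offset, local_log_start_offset,
                 log_end_offset):
    """Decide which tier would serve a consumer fetch at `fetch_offset`.

    log_start_offset:        earliest offset retained in either tier
    local_log_start_offset:  earliest offset still present on local disk
    log_end_offset:          offset just past the last written record
    """
    if fetch_offset < log_start_offset or fetch_offset > log_end_offset:
        return "out_of_range"
    if fetch_offset >= local_log_start_offset:
        return "local"   # served straight from the broker's local disk
    return "remote"      # read from remote storage, then sent to the client


# Offsets 0 to 999 exist only in the remote tier; 1000 onwards are still local.
for offset in (50, 1000, 4999, 6000):
    print(offset, fetch_source(offset, 0, 1000, 5000))
```

Consumers do not need to know which branch is taken: the fetch request is the same in both cases, and only the latency differs.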

Limitations of Tiered Storage

As of Kafka 3.6.x, Tiered Storage has the following limitations. For the most current list, please refer directly to the Kafka project’s pages on Tiered Storage.

  • Tiered Storage is not supported for compacted topics.  
  • Multiple log directories on a broker (JBOD related features) are not supported.
  • If you enable Tiered Storage for a topic, you cannot deactivate it without first contacting the support team. 
  • Increasing the local retention threshold does not move segments already uploaded to remote storage back to local storage. The change only affects new data segments.

Please see our support documentation here. 

To provision an Instaclustr for Apache Kafka cluster with Tiered Storage enabled, please see Creating an Apache Kafka Cluster. 

Questions 

Please contact [email protected] for any further inquiries. 

By Instaclustr Support