Introduction
In this blog, we give a brief overview of several data and privacy regulations and some of the common themes that emerge from them. We then discuss the challenges that Apache Kafka's architecture poses in meeting these regulatory requirements.
Organizations need to evaluate and design their end-to-end data pipelines on a case-by-case basis, and they are responsible for any third parties that consume an individual's data. This responsibility is often managed through data and privacy agreements with those third parties.
This article is not intended to provide legal advice and you should consult with your own legal or compliance advisors to ensure you are complying with applicable laws and regulations.
Data and Privacy Regulations
The growth of the new data economy has driven an increase in regulation protecting personal data and privacy. One wide-ranging framework is the General Data Protection Regulation (GDPR), which may apply to an organization that holds data on EU individuals regardless of where the organization is located.
Other regulatory frameworks include, but are not limited to, the Australian Privacy Principles (APP) contained in Australia's Privacy Act and the California Consumer Privacy Act (CCPA). As data volumes and the applications that process them continue to grow, more countries are likely to introduce similar regulations to protect individual rights.
While the legalities surrounding each of the frameworks are different, in general, they enable data protection laws to safeguard the personal data of individuals (data subjects). They aim to give control to individuals over their personal data and to standardize the data and privacy regulatory environment for businesses within their jurisdiction. Please refer to the regulations applicable in your jurisdiction to see how they may impact your individual case. The following should not be taken as legal advice in dealing with any specific regulation.
The common themes we found across these frameworks are noted below:
- Protecting Customer Data
- Organizations must ensure that personal data is processed securely using appropriate technical and organizational measures.
- Deletion Rights
- Data subjects have the right to request the deletion of their personal data as per the conditions laid out by the applicable laws. There are exceptions, but in general, they can exercise their right to delete certain data that an organization may hold about them.
- Right to Access Information
- Data subjects can request access to their personal data and information about how it is being processed.
- Right to Correct Information
- Data subjects can request corrections to their personal data if it is inaccurate or incomplete.
- Consent
- Organizations must obtain clear and explicit consent from data subjects before processing their personal data.
The Challenges in Apache Kafka
Apache Kafka's core design is optimized for high-throughput, low-latency data streaming rather than for data storage and security. If data is stored in Kafka for long durations, maintaining a privacy-compliant Kafka system can involve significant overhead, especially in large-scale deployments that require long data retention.
Further, when data flows through Kafka, the organization's responsibility does not end there: it is responsible for ensuring that the downstream applications that consume the data, including third parties, use the data in a compliant manner. Organizations therefore need to take an end-to-end approach to protecting data subjects.
In this section, we map Kafka's compliance challenges to the themes identified above. These challenges can be minimized if Kafka is used as a streaming data application configured so that no data is stored longer than a prescribed timeframe, for example no more than 30 days, as the sketch below illustrates.
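As one illustration, retention can be capped per topic with the `retention.ms` setting. The following is a minimal sketch using the Kafka AdminClient; the bootstrap address and topic name are placeholders, and a real deployment would typically apply this through its own configuration management.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionCap {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // "user-events" is a hypothetical topic name.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "user-events");

            // Cap retention at 30 days, expressed in milliseconds.
            ConfigEntry retention =
                    new ConfigEntry("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000));

            admin.incrementalAlterConfigs(Map.of(topic,
                    Collections.singleton(new AlterConfigOp(retention, AlterConfigOp.OpType.SET))))
                 .all().get();
        }
    }
}
```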
Protecting Customer Data
Apache Kafka does not natively support functionality such as data anonymization, pseudonymization, or automatic data deletion. Kafka as a tool would not normally be expected to implement these; the responsibility falls on the upstream systems that stream data into Kafka and on the business as a whole.
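For example, an upstream producer might pseudonymize direct identifiers before records ever reach Kafka. The sketch below hashes an identifier with a secret salt; this is one illustrative approach, not a prescribed one, and the hardcoded salt stands in for a secret held in a key management system.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class Pseudonymizer {
    // The salt would come from a key management system; hardcoded here for illustration only.
    private static final String SALT = "replace-with-managed-secret";

    // Replaces a direct identifier (e.g., an email address) with a stable
    // pseudonym before the record is produced to Kafka.
    static String pseudonymize(String identifier) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(SALT.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest.digest(identifier.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pseudonymize("jane.doe@example.com"));
    }
}
```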
Furthermore, while Apache Kafka supports encryption in transit, there is no built-in support for encryption at rest. Organizations must therefore either deploy third-party disk or volume encryption or accept the risk that a compromised storage medium will expose application data.
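Encryption in transit, by contrast, is a client and broker configuration concern. Below is a minimal sketch of client properties enabling TLS; the broker address, truststore path, and password are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class TlsClientProperties {
    public static Properties tlsProperties() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093"); // placeholder
        // Encrypt traffic between the client and the brokers.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
        // Truststore holding the broker CA certificate; path and password are placeholders.
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");
        return props;
    }
}
```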
Deletion Rights
Apache Kafka's core design principle is an immutable log: once data is written to a topic, it cannot be altered. This immutability poses a significant challenge in implementing the right to erasure, even when data retention is capped at a prescribed maximum timeframe such as 30 days, because targeted deletion of an individual's records is not possible before the retention period expires.
One may use careful compaction strategies along with well-defined schemas to identify and erase data; however, log compaction does not apply to backups and is not available with the latest Tiered Storage features as of this writing.
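As a sketch of that pattern, records can be keyed by a data-subject identifier on a topic with `cleanup.policy=compact`, and erasure can be requested by publishing a tombstone (a record with a null value) so that compaction eventually discards earlier records for that key. The topic name and key below are hypothetical, and old values are removed only after the log cleaner runs.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ErasureTombstone {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone: on a topic with cleanup.policy=compact,
            // compaction eventually discards all earlier records with this key.
            producer.send(new ProducerRecord<>("user-profiles", "subject-42", null));
            producer.flush();
        }
    }
}
```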
Right to Access Information
There is no native support for retrieving the specific messages that belong to a particular data subject unless the coordinates (topic, partition, and offset) of every such message are known. Because Apache Kafka is not optimized for ad-hoc queries or storage, this task is resource-intensive and difficult to implement efficiently without a noticeable performance impact.
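A subject-access scan therefore tends to look like the sketch below: consume the topic from the earliest offset and filter by key. The topic, group ID, and key are hypothetical; on large topics this full scan is exactly the resource-intensive work described above, and a production version would loop until it reached the partitions' end offsets.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SubjectAccessScan {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "sar-scan"); // hypothetical group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // read the full history
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-events")); // hypothetical topic
            // A single poll keeps the sketch short; a real scan would poll in a loop.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                if ("subject-42".equals(record.key())) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```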
Right to Correct Information
The challenges described for the right to erasure apply to correction rights as well. Correction can be even more complex: downstream applications must update their own data stores to reflect the corrected information, rather than simply identifying and deleting a data subject's records.
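To illustrate the downstream side, the sketch below materializes a compacted topic into an application-local store, overwriting values for corrections and removing entries for tombstones. The in-memory map stands in for a real database, and the topic and group names are hypothetical.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DownstreamMaterializer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "profile-materializer"); // hypothetical group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // In-memory stand-in for the downstream application's own data store.
        Map<String, String> store = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-profiles")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (record.value() == null) {
                        store.remove(record.key());              // tombstone: erase the subject
                    } else {
                        store.put(record.key(), record.value()); // latest value wins: correction applied
                    }
                }
            }
        }
    }
}
```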
Consent
Managing consent does not normally fall within the scope of Apache Kafka. Kafka facilitates data movement but does not inherently manage or “use” data. Managing consent is fundamentally a business responsibility. Organizations often need an appropriate legal basis (such as the data subject’s consent) for intended uses of their data, including downstream applications consuming data from Kafka.
While Kafka does not natively support consent management, organizations using consent management platforms can use metadata and headers in Kafka messages to carry consent information. Downstream systems can then enforce permissions based on this metadata, ensuring that data processing adheres to any applicable user consent and complies with data protection regulations.
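A minimal sketch of that pattern follows; the header name `consent-purposes` and its values are assumptions for illustration, with a consent management platform as the actual source of truth.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConsentAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("user-events", "subject-42", "{\"action\":\"signup\"}");
            // Hypothetical header naming the purposes the subject has consented to;
            // a consent management platform would be the source of truth for this.
            record.headers().add("consent-purposes",
                    "analytics,marketing".getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
```

A downstream consumer could then read the same metadata via `record.headers().lastHeader("consent-purposes")` and skip any record that does not carry the purpose it requires.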
Conclusion
While Apache Kafka’s robust streaming capabilities make it an excellent choice for real-time data processing, its design principles and architectural choices pose significant challenges for regulatory compliance when used with long term data retention policies. Addressing these challenges requires a combination of technical solutions—such as custom retention policies, encryption strategies, and operational practices—alongside a strong governance framework to ensure consistent policy enforcement.
In this blog we have covered some of the general themes that are common in data and privacy regulatory frameworks. In the next blog in this series, we will address the strategies that NetApp has in place for customers using the managed Apache Kafka offering, along with some general ideas on how Kafka applications can stay compliant with data and privacy regulations.