One topic that commonly comes up when discussing Apache Cassandra with large enterprise clients is whether Cassandra can match feature X (audit logging, encryption at rest, column-level security, etc.) that is supported by Oracle, SQL Server or some other common enterprise RDBMS technology. In many cases, the answer to this question is no – Cassandra does not necessarily boast the same range of features (or, more unkindly, the same feature bloat) as leading RDBMS products.
However, enterprise security standards exist for good reasons, so the overall application solution still has to address them. A key part of our answer to this question, and best-practice security in our view, is to encrypt data as close as possible to the point of collection and, at the latest, in the application layer.
We’re not alone in this viewpoint. Werner Vogels, CTO of AWS, has said: “We’ve got quite a few customers who’ve moved to 100% encryption. We really want to move our customers to a world where they own the keys and as such, they are the only ones who decide who has access to the data, not anybody else, not us as a provider.” (Business Insider)
Encrypting data in the application layer allows you to meet many typical enterprise database security standards while maintaining a horizontally scalable and highly available architecture. For example:
- Encryption at rest – data is encrypted before it’s even received by the database and so by definition will be encrypted at rest.
- Authorization and enterprise I&AM integration – regardless of database level integration, applications will likely need to be integrated with enterprise I&AM security providers to meet functional requirements. Once the application holds the keys to unlock the data, this integration can be leveraged to implement authorization requirements at whatever level of granularity is required. Data can even be encrypted using a key or password controlled by the end user, providing a very high guarantee of access restriction on the data.
- Access logging – applications are free to implement whatever access logging is required, along with potentially much richer context than is typically available at the database layer.
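As a rough illustration of the "encrypt before the database ever sees it" idea, here is a minimal sketch of a field-level wrapper class built on the JDK's own `javax.crypto` API. The class name `FieldCipher` and the choice of AES-GCM with a random 96-bit nonce are our assumptions for the example, not a prescribed design, and key management (where the key comes from, how it is rotated) is deliberately left out.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

/** Hypothetical helper: encrypts individual field values before they are written to the database. */
final class FieldCipher {
    private static final int IV_BYTES = 12;   // 96-bit nonce, the usual size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    private final SecretKey key;

    FieldCipher(SecretKey key) {
        this.key = key;
    }

    /** Returns nonce followed by ciphertext and tag, suitable for a blob column. */
    byte[] encrypt(String plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv); // a fresh nonce for every value
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
    }

    /** Reverses encrypt(); throws if the stored bytes were tampered with or the key is wrong. */
    String decrypt(byte[] stored) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
        byte[] plaintext = cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
        return new String(plaintext, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        // Assumption: a throwaway key generated in-process; a real application
        // would obtain its key from a key management system.
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        FieldCipher cipher = new FieldCipher(generator.generateKey());

        byte[] stored = cipher.encrypt("sensitive value");
        System.out.println(cipher.decrypt(stored));
    }
}
```

Because a new nonce is drawn for each call, encrypting the same value twice produces different bytes. That is what you want for non-key columns, but it also means this particular scheme cannot be used for columns you need to match on.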
For those coming from a relational database background, encrypting the data in the application may seem like it comes with a functional cost that is hard to bear. However, in the context of the more restricted query model of Cassandra, the functional cost of encrypting in the application is very limited. Consider the following:
- Partition keys (the subset of the primary key columns that is used to determine the distribution of data amongst the nodes): these keys are actually translated into a hash value by Cassandra and can therefore only be used with equality operators. Equality can be evaluated between encrypted values just as well as unencrypted ones, provided the encryption is deterministic (the same plaintext always produces the same ciphertext), so there is no impact here.
- Clustering keys (the remaining subset of the primary key after the partition key columns are taken out): clustering key columns impact the ordering of data on disk and can be used for range queries (> and <). Encrypting these columns can therefore reduce available query functionality (as values would need to be decrypted before being evaluated). However, by far the most common use case for range queries is querying on date ranges (for a series of events) and it is hard to think of many situations where dates themselves are sensitive once any associated identifying data is encrypted.
- Non-key values: Cassandra does not (except with ALLOW FILTERING, which is generally not recommended) allow filtering on non-key columns. Non-key data can therefore be encrypted without any real loss in functionality. The one possible loss is if you want to use Cassandra aggregation to, for example, calculate a sum of values. Again, it’s hard to think of a situation where the value you are going to calculate a sum on is sensitive once identifying data is encrypted. The exception to this, which requires closer examination, is if you are planning on using secondary index search technology such as the Lucene index or Apache Spark to access your data. In many cases, careful consideration of application design and what to encrypt can resolve these limitations.
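The partition key point depends on the protected value being deterministic: the same input must always map to the same stored bytes, otherwise an equality lookup will never match. One way to get that property, sketched below, is to store a keyed hash (HMAC-SHA256) of the sensitive identifier as the partition key. Note that this is a swapped-in technique for illustration rather than encryption proper: the token cannot be decrypted, so if the original value is needed it would have to be kept, encrypted, in a non-key column. The class name `PartitionKeyToken` is hypothetical.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;

/** Hypothetical helper: derives a deterministic keyed token to use in place of a sensitive partition key. */
final class PartitionKeyToken {
    private final SecretKeySpec key;

    PartitionKeyToken(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** The same input and secret always yield the same token, so equality lookups still work. */
    String tokenFor(String plaintextKey) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(key);
        byte[] digest = mac.doFinal(plaintextKey.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        // Assumption: a hard-coded demo secret; a real application would load it
        // from a key management system.
        PartitionKeyToken tokens =
                new PartitionKeyToken("demo-secret".getBytes(StandardCharsets.UTF_8));

        String onWrite = tokens.tokenFor("customer-42"); // value bound as the partition key on insert
        String onRead = tokens.tokenFor("customer-42");  // value bound in the WHERE clause on lookup
        System.out.println(onWrite.equals(onRead));
    }
}
```

The trade-off is that anyone who can read the table can see which rows share a key, even though they cannot see what that key is. That is inherent to any deterministic scheme and is usually acceptable for identifiers, but it is worth weighing for low-cardinality values.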
There are many ways you can implement this encryption, including standard encryption libraries (such as Apache Commons Crypto) called by wrapper classes in your code, or a driver that supports encryption such as that provided by our partner baffle.io.
In summary, we believe that encrypting as close as possible to the point of data collection, rather than trying to protect data at many points in your application stack, is the best approach to protecting it. With Apache Cassandra, the cost you pay for implementing this encryption may not be as significant as it first seems.