Part of Instaclustr’s service is helping our customers make better use of Apache Cassandra. This post outlines some of the most common issues that we see and help our customers overcome.
At Instaclustr, we often work with our customers to help diagnose and correct issues with their Cassandra data models. This occurs either as part of a scheduled review or when we see issues arise in production.
From this experience, we’ve identified a few data modelling traps that we have seen many customers fall into and that anyone starting with Cassandra should be aware of.
Partition key
Firstly, understanding the role of the partition key in Cassandra is critical. The partition key determines the distribution of data across nodes and is also the primary method for Cassandra to look up data. A good partition key must have:
- a high cardinality (i.e. a large number of distinct values);
- a reasonably consistent and bounded number of records for each value of the partition key (i.e. you don’t want most partitions to have 10 records and one to have 1 million);
- low volatility (i.e. once written, records don’t often change their partition key); and
- a good match with the required retrieval patterns of the application.
Of course, in some cases it won’t be possible to fit these requirements perfectly. Where your partition key falls short of them, careful consideration and testing are required. Also, be aware of how to specify the partition key in Cassandra: PRIMARY KEY (field1, field2) is not the same as PRIMARY KEY ((field1, field2)). The first uses only field1 as the partition key (with field2 as a clustering column), while the second uses a composite of both fields.
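To make the difference between the two PRIMARY KEY forms concrete, here is a minimal Python sketch that groups a handful of rows the way each definition would partition them. It is a toy model only: the sample rows and the helper partition_sizes are hypothetical, and no Cassandra driver or real hashing is involved.

```python
from collections import Counter

# Hypothetical sample rows as (field1, field2) pairs, e.g. (customer, day).
rows = [
    ("acme", "2024-01-01"),
    ("acme", "2024-01-02"),
    ("acme", "2024-01-03"),
    ("globex", "2024-01-01"),
]


def partition_sizes(rows, key_columns):
    """Count rows per partition when the partition key consists of the
    columns at the given tuple positions."""
    return Counter(tuple(row[i] for i in key_columns) for row in rows)


# PRIMARY KEY (field1, field2): only field1 is the partition key;
# field2 is a clustering column inside each partition.
single = partition_sizes(rows, key_columns=(0,))

# PRIMARY KEY ((field1, field2)): both fields together form the partition key.
composite = partition_sizes(rows, key_columns=(0, 1))

for name, sizes in (("(field1, field2)", single), ("((field1, field2))", composite)):
    print(name, "partitions:", len(sizes), "largest:", max(sizes.values()))
```

Running the same kind of count over a realistic sample of your own data is a quick way to spot low cardinality or a single oversized partition before the model reaches production.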
Deleting data
Secondly, be aware of Cassandra’s approach to deleting data and its impacts. Deleted rows are not removed immediately; they are marked with tombstones and are not actually removed from disk until some time after gc_grace_seconds (default 10 days) expires. Also, because primary key columns cannot be updated in place, changing a row’s partition key means a delete followed by an insert. As a result, tombstones can build up in your tables to the point where Cassandra spends more time wading through deleted data than reading good data, and very significant performance issues can result. Depending on your application, a different data model, compaction strategy or gc_grace_seconds setting may be required. One common example where we see this issue is implementing queues or queue-like constructs in a table.
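The queue case can be illustrated with a rough Python model of a single queue partition. This is a sketch under simplifying assumptions, not a description of Cassandra internals: the function simulate_queue and its parameters are hypothetical, and it optimistically assumes compaction drops every tombstone the moment gc_grace_seconds has elapsed (in practice removal can happen later).

```python
SECONDS_PER_DAY = 24 * 60 * 60
GC_GRACE_SECONDS = 10 * SECONDS_PER_DAY  # default gc_grace_seconds: 864000 s


def simulate_queue(messages_per_day, backlog, days):
    """Toy model of one queue partition.

    Each day `messages_per_day` messages are written, consumed and deleted,
    while `backlog` messages remain live. Returns a list of
    (day, live_rows, tombstones) tuples.
    """
    tombstones = []  # (deletion_time_in_seconds, count) pairs
    history = []
    for day in range(days):
        now = day * SECONDS_PER_DAY
        # Optimistic assumption: tombstones older than gc_grace_seconds
        # are purged immediately by compaction.
        tombstones = [(t, n) for (t, n) in tombstones if now - t < GC_GRACE_SECONDS]
        tombstones.append((now, messages_per_day))
        total = sum(n for _, n in tombstones)
        history.append((day, backlog, total))
    return history


history = simulate_queue(messages_per_day=1000, backlog=10, days=30)
day, live, dead = history[-1]
print(f"day {day}: {live} live rows, {dead} tombstones")
```

Even in this best case, a read of the queue has to skip past roughly ten days of deletes to reach a handful of live rows, which is why queue-like tables tend to need a different data model or tuned settings.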
Secondary indexes
Secondary indexes are the final area where we commonly see issues. As with any database, maintaining indexes is not free. However, Cassandra secondary indexes have some additional factors that you need to be aware of. These are explained in detail here, but in summary:
- only index low (but not too low) cardinality columns;
- don’t index columns that are frequently updated or deleted; and
- be wary if you have very large partitions.
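As a rough illustration of the cardinality guideline above, the following Python sketch classifies a sample of column values. The function index_candidate and both thresholds are hypothetical values chosen purely for illustration; they are not Cassandra settings or recommendations, and any real cut-off depends on your data and cluster.

```python
# Hypothetical thresholds, for illustration only.
MIN_DISTINCT_VALUES = 3    # fewer distinct values than this: "too low"
MAX_DISTINCT_RATIO = 0.5   # distinct/total above this: "too high"


def index_candidate(values):
    """Classify a sample of column values by cardinality."""
    values = list(values)
    if not values:
        raise ValueError("sample is empty")
    distinct = len(set(values))
    if distinct < MIN_DISTINCT_VALUES:
        return "cardinality too low"
    if distinct / len(values) > MAX_DISTINCT_RATIO:
        return "cardinality too high"
    return "plausible candidate"


print(index_candidate([True, False] * 50))             # boolean flag
print(index_candidate(range(100)))                     # unique per row
print(index_candidate([i % 20 for i in range(1000)]))  # 20 distinct values
```

A check like this only covers the first guideline; how often the column is updated or deleted, and how large the partitions are, still need to be assessed separately.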
We’ve recently published all the key items we look for in a data model review in a help article available here. Of course, these items only help you identify issues; developing the correct solutions will generally be specific to your application. If you are an Instaclustr customer on our medium or enterprise support plan, then this advice can be provided as part of your included Cassandra advice hours (and as a paid service for other customers). Contact us at [email protected] to find out more about this today.