Overview
Multi-value data types (sets, lists and maps) are a powerful feature of Cassandra, aiding you in denormalisation while allowing you to still retrieve and set data at a very fine-grained level. However, some of Cassandra’s behaviour when handling these data types is not always as expected and can cause issues.
In particular, there can be hidden surprises when you update the value of a collection type column. For simple-type columns, Cassandra performs an update by simply writing a new value for the cell and the most recently written value wins when the data is read. However, when you overwrite a collection Cassandra can’t simply write the new elements because all the existing elements in the map have their own individual cells and would still be returned alongside the new elements whenever a read is performed on the map.
The options
This leaves Cassandra with two options:
- Perform a read and discover all the existing map elements and either delete them or update them if they were specified in the overwrite.
- Forget about all existing elements in the map by deleting them.
Option 1 doesn’t sound very optimised, does it? A read for every write you perform? Ouch.
Cassandra chooses option 2 because it just can’t resist those performance gains. It knows you’re performing an overwrite, and that you obviously don’t care about the contents of those columns, so it will delete them for you, and we can all pretend they never existed in the first place.
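To make the trade-off concrete, here is a minimal Python sketch of the idea. It is a toy model, not Cassandra's storage engine: the `read_map` helper, the dictionary layout and the small integer timestamps are all assumptions made for illustration. It shows why simply writing the new elements leaves the old ones visible, and how a delete marker timestamped just before the new write hides them without any read.

```python
# Toy model only: each map element is its own cell, stored as
# element_key -> (value, write_timestamp). Names here are hypothetical.

def read_map(cells, tombstone_ts=None):
    """Return the live elements: those written after the tombstone, if any."""
    return {
        key: value
        for key, (value, ts) in cells.items()
        if tombstone_ts is None or ts > tombstone_ts
    }

# First write at timestamp 100: {'bldg1': '4a'}
cells = {"bldg1": ("4a", 100)}

# Overwrite at timestamp 200 with {'bldg4': '4c'} by only writing the new cell.
cells["bldg4"] = ("4c", 200)

# Without a tombstone the old element is still returned.
naive = read_map(cells)

# With a tombstone written at 199, everything older is shadowed.
shadowed = read_map(cells, tombstone_ts=199)

print(naive)     # {'bldg1': '4a', 'bldg4': '4c'}
print(shadowed)  # {'bldg4': '4c'}
```

Note that the write path never had to look at the existing cells; the filtering only happens at read time, which is exactly why the tombstones have to be scanned by later queries.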
Or so you thought… until one day your queries start failing because a single read has hit 100k tombstones (the default tombstone_failure_threshold). You didn’t expect that, especially when you never delete any data.
In most cases, compactions will just handle this problem for you and the tombstones will be gone before you even get close to the query failure limit. However, compaction strategies aren’t perfect and depending on how much you overwrite, plus how well compactions remove those tombstones, there are many cases where this behaviour can become a huge issue. If you are performing many writes, and all of them are overwrites where a collection type is involved, you will be generating a tombstone for every single write.
Examples for avoiding the issue
I’ve created a very basic schema with a map and a few fields, as below:
    CREATE KEYSPACE tombs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
    CREATE TABLE tombs.staff (
        id text PRIMARY KEY,
        name text,
        age int,
        locations map<text, text>
    );
I then inserted a single row and performed a flush:
    > insert into staff (id, locations) values ('a', {'bldg1':'4a'});
    $ nodetool flush
And I now have an SSTable in my tombs.staff data directory.
    /var/lib/cassandra/data/tombs/staff/ $ ls -l
    -rw-r--r-- 1 root root   43 Jun 20 11:53 tombs-staff-ka-1-CompressionInfo.db
    -rw-r--r-- 1 root root   82 Jun 20 11:53 tombs-staff-ka-1-Data.db
    -rw-r--r-- 1 root root   10 Jun 20 11:53 tombs-staff-ka-1-Digest.sha1
    -rw-r--r-- 1 root root   16 Jun 20 11:53 tombs-staff-ka-1-Filter.db
    -rw-r--r-- 1 root root   15 Jun 20 11:53 tombs-staff-ka-1-Index.db
    -rw-r--r-- 1 root root 4449 Jun 20 11:53 tombs-staff-ka-1-Statistics.db
    -rw-r--r-- 1 root root   83 Jun 20 11:53 tombs-staff-ka-1-Summary.db
    -rw-r--r-- 1 root root   91 Jun 20 11:53 tombs-staff-ka-1-TOC.txt
Using sstable2json to analyse the data, we have one key, a, as expected. However, it has two locations entries, despite the fact that we only did one write.
This is to do with the map, and the whole overwrite thing I was talking about earlier. Already we can see that C* has written a range tombstone for the locations cell immediately before writing the value that I inserted.
    $ sstable2json tombs-staff-ka-1-Data.db
    [
    {"key": "a",
     "cells": [["","",1466423546119508],
               # Below is a range tombstone against locations.
               # Note the "t" (for tombstone) and that the timestamp is just before the next entry's.
               ["locations:_","locations:!",1466423546119507,"t",1466423546],
               # And here we have the map entry, hex encoded, where 626c646731="bldg1" and 3461="4a".
               ["locations:626c646731","3461",1466423546119508]]}
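If you want to check the encoded names and values in a dump like this yourself, a few lines of Python are enough. This is only a convenience sketch: the helper names `decode_cell_name` and `decode_value` are made up for this post, and it assumes text keys and values, as in the map<text, text> column used here.

```python
def decode_cell_name(cell_name):
    """Split 'column:hexkey' and decode the hex-encoded element key."""
    column, _, hex_key = cell_name.partition(":")
    return column, bytes.fromhex(hex_key).decode("utf-8")


def decode_value(hex_value):
    """Decode a hex-encoded text value."""
    return bytes.fromhex(hex_value).decode("utf-8")


column, key = decode_cell_name("locations:626c646731")
value = decode_value("3461")
print(column, key, value)  # locations bldg1 4a
```

The tombstone entry ("locations:_" to "locations:!") is a range marker rather than an encoded key, so it should not be passed to these helpers.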
Now, this is kind of a spoiler, as we haven’t actually done any “overwrites” yet, but we’ve identified the feature we’re talking about. This is because in Cassandra, overwrites, updates, and inserts, are really all just the same thing. The insert against the map will do the same thing whether the key already exists or not.
Anyway, we can see how this delete-first strategy works if we simply insert another record with the same key:
    > insert into staff (id, locations) values ('a', {'bldg4':'4c'});
    $ nodetool flush
We now have two SSTables: tombs-staff-ka-1-Data.db and tombs-staff-ka-2-Data.db. If we run sstable2json on the new SSTable, we see a very similar entry:
    $ sstable2json tombs-staff-ka-2-Data.db
    [
    {"key": "a",
     "cells": [["","",1466424601968750],
               ["locations:_","locations:!",1466424601968749,"t",1466424601],
               ["locations:626c646731","3464",1466424601968750]]}
Nothing surprising, and furthermore, if we trigger a major compaction against our 2 SSTables:
    $ nodetool compact tombs staff
And run sstable2json against our new SSTable…
    $ sstable2json tombs-staff-ka-3-Data.db
    [
    {"key": "a",
     "cells": [["","",1466424601968750],
               ["locations:_","locations:!",1466424601968749,"t",1466424601],
               ["locations:626c646731","3464",1466424601968750]]}
We have the latest range tombstone plus the latest insert. Compaction has, as expected, gotten rid of the previous insert and its tombstone, as it knows everything older than the latest range tombstone is moot.
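The merge that compaction performed here can be sketched in a few lines of Python. This is a simplified toy model under stated assumptions: the `compact` function, the dictionary fields and the short timestamps are hypothetical, the row marker cell is ignored, and the newest tombstone is assumed to still be inside gc_grace_seconds, so it is kept in the result.

```python
# Toy model of merging one row's entries from several SSTables.
# Field names are illustrative, not Cassandra's on-disk format.

def compact(sstables):
    """Keep the newest range tombstone and only the cells written after it."""
    entries = [entry for table in sstables for entry in table]
    tombstones = [e for e in entries if e["kind"] == "range_tombstone"]
    cells = [e for e in entries if e["kind"] == "cell"]
    if not tombstones:
        return cells
    newest = max(tombstones, key=lambda e: e["ts"])
    # Assumption: the newest tombstone is within gc_grace_seconds, so it stays.
    return [newest] + [c for c in cells if c["ts"] > newest["ts"]]


sstable_1 = [
    {"kind": "range_tombstone", "ts": 99},
    {"kind": "cell", "key": "bldg1", "value": "4a", "ts": 100},
]
sstable_2 = [
    {"kind": "range_tombstone", "ts": 199},
    {"kind": "cell", "key": "bldg4", "value": "4c", "ts": 200},
]

merged = compact([sstable_1, sstable_2])
for entry in merged:
    print(entry)
```

Two tombstones and two cells go in; one tombstone and one cell come out. The catch is that this only happens when both SSTables take part in the same compaction.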
Now you can start to see where issues can arise when overwriting a key with a collection type. If it weren’t for the compaction, I’d have 2 tombstones for that single row across 2 SSTables. Obviously, it’s very likely those SSTables will compact and the tombstones will get cleared out, however things are not always as clear cut, especially when you are frequently overwriting keys and the tombstones get spread across many SSTables of differing sizes, causing tombstone bloat that may not be removed when left up to minor compactions.
So how can we avoid this potential catastrophe? A simple solution is to store the collection as JSON in a single column and leave the updates to your application. However, there is an alternative: the append and subtraction operators that CQL provides. These operators modify the collection without having to perform a read, and they don’t create any range tombstones. This works for use cases where you only need to insert, append or prepend elements; if you frequently find yourself having to rewrite a whole collection, you will need to take a different approach. You can also declare a collection as frozen, which gives the desired overwrite behaviour, but you will no longer be able to add and remove elements using the +, - and [] operators.
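As a rough illustration of the JSON option, the sketch below serialises the whole map into one string, which would live in a plain text column and be overwritten like any other simple value. The function names `overwrite_locations` and `add_location` are hypothetical, and the database calls are left out on purpose; only the application-side handling is shown.

```python
import json


def overwrite_locations(new_locations):
    """Serialise the whole map into one text value (a single cell)."""
    return json.dumps(new_locations, sort_keys=True)


def add_location(stored_json, building, desk):
    """Element-level update done by the application: read, modify, write back."""
    locations = json.loads(stored_json)
    locations[building] = desk
    return json.dumps(locations, sort_keys=True)


stored = overwrite_locations({"bldg4": "4c"})
print(stored)   # {"bldg4": "4c"}

updated = add_location(stored, "bldg1", "4a")
print(updated)  # {"bldg1": "4a", "bldg4": "4c"}
```

The trade-off is that every element-level change becomes a read-modify-write in your application, so concurrent writers can overwrite each other's changes unless you guard against that yourself.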
Here is an example of performing collection operations on a list.
    ALTER TABLE staff ADD leave_dates list<text>;
    # Creates a tombstone and an entry in the list
    insert into staff (id, leave_dates) values ('c', ['20160620']);
    $ nodetool flush
    $ sstable2json tombs-staff-ka-6-Data.db
    [
    {"key": "c",
     "cells": [["","",1466427765961455],
               ["leave_dates:_","leave_dates:!",1466427765961454,"t",1466427765],
               ["leave_dates:484b79b036e711e681757906eb0f5a6e","3230313630363230",1466427765961455]]}
    ]
    # Prepends an element to the list without creating any additional tombstones
    UPDATE staff SET leave_dates = [ '20160621' ] + leave_dates where id='c';
    $ nodetool flush
    # The new SSTable has only a single entry in the list, no extra tombstone.
    # This works the same for appending to the list as well.
    $ sstable2json tombs-staff-ka-7-Data.db
    [
    {"key": "c",
     "cells": [["leave_dates:af13b22fb5e911d781757906eb0f5a6e","3230313630363231",1466427869996855]]}
    ]
Be careful when using addition and subtraction on list types, as removing elements from a list can be an expensive operation: Cassandra has to read in the entire list in order to remove a single entry. This is not true for sets. Removing a single entry from a set requires no read, as Cassandra simply writes a tombstone for the matching cell.
See the below trace for deletion from a list, where we can clearly see C* performing a read query before making the modifications.
    # Removing a single index
    instaclustr@cqlsh:tombs> DELETE leave_dates[1] FROM staff where id='c';
    # Trace:
    activity                                               | timestamp               | source      | source_elapsed
    Execute CQL3 query                                     | 2016-06-20 13:34:26.214 | 52.37.11.43 | 0
    Parsing DELETE leave_dates[1] FROM staff where id='c'; | 2016-06-20 13:34:26.217 | 52.37.11.43 | 148
    Preparing statement                                    | 2016-06-20 13:34:26.217 | 52.37.11.43 | 272
    Executing single-partition query on staff              | 2016-06-20 13:34:26.217 | 52.37.11.43 | 794
    Acquiring sstable references                           | 2016-06-20 13:34:26.218 | 52.37.11.43 | 829
    Merging memtable tombstones                            | 2016-06-20 13:34:26.218 | 52.37.11.43 | 845
    Partition index with 0 entries found for sstable 13    | 2016-06-20 13:34:26.219 | 52.37.11.43 | 905
    Seeking to partition beginning in data file            | 2016-06-20 13:34:26.219 | 52.37.11.43 | 927
    Partition index with 0 entries found for sstable 12    | 2016-06-20 13:34:26.219 | 52.37.11.43 | 1037
    Seeking to partition beginning in data file            | 2016-06-20 13:34:26.219 | 52.37.11.43 | 1049
    Skipped 0/2 non-slice-intersecting sstables            | 2016-06-20 13:34:26.220 | 52.37.11.43 | 3391
    Merging data from memtables and 2 sstables             | 2016-06-20 13:34:26.220 | 52.37.11.43 | 3409
    Read 3 live and 1 tombstone cells                      | 2016-06-20 13:34:26.220 | 52.37.11.43 | 3523
    Determining replicas for mutation                      | 2016-06-20 13:34:26.221 | 52.37.11.43 | 3899
    Appending to commitlog                                 | 2016-06-20 13:34:26.222 | 52.37.11.43 | 3950
    Adding to staff memtable                               | 2016-06-20 13:34:26.222 | 52.37.11.43 | 3968
    Request complete                                       | 2016-06-20 13:34:26.218 | 52.37.11.43 | 4877
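Why does a list need that read when a set doesn't? The Python sketch below is one way to picture it. It is a toy model with hypothetical function names: list cells are named by opaque time-based ids (the two ids are copied from the sstable2json output earlier in this post), so an index only maps to a cell after the list has been read, whereas a set cell is named after the element itself, so the tombstone can be built blind.

```python
# Toy model: a list as stored, in list order, as (cell_id, value) pairs.
# '20160621' was prepended, so it comes first.
stored_list = [
    ("af13b22fb5e911d781757906eb0f5a6e", "20160621"),
    ("484b79b036e711e681757906eb0f5a6e", "20160620"),
]


def list_delete_by_index(stored_cells, index):
    """Needs the stored cells: this lookup stands in for the read."""
    cell_id, _value = stored_cells[index]
    return ("tombstone", cell_id)


def set_remove(element):
    """No read needed: the cell name is derived from the element itself."""
    return ("tombstone", element.encode("utf-8").hex())


print(list_delete_by_index(stored_list, 1))
# ('tombstone', '484b79b036e711e681757906eb0f5a6e')
print(set_remove("20160620"))
# ('tombstone', '3230313630363230')
```

In both cases only a single cell tombstone is written; the difference is purely in how much Cassandra has to know before it can write it.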
The same operators on the set type behave in a similar way. Note that appending and prepending don’t exist for sets; elements are simply added with + and removed with -.