Fixing a compaction mistake

Apache Cassandra’s log-structured storage engine delivers excellent write performance by ensuring that all inserts and updates are written sequentially to disk. Once data is written to disk as an sstable, it is never modified. When you change a record, the new version is written to the next sstable flushed to disk.

Over time, read performance can deteriorate as multiple versions of the same record accumulate: Cassandra has to read several sstables and compare each version of the record to determine which is newest.
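The write and read paths described above can be sketched in miniature. This is a toy model, not Cassandra's actual code: each "sstable" is just an in-memory dict of key to (timestamp, value), and reads resolve the newest version by timestamp:

```python
# Toy model of a log-structured store: each "sstable" is an immutable
# dict of key -> (timestamp, value). Updates never modify old sstables;
# they land in a new sstable instead.
sstables = []

def flush(memtable, ts):
    """Write a batch of updates out as a new immutable sstable."""
    sstables.append({k: (ts, v) for k, v in memtable.items()})

def read(key):
    """Check every sstable holding the key and keep the newest version."""
    versions = [t[key] for t in sstables if key in t]
    return max(versions)[1] if versions else None

flush({"user:1": "alice"}, ts=1)   # original insert
flush({"user:1": "alicia"}, ts=2)  # later update lands in a new sstable
print(read("user:1"))              # prints "alicia": newest timestamp wins
```

Note that `read` scans every sstable that might hold the key, which is exactly why read latency grows as versions pile up across more and more sstables.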

Cassandra runs a background task called compaction that periodically merges sstables into larger sstables, consolidating the various updates made to each row.
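The heart of that merge can be sketched in a few lines. This is a simplified illustration (sstables modelled as dicts of key to (timestamp, value), ignoring tombstones and on-disk format): only the newest version of each key survives into the output table:

```python
# Toy compaction: merge several immutable sstables into one,
# keeping only the newest version of each key.
def compact(sstables):
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged

old = [
    {"a": (1, "v1"), "b": (1, "v1")},
    {"a": (2, "v2")},              # newer version of "a"
]
print(compact(old))                # prints {'a': (2, 'v2'), 'b': (1, 'v1')}
```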

Cassandra has two compaction strategies, Size-Tiered and Leveled. Both have their pros and cons, but Size-Tiered has one particularly nasty gotcha: the major compaction. A major compaction merges all sstables into a single sstable, after which minor compactions effectively stop, because newly flushed sstables are never similar enough in size to merge with the single giant table.

If you have accidentally made this mistake on your production cluster, fear not: there are a couple of ways out.

sstable_split will split a specified sstable into a set of sstables of a specified size.

  • Check out the sstable_split branch and run ant.
  • Stop the node.
  • Run sstable_split sstablefilename.
  • Start the node again.
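Conceptually, the split undoes the giant merge: it cuts the single sstable back into size-bounded pieces that size-tiered compaction can work with again. A toy sketch of that idea (modelling the sstable as a sorted run of rows and splitting by row count rather than on-disk bytes, which is what the real tool uses):

```python
# Toy sstable_split: cut one big sorted run of rows into fixed-size
# chunks, each of which becomes its own small sstable. The real tool
# splits by on-disk size; we split by row count for illustration.
def split(rows, chunk_size):
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

giant = [f"row{i:03d}" for i in range(10)]
pieces = split(giant, chunk_size=4)
print([len(p) for p in pieces])    # prints [4, 4, 2]
```

Because rows in an sstable are stored in sorted order, each chunk of a split is itself a valid sorted run, which is why the resulting pieces are legitimate sstables in their own right.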

Make sure you back up the sstable before you remove it; at a minimum, move it out of the data directory instead of deleting it straight away.

The other option is to switch the table's compaction strategy from Size-Tiered to Leveled. This forces the sstables to be rewritten from scratch, after which you can switch back to Size-Tiered.

The second option is not recommended unless you absolutely cannot take a single node offline… ever.