Introduction

When using Apache Cassandra, a strong understanding of the concept and role of partitions is crucial for design, performance, and scalability. This blog covers the key information you need to know about partitions to get started with Cassandra, including how to define partitions, how Cassandra uses them, and the best practices and known issues.

Cassandra stores data with tunable consistency in partitions across a cluster, with each partition representing a set of rows. Partitioning is performed through a mathematical function and data locality is determined by the partition key.

Data partitioning is a common concept amongst distributed data systems. Such systems distribute incoming data into chunks called ‘partitions’. Features such as replication, data distribution, and indexing use a partition as their atomic unit. Data partitioning is usually performed using a simple mathematical function such as identity or hashing. The function uses a configured data attribute called the ‘partition key’ to group data into distinct partitions.

Consider an example where we have server logs as incoming data. This data can be partitioned using the log timestamp rounded to the hour. This partitioning configuration results in data partitions holding one hour’s worth of logs each. Here the partitioning function is the ‘identity’ function and the partition key is the timestamp rounded to the hour.
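
The server log example can be sketched in a few lines of Python. This is only an illustration, not how Cassandra itself works: the function names and the sample log entries are made up, and "rounded to the hour" is assumed here to mean truncating the timestamp down to the start of its hour.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(log_time: datetime) -> datetime:
    """Truncate a log timestamp to the start of its hour (assumed rounding rule)."""
    return log_time.replace(minute=0, second=0, microsecond=0)


def identity(key):
    """Identity partitioning function: the key itself names the partition."""
    return key


def partition_logs(logs):
    """Group (timestamp, message) pairs into one partition per hour."""
    partitions = defaultdict(list)
    for log_time, message in logs:
        partitions[identity(partition_key(log_time))].append(message)
    return dict(partitions)


# Hypothetical sample data, for illustration only.
logs = [
    (datetime(2024, 1, 1, 10, 5), "server started"),
    (datetime(2024, 1, 1, 10, 47), "cache warmed"),
    (datetime(2024, 1, 1, 11, 2), "request timeout"),
]
hourly = partition_logs(logs)
```

The two entries logged during the 10:00 hour land in the same partition, while the 11:02 entry starts a new one.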

Apache Cassandra Data Partitions

Apache Cassandra is a NoSQL database that belongs to the big data family of applications. It operates as a distributed system and uses the principle of data partitioning explained above. Partitioning is performed by a partitioning algorithm configured at the cluster level, while the partition key is configured at the table level.


The Cassandra Query Language (CQL) borrows the SQL terminology of tables, rows, and columns. A table is configured with the ‘partition key’ as a component of its primary key. Let’s take a deeper look at how the primary key is used in the context of a Cassandra cluster.

Primary Key = Partition Key + [Clustering Columns]

A primary key in Cassandra identifies both a unique data partition and the arrangement of data within that partition. The optional clustering columns handle the arrangement. Each unique partition key value identifies a set of rows in a table that is managed by a single server, along with the servers that hold its replicas.

A primary key has the following CQL syntax representations:

Definition 1:

CREATE TABLE server_logs(
   log_hour timestamp PRIMARY KEY,
   log_level text,
   message text,
   server text
   );

partition key: log_hour 

clustering columns: none

Definition 2:

CREATE TABLE server_logs(
   log_hour timestamp,
   log_level text,
   message text,
   server text,
   PRIMARY KEY (log_hour, log_level)
   );

partition key: log_hour 

clustering columns: log_level

Definition 3:

CREATE TABLE server_logs(
   log_hour timestamp,
   log_level text,
   message text,
   server text,
   PRIMARY KEY ((log_hour, server))
   );

partition key: log_hour, server

clustering columns: none

Definition 4:

CREATE TABLE server_logs(
   log_hour timestamp,
   log_level text,
   message text,
   server text,
   PRIMARY KEY ((log_hour, server), log_level)
   ) WITH CLUSTERING ORDER BY (log_level DESC);

partition key: log_hour, server

clustering columns: log_level

The set of rows sharing a partition key is generally referred to as a ‘partition’.

  • Definition 1 places all the data sharing a ‘log_hour’ in a single partition. Because the primary key has no clustering columns, each partition holds only one row.
  • Definition 2 has the same partition key as Definition 1, but the rows in each partition are arranged in ascending order of ‘log_level’.
  • Definition 3 places all the data sharing a ‘log_hour’ for each distinct ‘server’ in a single partition. As in Definition 1, each partition holds only one row.
  • Definition 4 has the same partition key as Definition 3, but it arranges the rows within each partition in descending order of ‘log_level’.
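
The effect of the partition key and clustering columns can be modeled with a small Python sketch. It is a toy in-memory model under stated assumptions, not Cassandra's storage engine: `build_partitions` and the sample rows are hypothetical, and a later row with the same full primary key simply overwrites the earlier one, mimicking how a CQL write with an existing primary key replaces the row.

```python
def build_partitions(rows, partition_cols, clustering_cols=(), descending=False):
    """Group rows by partition key and order each partition by clustering key."""
    partitions = {}
    for row in rows:
        pk = tuple(row[c] for c in partition_cols)
        ck = tuple(row[c] for c in clustering_cols)
        # Same partition key + clustering key means same primary key: overwrite.
        partitions.setdefault(pk, {})[ck] = row
    return {
        pk: [part[ck] for ck in sorted(part, reverse=descending)]
        for pk, part in partitions.items()
    }


# Hypothetical sample rows, for illustration only.
rows = [
    {"log_hour": "2024-01-01 10:00", "server": "web-1", "log_level": "INFO", "message": "started"},
    {"log_hour": "2024-01-01 10:00", "server": "web-1", "log_level": "ERROR", "message": "disk full"},
    {"log_hour": "2024-01-01 10:00", "server": "web-2", "log_level": "WARN", "message": "slow query"},
    {"log_hour": "2024-01-01 11:00", "server": "web-1", "log_level": "INFO", "message": "recovered"},
]

# Definition 2: partition by log_hour, cluster by log_level ascending.
definition_2 = build_partitions(rows, ["log_hour"], ["log_level"])

# Definition 4: partition by (log_hour, server), cluster by log_level descending.
definition_4 = build_partitions(
    rows, ["log_hour", "server"], ["log_level"], descending=True
)
```

With these rows, the Definition 2 layout yields two partitions (one per hour), while the Definition 4 layout yields three, because the 10:00 hour is split across `web-1` and `web-2`.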

Cassandra read and write operations are performed using a partition key on a table. Cassandra uses ‘tokens’ (long values in the range -2^63 to 2^63 - 1) for data distribution and indexing. A ‘partitioner’ maps partition keys to tokens by applying a partitioning function that converts any given partition key to a token. Each node in a Cassandra cluster owns a set of data partitions through this token mechanism. The data is then indexed on each node with the help of the partition key. The takeaway here is that Cassandra uses the partition key to determine which node to store data on and where to find the data when it’s needed.
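
The token mechanism can be illustrated with a minimal Python sketch. Several things here are assumptions made for the example: the hash is a stand-in built from SHA-256 rather than Cassandra's actual partitioner, the three node names and their tokens are hypothetical, and ownership is modeled as "the first node whose token is greater than or equal to the key's token, wrapping around the ring".

```python
import bisect
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Stand-in hash (first 8 bytes of SHA-256), not Cassandra's real partitioner.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Hypothetical 3-node ring with evenly spaced tokens.
RING = sorted([
    (-9223372036854775808, "node-a"),
    (-3074457345618258603, "node-b"),
    (3074457345618258602, "node-c"),
])
TOKENS = [token for token, _ in RING]


def owner_of(token: int) -> str:
    """Return the node owning the range that ends at the first token >= `token`."""
    index = bisect.bisect_left(TOKENS, token)
    # Tokens above the highest node token wrap around to the first node.
    return RING[index % len(RING)][1]


def node_for_key(partition_key: str) -> str:
    """Route a partition key to its owning node via its token."""
    return owner_of(token_for(partition_key))
```

Because the same key always hashes to the same token, `node_for_key` gives the same answer for writes and for later reads, which is the property the routing relies on.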

The diagram below shows a Cassandra cluster with 3 nodes and token-based ownership.

Cassandra Partitions - Representation of Tokens

*This is a simplified representation of tokens; the actual implementation uses vnodes (virtual nodes).
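
To give a rough idea of what vnodes change, the sketch below assigns each node several tokens instead of one, so every node owns many small token ranges scattered around the ring. It is a simplified illustration: the random token assignment, the seed, the node names, and the count of 8 tokens per node are all assumptions chosen for the example, not Cassandra's allocation algorithm or defaults.

```python
import bisect
import random


def build_vnode_ring(nodes, tokens_per_node, seed=42):
    """Give every node several randomly placed tokens (illustrative only)."""
    rng = random.Random(seed)
    ring = [
        (rng.randint(-2**63, 2**63 - 1), node)
        for node in nodes
        for _ in range(tokens_per_node)
    ]
    ring.sort()
    return ring


def owner_of(ring, token):
    """Return the node holding the first vnode token >= `token`, with wrap-around."""
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token)
    return ring[index % len(ring)][1]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
ring = build_vnode_ring(nodes, tokens_per_node=8)
ranges_per_node = {n: sum(1 for _, owner in ring if owner == n) for n in nodes}
```

Each node ends up with 8 token ranges here rather than a single contiguous one, which is the main difference from the simplified diagram.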