Apache Cassandra tutorial: Cheat sheet, basic setup and vector search
Apache Cassandra is a distributed NoSQL database for handling large amounts of data across commodity servers without a single point of failure.
What is Apache Cassandra?
Apache Cassandra is a distributed NoSQL database for handling large amounts of data across commodity servers without a single point of failure. It offers high availability and is optimized for write-heavy workloads. Developed initially by Facebook and open-sourced in 2008, it has since grown into an ecosystem used by companies like Netflix, Apple, and Twitter. Its scalability allows it to handle petabytes of data effortlessly, making it a preferred choice for modern applications that require fast data access and reliability.
Cassandra uses a masterless architecture where all nodes are equal and participate in a peer-to-peer protocol. This setup ensures that no single server acts as a bottleneck. It supports dynamic scaling; new nodes can be added or removed with minimal service disruption. Its data model combines key-value pairs with tables, enabling flexible schema designs. The main features include tunable consistency levels, efficient writes, and decentralized control, making it a suitable database solution for large-scale, distributed systems.
How does Apache Cassandra work?
Architecture
Apache Cassandra’s architecture is based on a decentralized, masterless design, where all nodes play equal roles in the cluster. This peer-to-peer structure enables horizontal scalability and eliminates single points of failure. Data in Cassandra is partitioned and distributed across multiple nodes using a consistent hashing mechanism.
Each node is responsible for a portion of the data, and new nodes can be added without significant disruption. Data replication ensures fault tolerance, with configurable replication factors allowing multiple copies of the same data to exist on different nodes. This architecture supports linear scalability and ensures high availability even in the event of node failures.
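For example, the number of copies is set per keyspace when it is created. A minimal sketch, assuming a hypothetical keyspace named demo_ks and a datacenter named dc1 (neither is part of the tutorial below):

-- Keep three replicas of every row in datacenter dc1
CREATE KEYSPACE IF NOT EXISTS demo_ks
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3 };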
Data model
Cassandra’s data model is centered around tables that store rows and columns, similar to traditional relational databases, but with significant differences. In Cassandra, data modeling is query-driven, which means that the schema is designed to optimize specific queries.
Unlike relational databases that use joins across tables, Cassandra denormalizes data by duplicating it across multiple tables, enabling high-performance reads. This approach relies on a primary key structure where the partition key distributes data across nodes and clustering keys determine the sort order within a partition. Denormalization ensures efficient data retrieval by minimizing the need to reference multiple tables.
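As a sketch of query-driven design (the table below is hypothetical and not used later in this tutorial), a query such as "fetch a user's recent activity, newest first" maps directly to one table:

-- userid is the partition key: it decides which nodes store the rows
-- activity_time is a clustering key: it sorts rows within the partition
CREATE TABLE IF NOT EXISTS activity_by_user (
    userid text,
    activity_time timestamp,
    action text,
    PRIMARY KEY (userid, activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);

-- The matching query reads a single partition in sorted order
SELECT action, activity_time FROM activity_by_user WHERE userid = 'u42' LIMIT 10;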
Consistency and availability
Cassandra also offers tunable consistency, allowing users to choose between eventual and strong consistency based on their requirements. For example, clients can adjust how many nodes need to acknowledge a read or write operation before it is considered successful, offering a balance between performance and data accuracy.
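In cqlsh, for example, the session-wide level is changed with the CONSISTENCY command from the cheat sheet below; a minimal sketch:

CONSISTENCY
CONSISTENCY QUORUM

The first form displays the current level; the second requires a majority of replicas to acknowledge each subsequent read or write. ONE, QUORUM, and ALL are common choices, and drivers expose the same setting per statement.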
To maintain read efficiency and manage disk space, Cassandra performs compaction, merging SSTables and discarding deleted data. Nodes communicate using a gossip protocol, sharing state information to detect and handle node failures dynamically. This decentralized communication lets the system adapt to changes.
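Compaction behavior is configured per table; a minimal sketch, assuming a hypothetical table named events:

-- Leveled compaction keeps reads cheap at the cost of more compaction I/O
ALTER TABLE events
WITH compaction = { 'class' : 'LeveledCompactionStrategy' };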
Usage
cqlsh is the command-line shell for interacting with Apache Cassandra using the Cassandra Query Language (CQL). It allows users to execute queries, manage the database schema, and perform administrative tasks like inspecting table structures or modifying keyspaces. cqlsh is essential for operations like querying data, creating or altering tables, and managing clusters in a flexible and interactive environment.
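A few of the commands from the cheat sheet below in a typical session (output omitted):

DESCRIBE KEYSPACES;
DESCRIBE TABLE system.local;
TRACING ON;
PAGING 50;

DESCRIBE inspects schema objects, TRACING attaches per-request timing details to each result, and PAGING controls how many rows are fetched at a time.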
Cassandra is optimized for high write throughput through its unique write path. When a write request is received, it is first recorded in the commit log to ensure durability. The data is then written to an in-memory structure called a memtable. Once the memtable reaches a threshold, it is flushed to disk into immutable files called SSTables (sorted string tables).
Apache Cassandra commands cheat sheet
cqlsh shell commands
| Command | Description |
|---|---|
| HELP | Displays help topics for cqlsh commands. |
| CAPTURE | Captures output and writes it to a file. |
| CONSISTENCY | Displays or sets the consistency level. |
| COPY | Copies data to or from Cassandra. |
| DESCRIBE | Provides information about the cluster and objects. |
| EXPAND | Expands query results vertically. |
| EXIT | Exits the cqlsh shell. |
| PAGING | Enables or disables paging of query results. |
| SHOW | Displays details of the current cqlsh session. |
| SOURCE | Executes a file containing CQL statements. |
| TRACING | Enables or disables request tracing. |
Data definition commands (DDL)
| Command | Description |
|---|---|
| CREATE KEYSPACE | Creates a new keyspace. |
| USE | Switches to the specified keyspace. |
| ALTER KEYSPACE | Modifies the properties of a keyspace. |
| DROP KEYSPACE | Deletes a keyspace. |
| CREATE TABLE | Creates a new table in the keyspace. |
| ALTER TABLE | Modifies a table’s schema. |
| DROP TABLE | Deletes a table. |
| TRUNCATE | Removes all data from a table. |
| CREATE INDEX | Creates an index on a column. |
| DROP INDEX | Deletes an index. |
Data manipulation commands (DML)
| Command | Description |
|---|---|
| INSERT | Adds a new row or updates a row in a table. |
| UPDATE | Updates specific columns in a row. |
| DELETE | Removes data from a table. |
| BATCH | Executes multiple DML statements in a single operation. |
CQL clauses
| Clause | Description |
|---|---|
| SELECT | Retrieves data from a table. |
| WHERE | Filters the results of a SELECT query. |
| ORDER BY | Sorts query results in a specified order. |
Tips from the expert
Ritam Das
Solution Architect
Ritam Das is a trusted advisor with a proven track record in translating complex business problems into practical technology solutions, specializing in cloud computing and big data analytics.
In my experience, here are tips that can help you better adapt to using Apache Cassandra:
- Optimize partition key design: Design your partition keys to avoid hotspots. Ensure even data distribution by considering access patterns and query frequency. A well-thought-out partition key can significantly reduce read and write latencies (see the sketch after this list).
- Use materialized views cautiously: Materialized views can simplify query logic but may introduce performance overhead. Always monitor the performance impact and consider alternatives like secondary indexes or denormalization for complex queries.
- Monitor compaction strategies: Choose the right compaction strategy for your workload (e.g., Leveled Compaction for read-heavy and Size-Tiered Compaction for write-heavy scenarios). Regularly monitor compaction metrics to prevent disk space issues and optimize performance; the right strategy is also key to reclaiming disk space.
- Pre-split keyspaces for large datasets: Pre-splitting keyspaces can help avoid write amplification issues as the dataset grows. By anticipating future growth, you reduce the need for heavy compactions and rebalancing operations later on.
- Optimize read paths with caching: Leverage the row cache or key cache depending on your access patterns. For frequently read small datasets, row cache can dramatically reduce read latency, while key cache improves index lookups for larger datasets.
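To make the first tip concrete, here is a sketch of a compound partition key that buckets a write-heavy entity by day so no single partition (and no single set of replicas) becomes a hotspot; the table and columns are hypothetical:

-- Partitioning on (sensor_id, day) bounds partition size and spreads load
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id text,
    day date,
    reading_time timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);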
Getting started with Apache Cassandra
Setting up Apache Cassandra using Docker is straightforward and helps you quickly spin up a development environment. These instructions are adapted from the official quick start guide.
Step 1: Get Cassandra using Docker
First, ensure you have Docker Desktop installed on your machine. You can pull the latest Cassandra image from Docker Hub using the following command:
docker pull cassandra:latest
Step 2: Start Cassandra
Create a Docker network to allow access to the container’s ports without exposing them on the host:
docker network create cassandra
Then, start a Cassandra container:
docker run --rm -d --name cassandra --hostname cassandra --network cassandra cassandra
Step 3: Create a CQL script
Next, create a CQL script file named data.cql. The Cassandra Query Language (CQL) is similar to SQL but optimized for Cassandra’s distributed architecture. The following script creates a keyspace and a table, then inserts some data:
-- Create a keyspace
CREATE KEYSPACE IF NOT EXISTS store
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
-- Create a table
CREATE TABLE IF NOT EXISTS store.user_activity (
    userid text PRIMARY KEY,
    item_count int,
    last_update_timestamp timestamp
);
-- Insert some data
INSERT INTO store.user_activity (userid, item_count, last_update_timestamp)
VALUES ('776', 2, toTimeStamp(now()));
INSERT INTO store.user_activity (userid, item_count, last_update_timestamp)
VALUES ('365', 5, toTimeStamp(now()));
Save this script in the data.cql file.
Alternatively, you can use cqlsh interactively to run CQL commands. Launch the interactive shell with:
docker run --rm -it --network cassandra nuvo/docker-cqlsh cqlsh cassandra 9042 --cqlversion='3.4.5'
Note: If you have installed Cassandra directly on your server, you can launch cqlsh from the command line with the following command. This gives you a prompt where you can execute CQL commands directly.
<Cassandra-installation-folder>/bin/cqlsh
Step 4: Load data with CQLSH
Use the CQL shell (cqlsh) to load the data into Cassandra. Run the following command to load the script:
docker run --rm --network cassandra -v "$(pwd)/data.cql:/scripts/data.cql" -e CQLSH_HOST=cassandra -e CQLSH_PORT=9042 -e CQLVERSION=3.4.6 nuvo/docker-cqlsh
Note: If you have installed Cassandra directly on your server, you can load the script from the cqlsh prompt using the following command (assuming data.cql exists in the same folder):
SOURCE 'data.cql';
Step 5: Read some data
To read data from the table, execute the following query in the cqlsh shell:
SELECT * FROM store.user_activity;
Step 6: Write more data
Insert additional data into the table using:
INSERT INTO store.user_activity (userid, item_count) VALUES ('842', 20);
Step 7: Clean up
After you are done, clean up by stopping the Cassandra container and removing the Docker network:
docker kill cassandra
docker network rm cassandra
Tutorial: Apache Cassandra and vector search
Now let’s take a look at a more advanced case study. Vector search in Apache Cassandra allows for efficient similarity queries on high-dimensional data, often used in machine learning and recommendation systems. This tutorial will guide you through setting up vector search using CQL (Cassandra Query Language).
Step 1: Create vector keyspace
First, create a keyspace to store your vector data. The following CQL command sets up a keyspace named ai_tweets:
CREATE KEYSPACE IF NOT EXISTS ai_tweets
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
Step 2: Use vector keyspace
Select the ai_tweets keyspace for the subsequent operations:
USE ai_tweets;
Step 3: Create vector table
Next, create a table to store vector data. The table includes a tweet_vector column, which holds the vector values:
CREATE TABLE IF NOT EXISTS ai_tweets.tweets_vs (
    record_id timeuuid,
    id uuid,
    user_handle text,
    tweet text,
    tweet_vector VECTOR <FLOAT, 5>,
    created_at timestamp,
    PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
You can use the DESCRIBE TABLES command to list all tables in the current keyspace.
Alternatively, you can add a vector column to an existing table:
ALTER TABLE ai_tweets.tweets_vs
ADD tweet_vector VECTOR <FLOAT, 5>;
Step 4: Create vector index
Create an index on the tweet_vector column using storage attached indexing (SAI):
CREATE INDEX IF NOT EXISTS ann_index
ON ai_tweets.tweets_vs(tweet_vector) USING 'sai';
You can specify the similarity function in the index options:
CREATE INDEX IF NOT EXISTS ann_index
ON ai_tweets.tweets_vs(tweet_vector) USING 'sai'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
The valid values for similarity_function are DOT_PRODUCT, COSINE, and EUCLIDEAN.
Step 5: Load vector data
Insert sample data into the tweets_vs table. Each vector consists of five float values:
INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2023-08-14 12:43:20-0800', 'AI is transforming industries at an unprecedented rate.', 'tech_guru', [0.45, 0.09, 0.01, 0.2, 0.11]);

INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2023-08-15 13:11:09.999-0800', 'The future of AI is in decentralized systems.', 'tech_guru', [0.99, 0.5, 0.99, 0.1, 0.34]);

INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), e7ae5cf3-d358-4d99-b900-85902fda9bb0, '2023-08-16 06:33:02.16-0800', 'Generative AI models are the next big leap.', 'tech_guru', [0.9, 0.54, 0.12, 0.1, 0.95]);

INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, totimestamp(now()), 'AI ethics should be a top priority.', 'ai_advocate', [0.13, 0.8, 0.35, 0.17, 0.03]);

INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2023-08-17 12:43:20.234+0400', 'Excited about the advancements in AI healthcare.', 'ai_advocate', [0.3, 0.34, 0.2, 0.78, 0.25]);

INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2023-08-18 5:16:59.001+0400', 'AI in finance is revolutionizing trading.', 'ai_advocate', [0.1, 0.4, 0.1, 0.52, 0.09]);

INSERT INTO ai_tweets.tweets_vs (record_id, id, created_at, tweet, user_handle, tweet_vector)
VALUES (now(), c7fceba0-c141-4207-9494-a29f9809de6f, '2023-08-19 17:43:08.030+0400', 'AI-generated content is becoming indistinguishable from human-created content.', 'ai_advocate', [0.3, 0.75, 0.2, 0.2, 0.5]);
You can execute the following query to check that the data was inserted correctly:
SELECT * FROM ai_tweets.tweets_vs;
Step 6: Query vector data
Query the vector data using a similarity function, such as ANN (Approximate Nearest Neighbors):
SELECT * FROM ai_tweets.tweets_vs
ORDER BY tweet_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
To include the similarity calculation in the results:
SELECT tweet, similarity_cosine(tweet_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
FROM ai_tweets.tweets_vs
ORDER BY tweet_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
LIMIT 1;
The supported similarity functions are similarity_dot_product, similarity_cosine, and similarity_euclidean.
By following these steps, you can set up and utilize vector search in Apache Cassandra, enabling efficient and scalable similarity searches within your database.
NetApp Instaclustr: Simplifying Apache Cassandra deployment and management
Apache Cassandra has emerged as a popular choice for managing large-scale, distributed databases due to its scalability, fault tolerance, and high performance. However, deploying and managing Cassandra clusters can be complex and time-consuming. That’s where NetApp Instaclustr comes in. NetApp Instaclustr simplifies Apache Cassandra deployment and management, empowering organizations to leverage the full potential of this powerful NoSQL database.
NetApp Instaclustr takes the hassle out of deploying Apache Cassandra clusters by providing a streamlined and automated process. With just a few clicks, organizations can create and configure Cassandra clusters in the cloud, eliminating the need for manual setup and reducing deployment time significantly. This allows developers to focus on their applications rather than spending valuable time on infrastructure management.
NetApp Instaclustr handles the underlying infrastructure for Apache Cassandra, ensuring high availability, fault tolerance, and data replication. It takes care of tasks such as cluster provisioning, software updates, and security patches, relieving organizations of the operational burden. This managed infrastructure approach allows businesses to scale their Cassandra clusters seamlessly and focus on building robust and scalable applications.
One of the key advantages of Apache Cassandra is its ability to scale horizontally to handle massive workloads. NetApp Instaclustr fully leverages this capability by providing seamless scalability for Cassandra clusters. As data volumes grow, organizations can easily add or remove nodes to meet their changing needs, ensuring optimal performance and responsiveness. This scalability empowers businesses to handle increasing data demands without sacrificing performance.
NetApp Instaclustr places a strong emphasis on data security and compliance. It offers robust security features, including encryption at rest and in transit, role-based access control, and integration with various identity providers. This ensures that sensitive data stored in Apache Cassandra remains protected from unauthorized access. Additionally, NetApp Instaclustr helps organizations meet regulatory compliance requirements by providing audit logs and facilitating data governance.
NetApp Instaclustr provides comprehensive monitoring and support services for Apache Cassandra clusters. It offers real-time monitoring capabilities, allowing organizations to gain insights into cluster performance, resource utilization, and potential bottlenecks. In case of any issues or failures, NetApp Instaclustr’s support team is readily available to provide assistance and ensure minimal downtime.
Managing and maintaining Apache Cassandra clusters can be costly, especially when it comes to infrastructure and operational expenses. NetApp Instaclustr helps optimize costs by offering flexible pricing models. Organizations can choose from pay-as-you-go options, allowing them to scale resources based on demand and pay only for what they use. This eliminates the need for upfront investments in infrastructure and provides cost predictability.
For more information: