In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.
But with the release of Cassandra 5, things become much simpler.
Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.
Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.
The power of vector search in Cassandra 5
Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.
The key to this functionality lies in embeddings: arrays of floating-point numbers that represent the similarity of objects. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.
How vectors work
Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.
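As an illustration, here is a minimal, hypothetical table (the names and the 3-dimensional size are made up for brevity; real embeddings are far larger) showing that a vector is always written as a whole:

```sql
-- Hypothetical 3-dimensional example
CREATE TABLE demo.items (
    id int PRIMARY KEY,
    embedding vector<float, 3>
);

INSERT INTO demo.items (id, embedding) VALUES (1, [0.1, 0.2, 0.3]);

-- There is no element-level update such as "embedding[1] = 0.9";
-- to change one value, the entire vector is rewritten:
UPDATE demo.items SET embedding = [0.1, 0.9, 0.3] WHERE id = 1;
```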
When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
Storage-Attached Indexing (SAI): The engine behind vector search
Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.
SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.
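For example, the same SAI syntax can index a regular text column. The index name and query below are hypothetical, reusing the embeddings table described later in this post:

```sql
-- Hypothetical example: SAI index on an ordinary text column
CREATE INDEX IF NOT EXISTS filename_index
ON aisearch.embeddings (filename)
USING 'sai';

-- The index then supports efficient filtering on a non-key column:
SELECT * FROM aisearch.embeddings
WHERE filename = 'Cassandra-Best-Practices.pdf';
```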
Example: Performing similarity search with Cassandra 5’s vector data type
Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.
Step 1: Setting up the embeddings table
To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.
```sql
CREATE KEYSPACE aisearch
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS embeddings (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    embeddings vector<float, 300>,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

CREATE INDEX IF NOT EXISTS ann_index
ON embeddings (embeddings)
USING 'sai';
```
This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.
You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:
```sql
CREATE INDEX IF NOT EXISTS ann_index
ON embeddings (embeddings)
USING 'sai'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```
Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
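To build intuition for how the three functions differ, they can be sketched in plain Python with NumPy. This is a toy illustration of the math, not Cassandra's internal implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means identical direction, regardless of magnitude."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    """Sensitive to both the direction and the magnitude of the vectors."""
    return float(np.dot(np.asarray(a, dtype=float), np.asarray(b, dtype=float)))

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]  # same direction as v1, twice the magnitude

print(cosine_similarity(v1, v2))   # 1.0 -> same direction, magnitude ignored
print(dot_product(v1, v2))         # 28.0 -> grows with magnitude
print(euclidean_distance(v1, v2))  # ~3.742 -> nonzero despite identical direction
```

Note how the scaled copy `v2` is a perfect match under COSINE but not under EUCLIDEAN, which is exactly the kind of difference to weigh when choosing a similarity function.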
Step 2: Inserting embeddings into Cassandra 5
To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:
```python
import time
from uuid import uuid4, UUID

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider

# Connect to the cluster
cluster = Cluster(
    # Replace with your node IP addresses
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    # Update the local data centre if needed
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()
print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None,
                                  filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()

        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Prepare and execute the query
        prepared = session.prepare(insert_query)
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion
    except Exception as e:
        return f"Failed to execute query:\nError: {str(e)}"  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None, filename=None,
                      text=None, keyspace_name=None, max_retries=3, retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid,
                                               filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        retry_count += 1
        print(f"Insertion failed on attempt {retry_count} with error: {result}")
        if retry_count < max_retries:
            time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
keyspace_name = "aisearch"

# extract_text_with_page_number_and_embeddings() comes from Part 1 of this series
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
```
This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.
Step 3: Performing similarity searches in Cassandra 5
Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:
```python
import numpy as np
from IPython.display import display, HTML

# ------------------ Embedding Functions ------------------

def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    embedding_list = list(map(float, input_embedding))
    input_embedding_str = ', '.join(map(str, embedding_list))

    # ANN query: ORDER BY ... ANN OF asks SAI for the approximate nearest neighbours
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """

    prepared = session.prepare(query)
    rows = session.execute(prepared.bind((embedding_list,)))

    # Sort the results by similarity in Python
    similar_texts = sorted(
        [(row.similarity, row.filename, row.text) for row in rows],
        key=lambda x: x[0],
        reverse=True
    )
    return similar_texts[:top_k]

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)
```
This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.
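To make the HNSW idea concrete, here is a deliberately simplified single-layer sketch of its greedy navigation step. The points, graph, and entry point are invented for illustration; real HNSW builds a multi-layer graph automatically and uses a beam search to avoid local optima:

```python
import numpy as np

# Toy dataset: a few 2-D points standing in for high-dimensional embeddings
points = {
    "a": np.array([0.0, 0.0]),
    "b": np.array([1.0, 0.0]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([5.0, 5.0]),
    "e": np.array([6.0, 5.0]),
}

# A hand-built proximity graph; HNSW constructs this structure itself, in layers
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e"],
    "e": ["d"],
}

def greedy_search(query, entry="a"):
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    current_dist = np.linalg.norm(points[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = np.linalg.norm(points[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:  # no neighbour is closer: stop at this local optimum
            return current
        current, current_dist = best, best_dist

print(greedy_search(np.array([6.2, 5.1])))  # "e", reached via a -> d -> e
```

Because each step only examines the neighbours of the current node, the search touches a small fraction of the dataset, which is why ANN queries stay fast as the table grows.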
Step 4: Displaying the results
To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:
```python
# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))
```
This code will display the top similar texts, along with their similarity scores and associated file names.
Cassandra 5 vs. Cassandra 4 + OpenSearch®
Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.
Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.
| Feature | Cassandra 4 + OpenSearch | Cassandra 5 (Preview) |
|---|---|---|
| Embedding Storage | OpenSearch | Native VECTOR Data Type |
| Similarity Search | KNN Plugin in OpenSearch | COSINE, EUCLIDEAN, DOT_PRODUCT |
| Search Method | Exact K-Nearest Neighbor | Approximate Nearest Neighbor (ANN) |
| System Complexity | Requires two systems | All-in-one Cassandra solution |
Conclusion: A simpler path to similarity search with Cassandra 5
With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.
Coming up: future blogs in this series will work through more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5!
Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!