Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

scatter plot graph

For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”) offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

Create the Table:
Next, create a table to store the embeddings. This table will hold details such as the embedding vector, related text, and metadata:CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {‘class’: ‘SimpleStrategy’, ‘

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy',    	'
replication_factor': 3};

USE file_metadata;
 
DROP TABLE IF EXISTS file_metadata; 
    CREATE TABLE IF NOT EXISTS file_metadata ( 
      id UUID, 
      paragraph_uuid UUID, 
      filename TEXT, 
      text TEXT, 
      last_updated timestamp, 
      PRIMARY KEY (id, paragraph_uuid) 
    );

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', '

replication_factor': 3};

USE file_metadata;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (

id UUID,

paragraph_uuid UUID,

filename TEXT,

text TEXT,

last_updated timestamp,

PRIMARY KEY (id, paragraph_uuid)

);

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

Create the index:
Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.

{ 
  "settings": { 
   "index": { 
     "number_of_shards": 2, 
      "knn": true, 
      "knn.space_type": "l2" 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "file_uuid": { 
        "type": "keyword" 
      }, 
      "paragraph_uuid": { 
        "type": "keyword" 
      }, 
      "embedding": { 
        "type": "knn_vector", 
        "dimension": 300 
      } 
    } 
  } 
}

{

"settings": {

"index": {

"number_of_shards": 2,

"knn": true,

"knn.space_type": "l2"

}

"mappings": {

"properties": {

"file_uuid": {

"type": "keyword"

"paragraph_uuid": {

"type": "keyword"

"embedding": {

"type": "knn_vector",

"dimension": 300

}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = <a href="https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip">'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'</a> 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

import requests

import zipfile

import os

# Adjust file_url and local_filename variables accordingly

file_url = <a href="https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip">'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'</a>

local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip'

extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/'

def download_file(url, filename):

with requests.get(url, stream=True) as r:

r.raise_for_status()

os.makedirs(os.path.dirname(filename), exist_ok=True)

with open(filename, 'wb') as f:

for chunk in r.iter_content(chunk_size=8192):

f.write(chunk)

def unzip_file(filename, extract_to):

with zipfile.ZipFile(filename, 'r') as zip_ref:

zip_ref.extractall(extract_to)

# Download and extract

download_file(file_url, local_filename)

unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly:

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

from gensim.models import KeyedVectors

# Adjust model_path variable accordingly

model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"

fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

import numpy as np

import re

def text_to_vector(text):

"""Convert text into a vector using the FastText model."""

text = text.lower()

words = re.findall(r'\b\w+\b', text)

vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]

if not vectors:

print(f"No embeddings found for text: {text}")

return np.zeros(fasttext_model.vector_size)

return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path): 
    xls = pd.ExcelFile(excel_path) 
    text_list = [] 

for sheet_index, sheet_name in enumerate(xls.sheet_names): 
        df = xls.parse(sheet_name) 
        for row in df.iterrows(): 
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it 1 based index 

return text_list 

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

import uuid

import mimetypes

import pandas as pd

from pdfminer.high_level import extract_pages

from pdfminer.layout import LTTextContainer

from docx import Document

from pptx import Presentation

def generate_deterministic_uuid(name):

return uuid.uuid5(uuid.NAMESPACE_DNS, name)

def generate_random_uuid():

return uuid.uuid4()

def get_file_type(file_path):

# Guess the MIME type based on the file extension

mime_type, _ = mimetypes.guess_type(file_path)

return mime_type

def extract_text_from_excel(excel_path):

xls = pd.ExcelFile(excel_path)

text_list = []

for sheet_index, sheet_name in enumerate(xls.sheet_names):

df = xls.parse(sheet_name)

for row in df.iterrows():

text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1)) # +1 to make it 1 based index

return text_list

def extract_text_from_pdf(pdf_path):

return [(text_line.get_text().strip().replace('\xa0', ' '), page_num)

for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1)

for element in page_layout if isinstance(element, LTTextContainer)

for text_line in element if text_line.get_text().strip()]

def extract_text_from_word(file_path):

doc = Document(file_path)

return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()]

def extract_text_from_txt(file_path):

with open(file_path, 'r') as file:

return [(line.strip(), 1) for line in file.readlines() if line.strip()]

def extract_text_from_pptx(pptx_path):

prs = Presentation(pptx_path)

return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1)

for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()]

def extract_text_with_page_number_and_embeddings(file_path, embedding_function):

file_uuid = generate_deterministic_uuid(file_path)

file_type = get_file_type(file_path)

extractors = {

'text/plain': extract_text_from_txt,

'application/pdf': extract_text_from_pdf,

'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word,

'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx,

'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [],

'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel,

'application/vnd.ms-excel': extract_text_from_excel

}

text_list = extractors.get(file_type, lambda _: [])(file_path)

return [

{

"uuid": file_uuid,

"paragraph_uuid": generate_random_uuid(),

"filename": file_path,

"text": text,

"page_num": page_num,

"embedding": embedding

}

for text, page_num in text_list

if (embedding := embedding_function(text)).any() # Check if the embedding is not all zeros

]

# Replace the file path with the one you want to process

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"

paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch and the file itself in filesystem.

The following code demonstrates how to insert this metadata into Cassandra and embeddings in OpenSearch, make sure to run the previous script, so the “paragraphs_with_embeddings” variable will be populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None: 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

from tqdm import tqdm

# Function to insert data into both Cassandra and OpenSearch

def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name):

# Insert into Cassandra

cassandra_result = insert_with_retry(

session=session,

id=paragraph['uuid'],

paragraph_uuid=paragraph['paragraph_uuid'],

text=paragraph['text'],

filename=paragraph['filename'],

keyspace_name=keyspace_name,

max_retries=3,

retry_delay_seconds=1

)

if not cassandra_result:

return False # Stop further processing if Cassandra insertion fails

# Insert into OpenSearch

opensearch_result = insert_embedding_to_opensearch(

os_client=os_client,

index_name=index_name,

file_uuid=paragraph['uuid'],

paragraph_uuid=paragraph['paragraph_uuid'],

embedding=paragraph['embedding']

)

if opensearch_result is not None:

return False # Return False if OpenSearch insertion fails

return True # Return True on success for both

# Process each paragraph with a progress bar

print("Starting batch insertion of paragraphs.")

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):

if not insert_paragraph_data(

session=session,

os_client=os_client,

paragraph=paragraph,

keyspace_name=keyspace_name,

index_name=index_name

print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

        response = os_client.search(index=index_name, body=query) 

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, 
keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

import uuid

from IPython.display import display, HTML

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5):

"""Search for similar embeddings in OpenSearch and return the associated UUIDs."""

query = {

"size": top_k,

"query": {

"knn": {

"embedding": {

"vector": input_embedding.tolist(),

"k": top_k

}

response = os_client.search(index=index_name, body=query)

similar_uuids = []

for hit in response['hits']['hits']:

file_uuid = hit['_source']['file_uuid']

paragraph_uuid = hit['_source']['paragraph_uuid']

similar_uuids.append((file_uuid, paragraph_uuid))

return similar_uuids

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name):

"""Fetch the metadata (text and filename) from Cassandra based on UUIDs."""

file_uuid = uuid.UUID(file_uuid)

paragraph_uuid = uuid.UUID(paragraph_uuid)

query = f"""

SELECT text, filename

FROM {keyspace_name}.file_metadata

WHERE id = ? AND paragraph_uuid = ?;

"""

prepared = session.prepare(query)

bound = prepared.bind((file_uuid, paragraph_uuid))

rows = session.execute(bound)

for row in rows:

return row.filename, row.text

return None, None

# Input text to find similar embeddings

input_text = "place"

# Convert input text to embedding

input_embedding = text_to_vector(input_text)

# Find similar embeddings in OpenSearch

similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10)

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch

for file_uuid, paragraph_uuid in similar_uuids:

filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid,

keyspace_name)

if filename and text:

html_content = f"""

File UUID: {file_uuid}

Paragraph UUID: {paragraph_uuid}

Text: {text}

File: {filename}

</div>

<hr/>

"""

display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.