Apache Cassandra is an open-source, non-relational (NoSQL) distributed database that enables continuous availability, tremendous scale, and data distribution across multiple data centers and cloud availability zones. Simply put, it provides a highly reliable data storage engine for applications requiring immense scale.
Data modeling is a process used to analyze, organize, and understand the data requirements for a product or service. Data modeling creates the structure your data will live in. It defines how things are labeled and organized, and determines how your data can and will be used. The process of data modeling is similar to designing a house. You start with a conceptual model and add detail to produce the final blueprint.
The ultimate goal of Cassandra data modeling and analysis is to develop a complete, well-organized, and high-performance Cassandra cluster. Following the five Cassandra data modeling best practices outlined below will help you meet that goal:
Five Best Practices for Using Apache Cassandra
- Don’t try to use Cassandra like a relational database
- Design your model around three data distribution goals
- Understand the importance of the Primary Key in your data structure
- Model around your queries
- Conduct testing to ensure the performance of your model.
Cassandra Is Not a Relational Database
Do not try to design a Cassandra data model as you would a relational one. Keep the following differences in mind:
Query-first design: You must define how you plan to access the tables at the beginning of the data modeling process, not towards the end.
No joins or derived tables: Tables cannot be joined, so if a query needs data from more than one table, that data must be merged into a single denormalized table.
Denormalization: Because Cassandra does not support joins or derived tables, denormalization is a key practice in Cassandra table design.
Designing for optimal storage: In relational databases, how data is physically stored is usually transparent to the designer. With Cassandra, an important goal of the design is to optimize how data is distributed around the cluster.
Sorting is a design decision: In Cassandra, sorting can be done only on the clustering columns specified in the PRIMARY KEY, so the sort order has to be chosen when the table is designed.
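As a rough illustration of query-first design and denormalization, the following Python sketch merges two normalized data sets into one query-specific table at write time. It is not Cassandra code: the `users` and `orders` data, the `build_orders_by_user` helper, and the target query ("list a user's orders, newest first") are all hypothetical, and plain dictionaries stand in for tables.

```python
from collections import defaultdict

# Hypothetical source data, shaped like two normalized relational tables.
users = {1: "alice", 2: "bob"}
orders = [
    {"order_id": 101, "user_id": 1, "placed_at": "2024-03-02", "total": 40.0},
    {"order_id": 102, "user_id": 2, "placed_at": "2024-03-01", "total": 15.5},
    {"order_id": 103, "user_id": 1, "placed_at": "2024-03-05", "total": 9.9},
]


def build_orders_by_user(users, orders):
    """Build one denormalized table for the query
    'list a user's orders, newest first'."""
    table = defaultdict(list)
    for order in orders:
        row = {
            # The user name is copied into every row instead of being
            # joined in at read time.
            "user_name": users[order["user_id"]],
            "placed_at": order["placed_at"],
            "order_id": order["order_id"],
            "total": order["total"],
        }
        # user_id plays the role of the partition key.
        table[order["user_id"]].append(row)
    # placed_at and order_id play the role of clustering columns:
    # the sort order is fixed when the table is built, not at query time.
    for rows in table.values():
        rows.sort(key=lambda r: (r["placed_at"], r["order_id"]), reverse=True)
    return dict(table)


orders_by_user = build_orders_by_user(users, orders)

# The query becomes a single lookup by key, with rows already in order.
for row in orders_by_user[1]:
    print(row["placed_at"], row["order_id"], row["user_name"])
# 2024-03-05 103 alice
# 2024-03-02 101 alice
```

The cost of this design is duplicated data (the user name appears in every order row), which is the trade-off denormalization accepts in exchange for reads that need no join.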
The Fundamental Goals of the Cassandra Data Model
Distributed data systems, such as Cassandra, divide incoming data into chunks called partitions. Cassandra groups data into distinct partitions by hashing a data attribute called the partition key, and distributes these partitions among the nodes in the cluster.
(A detailed explanation can be found in Cassandra Data Partitioning.)
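The idea of hashing a partition key to pick a node can be sketched in a few lines of Python. This is a simplification under stated assumptions: the node names are hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, and a modulo replaces the token ranges that a real cluster assigns to its nodes.

```python
import hashlib
from collections import Counter

# Hypothetical three-node cluster.
NODES = ["node-a", "node-b", "node-c"]


def token_for(partition_key: str) -> int:
    """Hash a partition key to an integer token.
    MD5 is only a stand-in for a real partitioner."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a token to a node. Modulo is a simplification of token ranges."""
    return nodes[token_for(partition_key) % len(nodes)]


# The same key always lands on the same node.
print(node_for("user-42") == node_for("user-42"))  # True

# Many distinct keys spread roughly evenly over the nodes.
placement = Counter(node_for(f"user-{i}") for i in range(10_000))
for node in NODES:
    print(node, placement[node])
```

Because placement depends only on the hash of the partition key, a key with many distinct values spreads data across the cluster, while a key with few distinct values concentrates it on a few nodes.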
A good Cassandra data model is one that:
- Distributes data evenly across the nodes in the cluster
- Places limits on the size of a partition
- Minimizes the number of partitions read by a query.
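These goals can pull against each other, and a small Python sketch helps compare candidate partition keys before committing to one. Everything here is an assumption made for illustration: the sensor data set, the two candidate keys, and the `partition_sizes` and `partitions_touched` helpers are hypothetical and are not part of any Cassandra tooling.

```python
from collections import Counter

# Hypothetical data set: 4 sensors, 3 days, 100 readings per sensor per day.
readings = [
    (f"sensor-{s}", f"2024-01-{d:02d}", n)
    for s in range(4)
    for d in range(1, 4)
    for n in range(100)
]


def partition_sizes(rows, key_fn):
    """Count the rows that fall into each partition for a candidate key."""
    return Counter(key_fn(row) for row in rows)


def partitions_touched(sizes, predicate):
    """Count the partitions a query must read, given a filter on the key."""
    return sum(1 for key in sizes if predicate(key))


# Candidate 1: partition by sensor only.
by_sensor = partition_sizes(readings, lambda r: r[0])
# Candidate 2: partition by (sensor, day).
by_sensor_day = partition_sizes(readings, lambda r: (r[0], r[1]))

# Partition size: adding the day to the key caps how large a partition gets.
print(len(by_sensor), max(by_sensor.values()))          # 4 300
print(len(by_sensor_day), max(by_sensor_day.values()))  # 12 100

# Partitions read by the query "all readings for sensor-0".
print(partitions_touched(by_sensor, lambda k: k == "sensor-0"))         # 1
print(partitions_touched(by_sensor_day, lambda k: k[0] == "sensor-0"))  # 3
```

In this toy data set, the finer-grained key keeps partitions small and gives the cluster more partitions to distribute, but the query for a whole sensor now reads three partitions instead of one. Which trade-off is acceptable depends on the queries the model has to serve.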