Overview
Apache Cassandra 3.0 introduces a new feature called materialized views. Materialized views behave like they do in other database systems, you create a table that is populated by the results of a query. Cassandra also keeps the materialized view up to date based on the data you insert into the base table. Whilst the feature itself sounds very simple, it becomes very powerful when working with a denormalized schema where you often end up writing the data multiple times in a way that will fit future reads.
By leveraging materialized views you can have some of this logic live in the database and let Cassandra keep everything up to date (in a somewhat eventually consistent manner). Materialized views also allow you to replace some of the functionality given to us by secondary index, which can be painful to manage and create performance bottlenecks. Below is a basic example of how you can use materialized views to replace a secondary index for much better performance.
This example is based on a the idea of a dating app whereby users are matched with other users based on an arbitrary algorithm (in this example its not relevant). The matches are stored in Cassandra and the users can accept or reject the matches as they see fit.
Example
First lets create our basic schema.
1 |
CREATE KEYSPACE netflix_and_chill WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1' : 3}; |
Our application will have replication factor of 3
1 2 3 4 5 6 7 |
CREATE TABLE IF NOT EXISTS netflix_and_chill.users ( user_id uuid, first_name text, last_name text, email text, PRIMARY KEY (user_id) ); |
Our base table will contain all our users and some basic information.
1 2 3 4 5 6 7 8 9 10 11 |
CREATE TABLE IF NOT EXISTS netflix_and_chill.user_matches ( user uuid, matched_user uuid, state text, year int, month int, day int, PRIMARY KEY ((user, matched_user)) ); CREATE INDEX user_match_idx ON netflix_and_chill.user_matches (state); |
This table will contain a set of matched user pairs. Normally this is generated by our magic machine learning pipeline built with spark… but for this example we’ll just use some dummy data. We will also create our secondary index so we can query the user_matches
table by user_id
and state
. This will allow our app to display a list of ACTIVE
matches to the logged in user.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
INSERT INTO netflix_and_chill.users (user_id, first_name, last_name, email) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 'Jordan', 'Smith', '[email protected]'); INSERT INTO netflix_and_chill.users (user_id, first_name, last_name, email) VALUES (8c3fb75c-a713-4750-b945-c51074257643, 'Charlie', 'Green', '[email protected]'); INSERT INTO netflix_and_chill.users (user_id, first_name, last_name, email) VALUES (3bf70628-9f24-46bc-95ff-c70eb2486ea4, 'Jamie', 'Fletcher', '[email protected]'); INSERT INTO netflix_and_chill.users (user_id, first_name, last_name, email) VALUES (25fc8d63-673d-4d43-8735-582bfe2646d6, 'Emerson', 'Jones', '[email protected]'); INSERT INTO netflix_and_chill.users (user_id, first_name, last_name, email) VALUES (4c3fb75c-a713-4750-b945-c51074257643, 'Casey', 'Ali', '[email protected]'); INSERT INTO netflix_and_chill.users (user_id, first_name, last_name, email) VALUES (5bf70628-9f24-46bc-95ff-c70eb2476e96, 'Amari', 'Wiat', '[email protected]'); // Some basic matches INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 8c3fb75c-a713-4750-b945-c51074257643, 'ACTIVE', 2015, 11, 17); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (8c3fb75c-a713-4750-b945-c51074257643, 3bf70628-9f24-46bc-95ff-c70eb2486e96, 'ACTIVE', 2015, 11, 17); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (5bf70628-9f24-46bc-95ff-c70eb2476e96, 25fc8d63-673d-4d43-8735-582bfe2646d6, 'ACTIVE', 2015, 11, 17); //What about this super popular person Jamie? INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 95fc8d63-673d-4d43-8735-582bfe26e6d6, 'ACTIVE', 2015, 11, 17); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 8c3fb75c-a713-4750-b945-c51074257643, 'ACTIVE', 2015, 11, 17); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 3bf70628-9f24-46bc-95ff-c70eb2486ea4, 'ACTIVE', 2015, 11, 17); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 25fc8d63-673d-4d43-8735-582bfe2646d6, 'ACTIVE', 2015, 11, 18); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 4c3fb75c-a713-4750-b945-c51074257643, 'ACTIVE', 2015, 11, 18); INSERT INTO netflix_and_chill.user_matches (user, matched_user, state, year, month, day) VALUES (95fc8d63-673d-4d43-8735-582bfe26e6d6, 5bf70628-9f24-46bc-95ff-c70eb2476e96, 'ACTIVE', 2015, 11, 18); |
Excellent, now that we have some dummy data let’s see what matches will show up when the user Jordan decides to login.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
cqlsh> SELECT * FROM netflix_and_chill.user_matches WHERE user = 95fc8d63-673d-4d43-8735-582bfe26e6d6 AND state = 'ACTIVE' ALLOW FILTERING; user | matched_user | day | month | state | year --------------------------------------+--------------------------------------+-----+-------+--------+------ 95fc8d63-673d-4d43-8735-582bfe26e6d6 | 4c3fb75c-a713-4750-b945-c51074257643 | 18 | 11 | ACTIVE | 2015 95fc8d63-673d-4d43-8735-582bfe26e6d6 | 8c3fb75c-a713-4750-b945-c51074257643 | 17 | 11 | ACTIVE | 2015 95fc8d63-673d-4d43-8735-582bfe26e6d6 | 95fc8d63-673d-4d43-8735-582bfe26e6d6 | 17 | 11 | ACTIVE | 2015 95fc8d63-673d-4d43-8735-582bfe26e6d6 | 5bf70628-9f24-46bc-95ff-c70eb2476e96 | 18 | 11 | ACTIVE | 2015 95fc8d63-673d-4d43-8735-582bfe26e6d6 | 25fc8d63-673d-4d43-8735-582bfe2646d6 | 18 | 11 | ACTIVE | 2015 (5 rows) Tracing session: 5e451df0-92fc-11e5-9bd3-e99bb43e31f8 activity | timestamp | source | source_elapsed --------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+---------------- Execute CQL3 query | 2015-11-24 14:40:31.567000 | 127.0.0.1 | 0 ... Truncated for brevity... Preparing statement [SharedPool-Worker-1] | 2015-11-24 14:40:31.568000 | 127.0.0.1 | 351 Computing ranges to query [SharedPool-Worker-1] | 2015-11-24 14:40:31.683000 | 127.0.0.1 | 115863 Index mean cardinalities are user_match_idx:1. Scanning with user_match_idx. [SharedPool-Worker-1] | 2015-11-24 14:40:31.704000 | 127.0.0.1 | 136184 Submitting range requests on 1537 ranges with a concurrency of 1537 (0.001171875 rows per range expected) [SharedPool-Worker-1] | 2015-11-24 14:40:31.705000 | 127.0.0.1 | 137938 Enqueuing request to /127.0.0.5 [SharedPool-Worker-1] | 2015-11-24 14:40:31.710000 | 127.0.0.1 | 142593 Enqueuing request to /127.0.0.6 [SharedPool-Worker-1] | 2015-11-24 14:40:31.712000 | 127.0.0.1 | 144504 Enqueuing request to /127.0.0.2 [SharedPool-Worker-1] | 2015-11-24 14:40:31.712000 | 127.0.0.1 | 144601 Enqueuing request to /127.0.0.4 [SharedPool-Worker-1] | 2015-11-24 14:40:31.713000 | 127.0.0.1 | 145989 Enqueuing request to /127.0.0.6 [SharedPool-Worker-1] | 2015-11-24 14:40:31.713000 | 127.0.0.1 | 146089 Enqueuing request to /127.0.0.4 [SharedPool-Worker-1] | 2015-11-24 14:40:31.714000 | 127.0.0.1 | 146162 Enqueuing request to /127.0.0.3 [SharedPool-Worker-1] | 2015-11-24 14:40:31.714000 | 127.0.0.1 | 146214 ... Truncated for brevity... |
The actual trace result shows over 2500 different operations, but the key thing to note is the co-ordinator started to enqueue requests to *all* other nodes in the cluster, despite only having a replication factor of 3. This means as our netflix and chill app grows in popularity and we have to add more nodes our queries will get slower as the index is distributed over all nodes.
Enter the materialized view, it allows us to recreate this secondary index but as a normal Cassandra table where we have total control over the partition key.
1 2 3 4 |
CREATE MATERIALIZED VIEW user_match_states AS SELECT user, matched_user FROM netflix_and_chill.user_matches WHERE user IS NOT NULL AND year IS NOT NULL AND month IS NOT NULL AND state IS NOT NULL AND day IS NOT NULL AND matched_user IS NOT NULL PRIMARY KEY ((user), state, matched_user); |
Here we have told Cassandra to create a new table with data from the user_matches table with a new partition key format. We were even able to include new fields in our primary key. You can create a materialized view at any point and Cassandra will set off a back ground process to populate your materialized view with data from the existing table.
Let’s now run that same query against our new materialized view.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
cqlsh> SELECT * FROM netflix_and_chill.user_match_states WHERE user = 95fc8d63-673d-4d43-8735-582bfe26e6d6 AND state = 'ACTIVE'; user | state | matched_user --------------------------------------+--------+-------------------------------------- 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 25fc8d63-673d-4d43-8735-582bfe2646d6 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 4c3fb75c-a713-4750-b945-c51074257643 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 5bf70628-9f24-46bc-95ff-c70eb2476e96 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 8c3fb75c-a713-4750-b945-c51074257643 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 95fc8d63-673d-4d43-8735-582bfe26e6d6 (5 rows) Tracing session: edfdf030-9305-11e5-9bd3-e99bb43e31f8 activity | timestamp | source | source_elapsed ----------------------------------------------------------------------------------------------------------+----------------------------+-----------+---------------- Execute CQL3 query | 2015-11-24 15:48:58.168000 | 127.0.0.1 | 0 ... Truncated for brevity... Preparing statement [SharedPool-Worker-1] | 2015-11-24 15:48:58.195000 | 127.0.0.1 | 28138 Executing single-partition query on user_match_states [SharedPool-Worker-2] | 2015-11-24 15:48:58.225000 | 127.0.0.1 | 57920 Acquiring sstable references [SharedPool-Worker-2] | 2015-11-24 15:48:58.225000 | 127.0.0.1 | 58155 Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-11-24 15:48:58.226000 | 127.0.0.1 | 59049 Merging data from memtables and 0 sstables [SharedPool-Worker-2] | 2015-11-24 15:48:58.227000 | 127.0.0.1 | 59907 Read 5 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-11-24 15:48:58.236000 | 127.0.0.1 | 68904 Request complete | 2015-11-24 15:48:58.236329 | 127.0.0.1 | 68329 |
Wow that is a world of difference, the query only hit the primary replica as the query now matches the primary key for that table!
Users like Jordan now get their matches much quicker now!
Let’s see what happens when Jordan swipes left:
1 2 3 4 5 6 7 8 9 10 11 |
UPDATE netflix_and_chill.user_matches SET state = 'DECLINED' WHERE user = 95fc8d63-673d-4d43-8735-582bfe26e6d6 AND matched_user = 5bf70628-9f24-46bc-95ff-c70eb2476e96; cqlsh> SELECT * FROM netflix_and_chill.user_match_states WHERE user = 95fc8d63-673d-4d43-8735-582bfe26e6d6 AND state = 'ACTIVE'; user | state | matched_user --------------------------------------+--------+-------------------------------------- 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 25fc8d63-673d-4d43-8735-582bfe2646d6 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 4c3fb75c-a713-4750-b945-c51074257643 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 8c3fb75c-a713-4750-b945-c51074257643 95fc8d63-673d-4d43-8735-582bfe26e6d6 | ACTIVE | 95fc8d63-673d-4d43-8735-582bfe26e6d6 (4 rows) |
The change to user_matches has flowed through to the materialized view user_match_states
. Perfect!
There are many other tasks you’ll be able to accomplish with materialized views and as support matures you’ll be able to mix in things like user defined functions and aggregations, for even more powerful materialized views.