(Source: Shutterstock)
In the first part of this blog (Cluster Creation in Under Ten Minutes), I created a Cassandra cluster. In this part, we blast off to the Moon for 2nd contact.
Consulting the Oracles
Croesus: Hi Oracle. How will my war with Cyrus the Persian go?
Oracle: If you proceed, a great empire will be destroyed.
Croesus: Great – Attack!
What happened? Croesus was defeated as he had incorrectly assumed it would be the Persian empire that would be destroyed.
King Priam of Troy: Hi Cassandra. The Trojans have left an offering of a wooden horse at the gate. Is it safe to bring it inside?
Cassandra: No! Leave it alone, it’s a trick! The Trojan soldiers are inside and Troy will be destroyed!
King Priam: That’s a Yes then. Bring the horse inside the gates!
What happened? Troy was sacked. Cassandra was a prophetess but had been cursed by Apollo so that no one believed her.
(Source: Shutterstock)
Let’s explore one way to connect to and consult Cassandra…
Consulting Cassandra (CQLSH)
Now that my Cassandra cluster is up and running, what can I do with it? How do I talk to it? What sort of clients are supported? What’s the protocol? How does the client find the cluster? Is the client just another node in the gossip protocol? (No). Is there a default REST API? (No).
The Apache docs say that the API to Cassandra is CQL, the Cassandra Query Language. To use CQL, you will need to connect to the cluster, which can be done:
- either using cqlsh,
- or through a client driver for Cassandra.
Here’s the general Instaclustr documentation on Connecting to a Cluster, and the specific documentation for Connecting to Instaclustr Using Cqlsh.
The 1st thing you need to do is to obtain the Apache Cassandra installation including the cqlsh shell. You don’t need to install or run Cassandra on your machine to run cqlsh. Using cqlsh allows you to check that you can connect to your newly created Instaclustr Cassandra Cluster and is also useful for debugging even after you have a Java etc client working.
Downloading Cassandra – using a browser click on the link to the Latest Version on this page and it will take you to mirror options for you to download from. Once downloaded and you’ve extracted/unzipped it (reminder, no need to start it), start a command line on your preferred OS, go to the Cassandra bin directory and type the magic incantation to connect:
1 |
./cqlsh public_ip_of_a_node_in_your_cluster 9042 -u username -p your_password |
Cassandra drivers must be provided with the address of at least one node in the cluster and will auto-discover the remaining topology (all the IP addresses) of the cluster via the CQL binary protocol. The driver maintains a connection to the cluster and will keep this internal topology info up-to-date (Cassandra uses the gossip protocol internally to build this topology information). Subsequently the client can connect to any node (depending on the load balancing policy).
To connect to the cluster we need a public ip address of at least one node, and username and password. Log into https://console.instaclustr.com and click on the name of the trial cluster created previously. Under details, you will see the 3 nodes and their Public IP addresses. Under Settings (Firewall rules) you can check (or add) client IP addresses that are allowed to connect to the Cassandra cluster. Click on the Connection Information Tab for detailed connection information. This page shows all the public IP addresses of the nodes, and the username/password. For cqlsh connection just pick any one public IP address.
I have a confession. I’ve heard that CQL is “like” SQL which makes it confusing for people with a SQL background. I may be lucky! The only experience I have with SQL is a group project at university to design and implement a SQL database. Once we built it we didn’t actually use it for anything useful. Oh, and I built a system for distributed Oracle instances (with an eventual time to the consistency of 30 minutes). So I’m not (yet) confused by CQL, I just have no preconceptions at all.
Once the cqlsh is up and running you can consult Cassandra. I tried typing “Hi Cassandra;” but only got “SyntaxException: line 1:0 no viable alternative at input ‘hi’ ([hi]…)”. Time to read the CQL commands docs or type “help;”.
DESCRIBE is useful. E.g. “DESCRIBE keyspaces;”, “DESCRIBE tables;”
DESCRIBE keyspace Instaclustr;
will show information about all the tables in the Instaclustr keyspace (one that our management system creates in every cluster we manage for synthetic transaction monitoring and other uses).
SELECT * from keyspace.table;
is useful for looking at things (assuming there are not too many!) Once you’ve found a keyspace and table then try it out.
SELECT peer from system.peers;
gives you the IP addresses of the other 2 nodes (that you didn’t connect to via the client) in the cluster.
Note that we recommend that you immediately change the superuser password, and create a non-superuser account. Here’s how to change the password.
LIST users;
And then:
ALTER USER iccassandra WITH PASSWORD ‘newpassword’;
And while we’re at it let’s make sure HAL can’t take over by creating a non super user:
CREATE user hal9000 with password ‘ImsorryDave’ NOSUPERUSER;
And allow hal900 to do things:
GRANT ALL ON ALL KEYSPACES TO hal9000;
EXIT;
And reconnect again as hal9000.
Note that roles (allowing more sophisticated security) are supported in the current versions of Cassandra:
LIST roles;
What else can you do? Well, you can try creating a keyspace, a table, inserting some rows, selecting rows, etc. However, this requires some understanding of Cassandra concepts including keyspaces, tables, primary keys, partition keys, clustering columns, etc.
Here is the relevant documentation and some definitions:
https://cassandra.apache.org/doc/latest/cql/ddl.html#create-keyspace
https://cassandra.apache.org/doc/latest/cql/dml.html
Keyspace
A namespace container that defines how data is replicated on nodes. Similar to a “database” in a RDBMS.
Table (Column Family)
A container for rows, similar to the table in a relational system. Called table in CQL 3.
Primary Key
The partition key. One or more columns that uniquely identify a row in a table.
Clustering Column
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not part of the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Compound Partition Key
A partition key consisting of multiple columns.
Compound Primary Key
A primary key consisting of the partition key, which determines on which node data is stored, and one or more additional columns that determine clustering.
Here are some examples of possible primary keys:
Simple Primary Key, Simple Partition Key
PRIMARY KEY(partition1)
Simple Primary Key, Compound Partition Key
PRIMARY KEY((partition1, partition2))
Compound Primary Key, Simple Partition Key
PRIMARY KEY(partition1, clustering1)
PRIMARY KEY(partition1, clustering1, clustering2)
Compound Primary Key, Compound Partition Key
PRIMARY KEY((partition1, partition2), clustering1)
PRIMARY KEY((partition1, partition2), clustering1, clustering2)
Let’s create a table and some data. First, create a keyspace with some probably naive defaults. I called it dev.
CREATE KEYSPACE dev WITH replication = {‘class’: ‘SimpleStrategy’, ‘replication_factor’: ‘3’} AND durable_writes = true;
Have a look at it with:
DESCRIBE KEYSPACE dev;
Now create a table for fake sensor data to store data of the form:
<’host1’, ‘temp’, ‘sometime’, 22.2>
CREATE TABLE dev.sensordata( host text, metric text, time timestamp, value double, PRIMARY KEY ((host, metric), time) ) WITH CLUSTERING ORDER BY (time ASC);
What does this do? Create a new table in the dev keyspace with the name sensordata with the columns: host, metric, time, value. The partition key is compound and is made up of host AND metric. All queries will need to specify values for both of these columns. The time column is a clustering column (order) with ascending order (past time to now).
Now type:
DESCRIBE TABLE dev.sensordata;
And you get all this back:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
CREATE TABLE dev.sensordata ( host text, metric text, time timestamp, value double, PRIMARY KEY ((host, metric), time) ) WITH CLUSTERING ORDER BY (time ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} AND comment = '' AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'} AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99PERCENTILE'; |
To insert some data type:
1 |
INSERT INTO dev.sensordata(host, metric, time, value) VALUES ('host1', 'temp', toTimestamp(now()), 22.2); |
And then:
SELECT * from dev.sensordata;
host | metric | time | value
——-+———-+————————————————-+——–
host1 | temp | 2017-08-24 06:01:14.504000+0000 | 22.2
If you repeat the identical insert and then select again you will get 2 rows because the time is different:
host | metric | time | value
——–+——–+————————————————–+——–
host1 | temp | 2017-08-24 06:01:14.504000+0000 | 22.2
host1 | temp | 2017-08-24 06:02:35.412000+0000 | 22.2
Notice that the results are returned in increasing time order.
Now let’s try out the WHERE clause. I’ve added some more rows and now have 2 hosts and 2 metrics:
host | metric | time | value
——–+————-+————————————————–+——-
host2 | temp | 2017-08-24 06:05:34.893000+0000 | 20.4
host2 | temp | 2017-08-24 06:05:43.805000+0000 | 22.1
host2 | temp | 2017-08-24 06:05:49.333000+0000 | 20.1
host1 | pressure | 2017-08-24 06:06:21.136000+0000 | 44
host1 | pressure | 2017-08-24 06:06:29.276000+0000 | 42
host2 | pressure | 2017-08-24 06:06:11.406000+0000 | 33
host2 | pressure | 2017-08-24 06:06:14.758000+0000 | 44
host1 | temp | 2017-08-24 06:01:14.504000+0000 | 22.2
host1 | temp | 2017-08-24 06:02:35.412000+0000 | 22.2
host1 | temp | 2017-08-24 06:03:44.405000+0000 | 10.9
host1 | temp | 2017-08-24 06:03:51.538000+0000 | 11.9
To get rows for host1 AND temp only type:
SELECT * from dev.sensordata where host=’host1′ and metric=’temp’;
host | metric | time | value
——-+———-+—————————————————+——-
host1 | temp | 2017-08-24 06:01:14.504000+0000 | 22.2
host1 | temp | 2017-08-24 06:02:35.412000+0000 | 22.2
host1 | temp | 2017-08-24 06:03:44.405000+0000 | 10.9
host1 | temp | 2017-08-24 06:03:51.538000+0000 | 11.9
Notice that we had to specify both of the compound partition key columns in the where clauses. You can also add a where clause over the clustering key (with inequalities to).
SELECT * from dev.sensordata where host=’host1′ and metric=’temp’ and time>toTimestamp(now());
Returns nothing (predictably, as we are asking for future events). Now try some other select queries with missing where clauses. For example the following won’t work as the partition keys are missing (only the clustering key is provided):
SELECT * from dev.sensordata where time>toTimestamp(now());
InvalidRequest: Error from server: code=2200 [Invalid query] message=”Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING”
What does ALLOW FILTERING do?
By default, CQL only allows select queries that don’t involve “filtering” server side, i.e. queries where we know that all (live) record read will be returned (maybe partly) in the result set. The reasoning is that those “non filtering” queries have predictable performance in the sense that they will execute in a time that is proportional to the amount of data returned by the query (which can be controlled through LIMIT).
The ALLOW FILTERING option allows to explicitly allow (some) queries that require filtering. Please note that a query using ALLOW FILTERING may thus have unpredictable performance (for the definition above), i.e. even a query that selects a handful of records may exhibit performance that depends on the total amount of data stored in the cluster. … However, if you “know what you are doing”, you can force the execution of this query by using ALLOW FILTERING (I don’t so I won’t).
Aggregate functions (min, max, avg, count, sum) will also work on this table (it would be more sensible to have a time range where clause):
SELECT max(value) from dev.sensordata where host=’host1′ and metric=’temp’ and time<toTimestamp(now());
system.max(value)
————————–
22.2
Next blog part:
Connecting to Cassandra With a Java Client
NOTES
1 In this simple example there is no bucket (time) column which is common in Cassandra time series data tables (but is it really needed?)
2 I made time order ascending. Descending is more common for time series data when you are interested in the most recent data first.
3 How do you know what the possible host and metric names are? Try this:
SELECT distinct host,metric from dev.sensordata;
host | metric
——-+———-
host2 | temp
host1 | pressure
host2 | pressure
host1 | temp