At Instaclustr we are constantly enhancing our monitoring and security systems based on learnings from the large and varied Cassandra deployments we manage in production. We will shortly be releasing some significant improvements that include minor changes to customer clusters.
We have recently replaced our monitoring architecture with a Riemann-based system. This change gives us a stronger, more flexible base on which to continually tune and enhance our monitoring as we learn from our experience running Cassandra for our customers.
There are two changes about to be released which we think will be particularly useful in diagnosing and pre-empting latency issues:
- Unbalanced node latency monitoring. We are introducing alerting in our monitoring system for when latency on one node gets significantly out of step with the latency of the other nodes in the cluster. This is typically a sign of issues with the underlying cloud infrastructure (a noisy neighbour or an underlying hardware fault). In these circumstances we can use our automated provisioning system, together with Cassandra’s ability to automatically populate a newly provisioned node, to move the node to a fresh VM without interruption to service.
- Synthetic transaction monitoring. We are introducing synthetic transaction monitoring of Cassandra reads and writes in customer clusters. This monitoring will schedule and measure a common series of simple, controlled Cassandra reads and writes against a segregated keyspace in each cluster. This allows us to measure latency and availability without the variance caused by customer schemas and query patterns, which means that (a) we can tune monitoring thresholds more closely against this data, detecting problems earlier, and (b) when diagnosing issues we will have more information to differentiate infrastructure and architecture issues from customer-specific query, data or schema issues.
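The unbalanced-node check described above can be sketched in a few lines. This is a minimal illustration, not our production implementation: the function name, the latency inputs and the `ratio` threshold are all hypothetical, and the real system works on streams of metrics rather than a single snapshot.

```python
from statistics import median

def find_outlier_nodes(node_latencies, ratio=2.0):
    """Flag nodes whose latency is significantly out of step with peers.

    node_latencies: dict of node name -> recent latency in ms (hypothetical input).
    ratio: illustrative threshold; a node is flagged when its latency
    exceeds `ratio` times the cluster-wide median.
    """
    cluster_median = median(node_latencies.values())
    return [node for node, latency in node_latencies.items()
            if latency > ratio * cluster_median]

# Example: node-3 sits on a noisy-neighbour VM.
latencies = {"node-1": 4.1, "node-2": 3.8, "node-3": 19.5}
print(find_outlier_nodes(latencies))  # ['node-3']
```

Comparing against the cluster median (rather than a fixed threshold) is what makes this an *unbalanced* check: it fires on one node diverging from its peers, not on the whole cluster slowing down together.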
To implement the synthetic transaction monitoring, we will be creating a keyspace called “Instaclustr” in each managed cluster. This keyspace contains a single table, to which we write and read 30 records every 5 seconds. TTL (time to live) on the table is set so that there should never be more than 60 live records in the table at any time. Processing load and disk space usage will therefore be minimal.
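The arithmetic behind the 60-record cap is simple: the number of live rows is bounded by the batches written within the last TTL window. A short sketch, assuming for illustration a 10-second TTL (the actual TTL value is not stated above):

```python
import math

def max_live_records(batch_size, interval_s, ttl_s):
    """Upper bound on live rows: only batches written within the
    last `ttl_s` seconds can still be unexpired."""
    return batch_size * math.ceil(ttl_s / interval_s)

# 30 records every 5 seconds; with an assumed TTL of 10 seconds,
# at most two batches are ever live at once.
print(max_live_records(30, 5, 10))  # 60
```

Any TTL of 10 seconds or less keeps the table at or under 60 live records, which is why the processing and disk-space footprint stays negligible.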
As part of implementing this change, we have also enhanced security around our processes for gaining access to password-protected clusters for system and administrative usage. The new method is based on a custom authenticator class for Cassandra, which grants access to a single admin user via a complex per-node password that is regenerated every 5 minutes. Access to this password is only available to users who already have SSH access to the machine for admin purposes.
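One way a per-node, time-windowed password like this could be derived is by keying a hash on a local secret plus the current 5-minute window. This is purely an illustrative sketch under our own assumptions: the function, the `node_secret` file-on-disk idea and the HMAC construction are hypothetical, and the actual mechanism in our authenticator is not described here.

```python
import hashlib
import hmac

def window_password(node_secret: bytes, node_id: str, now_s: int,
                    window_s: int = 300) -> str:
    """Derive a per-node password that changes every `window_s` seconds.

    node_secret: a local secret readable only by admins who already
    have SSH access to the node (hypothetical storage model).
    """
    bucket = now_s // window_s  # 300 s = the 5-minute rotation window
    msg = f"{node_id}:{bucket}".encode()
    return hmac.new(node_secret, msg, hashlib.sha256).hexdigest()

# Same 5-minute window -> same password; next window -> a new one.
secret = b"example-secret"
p1 = window_password(secret, "node-1", 1_000)
p2 = window_password(secret, "node-1", 1_100)  # same window
p3 = window_password(secret, "node-1", 1_600)  # later window
print(p1 == p2, p1 == p3)  # True False
```

The key property is that anyone who can read the node-local secret (i.e. who already has SSH admin access) can recompute the current password, while the password itself is useless a few minutes later.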
These changes will be rolled out across all managed clusters over the next few weeks. For customers currently using authentication with their clusters, there will be specific communication from Instaclustr support to ensure there are no unintended impacts from the change.
We’re excited to be releasing these changes, which we think will be another great step forward in enhancing the reliability and security of the clusters under our management.