It’s all about the timing!
Instaclustr has recently implemented its own NTP (Network Time Protocol) cluster for the Cassandra clusters we manage. This is another example of how we are working to provide the best possible management and reliability for our customers’ Cassandra clusters.
NTP, as most technical people know, is a protocol that allows servers to synchronise their clocks against the “correct” time published by other servers on the Internet. Most people setting up a server use ntpd or similar to connect to an NTP server and keep the server’s idea of time up to date. However, Cassandra can be quite sensitive to small differences in time between servers, because it uses timestamps to decide which of two conflicting writes wins. Publicly available NTP services can also differ significantly in what they advertise as the correct time.
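To make that sensitivity concrete: when two writes hit the same cell, Cassandra keeps the one with the higher timestamp, and that timestamp typically comes from the clock of whichever client or coordinator stamped the write. The sketch below is a deliberately simplified model of this last-write-wins rule, not Cassandra’s actual code; the `Write`, `stamp` and `reconcile` names and the 50 ms skew are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # microseconds, taken from the clock of the node that stamped the write


def reconcile(a: Write, b: Write) -> Write:
    """Simplified last-write-wins: the higher timestamp survives.

    Ties simply keep `a`; real Cassandra has its own tie-break rules.
    """
    return b if b.timestamp_us > a.timestamp_us else a


def stamp(value: str, true_time_us: int, clock_skew_us: int) -> Write:
    """Stamp a write using a local clock that is off by clock_skew_us."""
    return Write(value, true_time_us + clock_skew_us)


# The first write is stamped by a node whose clock is correct.
first = stamp("old-email@example.com", true_time_us=1_000_000, clock_skew_us=0)
# The update really happens 10 ms later, but is stamped by a node running 50 ms behind.
second = stamp("new-email@example.com", true_time_us=1_010_000, clock_skew_us=-50_000)

winner = reconcile(first, second)
print(winner.value)  # prints "old-email@example.com": the later update is silently lost
```

Nothing errors out here: the newer value simply loses because its timestamp (960,000 µs) is lower than the older one’s (1,000,000 µs), which is why these problems are hard to spot.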
This issue, along with a solution, is described in two excellent blog posts by Viliam Holub from Logentries:
- https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
- https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/
Until now, we have been using publicly available NTP sources for the Cassandra servers we manage at Instaclustr. Like most people managing small to medium clusters, we have seen relatively few issues arising from unsynchronised time between servers. (We also check NTP synchronisation as part of our daily health checks of all managed nodes, which undoubtedly helps.) However, we have occasionally seen issues caused by clock differences between servers, and we’re on a continual mission to eliminate as many of the causes of occasional issues with our customers’ clusters as we possibly can.
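A synchronisation check of this kind usually starts from each node’s clock offset. As a rough sketch (not Instaclustr’s actual health check), the Python below computes the offset and round-trip delay from the four timestamps of a single NTP exchange, using the standard formulas from RFC 5905, and then flags nodes beyond a limit. The function names, node names and the 10 ms limit are hypothetical.

```python
from __future__ import annotations


def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Offset and round-trip delay from one client/server NTP exchange (RFC 5905).

    t1: client transmit time (client clock)
    t2: server receive time (server clock)
    t3: server transmit time (server clock)
    t4: client receive time (client clock)
    A positive offset means the client's clock is behind the server's.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def max_pairwise_skew(offsets_ms: dict[str, float]) -> float:
    """Worst-case disagreement between any two nodes, assuming every offset
    was measured against the same reference."""
    return max(offsets_ms.values()) - min(offsets_ms.values())


def out_of_sync(offsets_ms: dict[str, float], limit_ms: float = 10.0) -> list[str]:
    """Nodes whose offset exceeds a (hypothetical) limit, sorted by name."""
    return sorted(node for node, off in offsets_ms.items() if abs(off) > limit_ms)


# One exchange, timestamps in milliseconds: the client clock is 100 ms behind.
offset, delay = ntp_offset_and_delay(t1=1000, t2=1110, t3=1120, t4=1030)
print(offset, delay)  # 100.0 20

offsets = {"node-1": 0.5, "node-2": -1.5, "node-3": offset}
print(out_of_sync(offsets))        # ['node-3']
print(max_pairwise_skew(offsets))  # 101.5
```

The max-minus-min figure is only meaningful when every node is measured against the same reference. Offsets taken against different public servers would also include those servers’ disagreements with each other.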
So, with our most recent release, we have deployed our own NTP cluster, configured according to the solution Viliam describes. All new clusters will use this NTP service immediately, and existing customers will be migrated to it as part of normal upgrades over the coming weeks.
This is another example of the kind of nagging issue that is very hard to find time to address when you are running your own Cassandra cluster, but that we have already addressed for customers using our managed service. It’s an issue that might affect a single cluster once a year or less, but removing these once-a-year issues adds up to a noticeable improvement in the reliability of your service.
It is another important step toward making Instaclustr the most reliable way to run Cassandra for your application.