Monitoring at Scale Series
In this series of blogs, explore the various ways we pushed our metrics pipeline (mainly our Apache Cassandra® cluster named Instametrics) to the limit, and how we went about reducing the load it was experiencing on a daily basis. Beginning in 2015 and continuing to the present, discover how Instaclustr has built and expanded its metrics pipeline over the years:
Vol. 1: Monitoring Apache Cassandra® and IT Infrastructure (December 2015)
Vol. 2: Upgrading Instametrics to Apache Cassandra® 3 (November 2016)
Vol. 3: The Introduction of Kafka® (January 2022)
Vol. 4: Using Redis™ as a Cassandra® Cache (March 2022)
Vol. 5: Upgrading our Instametrics Cassandra® Cluster from 3.11.6 to 4.0 (June 2022)
Introduction
Efficient and effective monitoring of IT infrastructure is a common challenge faced by most modern organizations. This is particularly relevant for Software-as-a-Service (SaaS) companies as they strive to provide the best possible front-end user experience with the least downtime.
At Instaclustr, we not only need to ensure that our customer-facing environment is monitored and controlled; as an Infrastructure-as-a-Service (IaaS) company we also, and more importantly, need to carefully monitor every node of every Cassandra cluster that we provision and manage in the cloud on behalf of our customers.
In the early days of Instaclustr, our monitoring system was rudimentary. It would trigger a large number of alerts for the same issue, creating noise and at times resulting in some important ones being missed. Initially, our monitoring system only worked on a limited number of metrics and, to be honest, it really wasn’t designed to scale and support the growth that we have experienced this year.
All that was perfectly normal for a young startup. However, early in 2015, we decided to tackle the problem and concluded that we needed a monitoring system that would:
- allow us to collect, process and alert on a large number of metrics;
- instantly provide relevant information so our support team can operate efficiently; and
- be capable of scaling rapidly with our growth.
We evaluated a number of solutions, from fully off-the-shelf products to an entirely in-house solution, and decided on something somewhere in the middle. We identified that Riemann would support large scale monitoring while providing us flexibility in the design of the monitoring rules we needed for Cassandra. We have used this technology as the basis of our monitoring system, with some additional in-house developed capability to support our requirements.
Today, we can look back and be comfortable with this decision as I will explain below. I will also introduce a few of the most useful Riemann constructs, illustrated by an example of one of the many monitoring rules we have in production.
Why Riemann?
Riemann is a monitoring system that has the ability to efficiently and concurrently process a very large number of events sent by any server that needs to be monitored. But… what is an event exactly? From a very abstract perspective, an event is anything that happens at some point in time and reports some information about it.
From an IT monitoring perspective, an event is typically a piece of information sent by a host about a particular service. A customer provisioning a cluster? An event. A node starting? Another event. Cassandra starting and ready to serve requests? Yet another event. Events are very useful for knowing the state of the infrastructure. Some events by nature are recurring and report on observable metrics at the OS level, such as the disk usage on a particular partition, or at the application level, such as the number of pending compactions in Cassandra. Those types of events are metrics, and they are very useful for detecting problems. Riemann provides the capacity to write complex rules in a concise and natural way that efficiently processes the stream of incoming events and then acts accordingly.
Today, we collect more than 3,500 metrics on each of the nodes we monitor. We send metrics every 10 seconds to Riemann. We have in place a large number of rules that process those events: latency, disk usage, outlier nodes in a cluster, garbage collection, compaction, heartbeat, service down… When a node is unhealthy, Riemann triggers a PagerDuty alert so that our support team can immediately start taking action. And all those metrics are stored back into our Cassandra cluster for later analysis if needed.
Example of a Riemann Rule
Riemann rules are written using Riemann constructs called ‘streams’. We can also create our own ‘streams’ in Clojure, the language used by Riemann, to add more functionality.
In the following section, I will progressively introduce some of the most common Riemann constructs, and I will illustrate them with a simple example rule: Heartbeat Monitoring.
1. An event
In Riemann, an event is a data structure – a map – with a set of predefined fields, the most important being:
- Host: A string that identifies which host is reporting an event / metric.
- Service: A string that specifies what type of event / metric is being reported.
- Metric: A single value used to quantify the state of the service.
- Time: The timestamp (in epoch time) of the event.
- TTL: Time To Live, in seconds, indicating how long the metric is considered valid.
As Riemann is written in Clojure, such an event can be defined as:
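For example (the host name, service, and timestamp below are purely illustrative):

```clojure
(def my_event {:host    "cassandra-node-1"   ; illustrative host name
               :service "cpu_utilization"    ; illustrative service name
               :metric  0.88
               :time    1449702300           ; epoch time, in seconds
               :ttl     10})
```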
A map is just a collection that maps keys to values. Accessing a field of my_event, such as the metric, can be done with the following (which will return 0.88):
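In Clojure, a keyword acts as a function that looks itself up in a map:

```clojure
(:metric my_event)   ; => 0.88
```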
2. The (streams …) section.
In Riemann, the (streams …) section is the entry point for all events that Riemann receives. Every rule has to be written under this section, and every single event will flow through it. The dot-dot-dot notation represents the child streams.
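A minimal skeleton of a riemann.config therefore looks like this:

```clojure
(streams
  ; ... child streams (our monitoring rules) go here ...
  )
```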
3. The (where …) stream.
In this example, we want to monitor the heartbeat. This means that every node in the infrastructure will send an event at regular intervals, with a service named “No-Heartbeat”. In the (streams …) section, we can filter this service with the (where …) statement and a predicate:
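Something like the following:

```clojure
(streams
  (where (service "No-Heartbeat")
    ; ... child streams go here ...
    ))
```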
This means that only the events with the service equal to “No-Heartbeat” will be passed on to the child streams (represented as dot-dot-dot).
4. Triggering an alert
The first thing we can do when receiving an event is to print it to the console. This can easily be achieved with:
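Here prn is simply the standard Clojure print function, used directly as a child stream:

```clojure
(streams
  (where (service "No-Heartbeat")
    ; print every matching event to the console
    prn))
```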
Riemann comes with built-in integration with alerting services such as PagerDuty. If we want to send a PagerDuty alert in addition to printing to the console an event with service = “No-Heartbeat”, we can write:
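A sketch of what this looks like (the PagerDuty service key is a placeholder, and the exact arguments accepted by pagerduty can vary between Riemann versions):

```clojure
; Placeholder service key; newer Riemann releases take an options map,
; e.g. (pagerduty {:service-key "..."}).
(def pd (pagerduty "my-pagerduty-service-key"))

(streams
  (where (service "No-Heartbeat")
    prn              ; print the event to the console
    (:trigger pd)))  ; open a PagerDuty incident for the event
```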
Notice that the (where …) stream now has two child streams: prn and the pagerduty section. Events that pass the (where (service “No-Heartbeat”)) filter get passed to the two child streams (more precisely, a copy of the event gets passed on).
5. The (throttle …) stream
The code above will effectively trigger a PagerDuty alert on every “No-Heartbeat” event that we receive. That could be too noisy. Riemann comes with prebuilt flow-control functionality. One of these is the (throttle …) stream, which controls how many events get passed on to the child streams in a given period of time. Let’s say we want to receive only one PagerDuty alert per hour, and print to the console only one event per minute. We could do so with:
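For example:

```clojure
(streams
  (where (service "No-Heartbeat")
    ; pass at most 1 event every 60 seconds to prn
    (throttle 1 60
      prn)
    ; trigger at most 1 PagerDuty alert every hour (3600 seconds)
    (throttle 1 3600
      (:trigger pd))))
```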
6. The (by …) stream
The problem with the throttling we just introduced is that it doesn’t distinguish between all the hosts that are monitored. In other words, if two hosts send the “No-Heartbeat” service at approximately the same time, only the first one will trigger the alert. What we really want is to limit the number of alerts per host. The (by :host …) statement will effectively create a copy of the downstream streams, one for each host, which allows every host to be distinguished. The code can be improved with:
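For example:

```clojure
(streams
  (where (service "No-Heartbeat")
    ; split the stream by host, so each host gets its own throttles
    (by :host
      (throttle 1 60
        prn)
      (throttle 1 3600
        (:trigger pd)))))
```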
Of course, (by …) can be used with other event fields, such as :service. (by …) clauses can also be combined, such as (by [:host :service] …). However, with a large number of hosts and services, the number of combinations can easily become too large, and it is not recommended to do so when the number of bifurcations exceeds a few hundred thousand possibilities.
7. The index
So far, we have written a simple example of how to alert when we receive a “No-Heartbeat” message. But instead, what we really want is to alert when we no longer receive a “Heartbeat”. We are now going to work under the assumption that every host to monitor sends a “Heartbeat” at regular intervals, and when it stops, we want to trigger an alert. In order to achieve that, we need to keep track of the last time a “Heartbeat” for a specific host was received. That’s the role of the ‘index’. The index is a Riemann in-memory data structure that keeps track of all the events (by host/service pair) that we want to store in it, for as long as those events keep being updated (same host/service). If an event is no longer updated, it expires according to its TTL. In our case, we need to index the “Heartbeat” service, and we are going to index it with a default TTL of 10 seconds.
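A sketch of the resulting rule, reusing the pd stream defined earlier (the expiry-scan interval is illustrative):

```clojure
; Scan the index every 5 seconds and re-inject events whose TTL has expired.
(periodically-expire 5)

(let [index (index)]
  (streams
    ; Index every fresh "Heartbeat" event, giving it a default TTL of 10 seconds.
    (where (and (service "Heartbeat")
                (not (state "expired")))
      (default :ttl 10
        index))

    ; Expired heartbeats flow back in with the state "expired":
    ; alert on them, at most once per hour and per host.
    (where (and (service "Heartbeat")
                (state "expired"))
      (by :host
        (throttle 1 3600
          (:trigger pd))))))
```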
A few comments here:
- At regular intervals, Riemann removes events whose TTL has expired from the index. Expired events flow again through the (streams …) section with the state “expired”.
- When filtering the “Heartbeat” service in the (where …) statement, we now need to distinguish events that arrive with the state “expired”: we don’t want to alert on non-expired Heartbeats.
- Clojure uses prefix notation: the logical AND operator is written (and a b).
8. The async queue
We are almost there. There is one more problem in this code: PagerDuty being an external service, we need to contact it asynchronously for better performance. Riemann allows us to wrap streams in an async queue.
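A sketch, again reusing the pd stream from earlier (the queue name and pool sizes are illustrative):

```clojure
; Contact PagerDuty from a dedicated thread pool so that a slow HTTP call
; never blocks the main stream-processing threads.
(def alert-pagerduty
  (async-queue! :pagerduty {:queue-size     1000  ; events buffered while workers are busy
                            :core-pool-size 4     ; threads started initially
                            :max-pool-size  100}  ; maximum number of threads
    (:trigger pd)))
```

alert-pagerduty can then be used in place of (:trigger pd) in the rule above.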
In this case, we defined an async queue with a maximum of 100 threads, starting at 4.
9. Putting it all together.
Because an example is never complete without a fully working file, you can get the source code from our public Instaclustr GitHub project: https://github.com/instaclustr/sample-Riemann/tree/master/Heartbeat#simple-heartbeat-monitoring-with-riemann. It contains a working riemann.config file, a basic Python client to send the heartbeat events, and a README with instructions. All you need to get started!
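For reference, here is a condensed sketch of how the pieces above fit together in a riemann.config (the PagerDuty key, queue name, and pool sizes are placeholders; the repository contains the complete, tested version):

```clojure
(periodically-expire 5)

(def pd (pagerduty "my-pagerduty-service-key"))   ; placeholder service key

; Contact PagerDuty from a dedicated thread pool.
(def alert-pagerduty
  (async-queue! :pagerduty {:queue-size 1000 :core-pool-size 4 :max-pool-size 100}
    (:trigger pd)))

(let [index (index)]
  (streams
    ; Keep the latest "Heartbeat" from every host in the index (TTL 10 seconds).
    (where (and (service "Heartbeat")
                (not (state "expired")))
      (default :ttl 10
        index))

    ; When a host's heartbeat expires, page at most once per hour per host.
    (where (and (service "Heartbeat")
                (state "expired"))
      (by :host
        (throttle 1 3600
          prn
          alert-pagerduty)))))
```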
Conclusion
This example is not so far from one of the rules we have in production. While heartbeat monitoring can easily be delegated to external services, it is an interesting example to illustrate a few key concepts and constructs of Riemann and get you started with something useful. The Clojure syntax can be a bit confusing at the beginning, but past the learning curve, writing rules becomes natural.
Stay tuned for my next blog post where I will explain how to write metrics into a Cassandra cluster!