We often get questions from customers about the best way to add capacity to their cluster. Is it better to add nodes, or simply to increase the capacity in their nodes? Unfortunately, the truth is there is no best way—like all complex issues in distributed systems, there are benefits and drawbacks to each scaling approach.
While each of our highly distributed systems (Apache Cassandra, Apache Kafka, etc.) have slightly different implementations of scaling, the concepts remain consistent across most distributed systems.
There can be many reasons for needing to scale your cluster. However, we have found that the 2 most common are:
- Not enough disk capacity
- Not enough processing capacity
What Is Horizontal Scaling?
Horizontal scaling, or scaling out, is the act of adding more of the same size nodes to your cluster. For example, this is going from a 6-node cluster to a 9-node cluster. This form of scaling can be broken up into 3 stages:
Before Adding Nodes: At this stage, you are running a 6-node cluster. You may realize that you need to increase capacity, so you make the decision to add 3 nodes to your cluster.
Re-Streaming Data: During this stage you have an additional 3 nodes coming online to your cluster to add capacity. However, before these new nodes can begin servicing requests, they first need to stream the relevant partitions of data onto their own disk. Cassandra refers to this stage as “Joining”. Depending on the amount of data in your cluster, and size of your nodes, this can take hours or days to complete.
The major thing to consider during this stage is that although you have added 3 nodes to your cluster, you actually have reduced processing capacity! That’s because your original 6 nodes are still servicing all your requests, but now also servicing the streaming requests from the joining nodes! It’s for this reason that we never recommend scaling horizontally on a cluster which is under a high processing load.
Cleaning Up Old Data: At this stage, you have copied all your data to your new nodes, and they are now servicing both read and write requests from your application. However, your original nodes are now storing an additional copy of the data that they streamed to the newly joined nodes! We need to reclaim that used disk space. In Cassandra, this is known as a “Cleanup” and will need to be run in order to get back that disk space.
Scaling Back Down: If you decide later that you wish to reduce the capacity of your cluster back down to 6, then you will need to run a decommissioning process. This process involves ensuring that the copies of data that are on your node you wish to remove are relocated to the correct remaining nodes. Depending on the size of your cluster, and the amount of data, this can take hours to days to complete.
What Is Vertical Scaling?
Vertical scaling, or scaling up, is when you increase the size of the nodes in your cluster, providing increased processing capacity, or disk size. This can be broken up into 4 main stages.
Before Increasing Capacity: At this stage, you are running a 6-node cluster. You again realize that you need to add capacity and decide to double the storage and processing capacity on each node. You will perform the next 2 steps one by one on each node:
Taking the Instance Down (Instance Increase): If you are increasing the instance size you are utilizing then you will first need to stop the instance before you can resize it in your chosen cloud provider. It’s again worth highlighting that at this stage you will be reducing both processing and disk capacity in your cluster while the instances are down.
If you do wish to only increase the size of your storage, this can be achieved without taking down the application, which does not result in any reduction in processing capacity.
Performing the Resize: During this stage you will resize the instance and/or disk, in your chosen cloud provider. Once that has been completed, you will need to restart the instance and ensure the cluster is healthy before moving onto the next instance until they are all completed. Depending on the application, you may need to modify some of the application settings to utilize the additional resources.
Generally, you can complete the vertical scaling in around 15 minutes per node.
It’s worth highlighting that if you are using an instance type with attached storage, vertical scaling generally requires replacing each instance with a larger size, and re-streaming the data. We have implemented the Advanced Node Replace mechanism, which is able to reduce the performance impact during these operations for instance store nodes running Cassandra.
Scaling Back Down: If you decide later that you wish to reduce the capacity of your cluster back down to the smaller instance size, it depends on if you need to reduce the instance size or the disk size.
Instance Only: If you are only reducing the size of the instance and not the size of the disk, you can essentially repeat the process, but instead of increasing the instance size, you can decrease it.
Reducing Disk: Reducing the size of a disk is not as straightforward—some cloud providers will not let you reduce the size of networked disk storage. If you need to scale the disk down also, then you will need to essentially replace your node, which usually involves a re-streaming process moving around a large amount of data.
Horizontal vs Vertical Scaling – Which Is Right for You?
Both horizontal and vertical scaling are great options, and both suit different use cases.
Vertical scaling is great when you need to quickly add capacity to a cluster. When you only increase the size of the instance, without modifying the underlying storage, you also retain the ability to quickly reduce the cluster back down in size. Vertical scaling is what our Customer-Initiated Resize feature automates for customers. We recommend vertical scaling for short-term or seasonal peaks in processing requirements. This is because after the operation is completed, you can reduce the size of your cluster quickly.