Creating a ClickHouse Cluster with Tiered Storage

ClickHouse allows more than one block device to be designated for data storage. This makes it possible to use additional storage types, such as object storage like AWS S3, and also introduces support for storage tiering as found in typical hot-cold storage setups: data can be moved to relatively cheaper storage once the local disk is sufficiently full or the data has sufficiently aged. This approach can significantly reduce overall storage costs, and for users with large datasets it is usually the preferable storage configuration.

Instaclustr for ClickHouse lets you provision a ClickHouse cluster with storage tiering enabled.

Limitations 

Before you begin, please take note of the following limitations with Tiered storage for ClickHouse: 

  • Can only be used with a Run In Your Own Account (RIYOA) provider account setup.
  • Currently only AWS S3 can be used as remote storage.
  • Feature cannot be enabled/disabled on an existing cluster.
  • Only available for use by MergeTree and Log family table engines. 

How to configure your cluster for storage tiering 

Get your AWS S3 bucket ready 

To be able to use tiered storage, an AWS S3 bucket must be configured meeting the following conditions: 

  • It must be in the same region as the cluster.
  • Ideally, it should be in the same account as the one the cluster is provisioned in. However, if it is in a different account, it must have a bucket policy that allows access by the IAM role of the Cluster Data Centre (which is created after the cluster has been provisioned). Check the relevant documentation here.
  • It must not have any of the following features enabled:
    • Versioning
    • Object locking
    • Lifecycle rules
    • Intelligent-tiering archiving
    • Server-side encryption using AWS Key Management Service (KMS) 

Here are the steps to set up an S3 bucket:

  1. Log in to your AWS account. Go to S3 > Create bucket, choosing the same region as the intended cluster.
  2. Enter a name for the bucket.
  3. Leave Bucket Versioning disabled.
  4. For encryption, only Amazon S3 managed keys (SSE-S3) are currently supported.
  5. In Advanced settings, make sure Object Lock is disabled.
  6. Lastly, go ahead and create the bucket.

Provision a cluster with tiered storage 

Now that the bucket for remote storage is ready, you can proceed with cluster provisioning. Head over to the Instaclustr Console to create a ClickHouse cluster, and select the Storage Tiering option on the ClickHouse Setup page.
Enable storage tiering

Once this option is checked, you will need to provide the S3 bucket name for remote storage and, optionally, a prefix. Make sure the bucket exists and meets all the requirements mentioned in the bucket setup section above, then input the name of the bucket.
Bucket details and prefix

By default, a directory with the same name as the Cluster ID will be created at the root of the bucket to store cluster data. Ideally, you should provide an empty bucket for remote storage. However, if you choose to use a bucket with other contents in it and would like to isolate a directory for ClickHouse use, you can optionally enter a prefix. In that case, the root-level directory will be named according to the given prefix (i.e. data will go into your-bucket:your-prefix/<Cluster_ID>/). Otherwise, leave the prefix blank, proceed with the remaining steps, and provision your cluster.

Alternatively, the Instaclustr API or the Instaclustr Terraform provider can be used to provision a ClickHouse cluster by including the tiered storage details in the request body or Terraform configuration.

Basic usage

Use of remote storage is governed by two pre-configured storage policies:

  • ic_tiered: This policy initially stores data on the local volume (hot). Once 60% of the local disk is used, it starts pushing data to remote storage. During read operations, data is pulled from remote storage and cached on the local disk; the cache can grow up to 20% of the size of the local data disk.
    Important: For clusters where Tiered Storage has been successfully enabled and configured with a valid S3 bucket, it is not necessary to explicitly set the storage policy to ic_tiered at table creation time, since it is the default in that case. If a table needs to be stored on local disk only, set storage_policy to ‘default’. Note that specifying ‘default’ is not required if the cluster does not have storage tiering enabled.
  • ic_remote_with_cache: This policy forces the entire table to be stored on remote storage. ClickHouse may still cache data locally during read operations, in the same way as described above. Depending on your read pattern and workload, this policy may result in increased read latency, so it should be used only when it is suitable for your specific scenario.

The following example shows the general structure of how these policies can be specified at table creation time: 
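A minimal sketch, assuming a simple MergeTree table (the table and column names below are illustrative placeholders):

```sql
-- Illustrative example; table and column names are placeholders.
-- On a tiered-storage cluster, ic_tiered is already the default policy,
-- so the SETTINGS clause is only needed to choose a different one.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (event_date, user_id)
SETTINGS storage_policy = 'ic_remote_with_cache';

-- To keep a table on local disk only:
-- SETTINGS storage_policy = 'default'
```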

The tiered storage feature can also be used in combination with Table TTLs to enable movement of data from local to remote storage based on the age of data. Let’s take a look at the following table creation example:
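The sketch below assumes the remote volume inside the ic_tiered policy is named 'remote'; the actual volume name on your cluster may differ and can be checked by querying system.storage_policies. Column names are again illustrative.

```sql
-- Illustrative example; the volume name 'remote' is an assumption --
-- verify the actual volume name via: SELECT * FROM system.storage_policies
CREATE TABLE events_with_ttl
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 1 WEEK TO VOLUME 'remote',
    event_date + INTERVAL 5 WEEK DELETE
SETTINGS storage_policy = 'ic_tiered';
```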

With the above table definition in place, data will be moved from the local volume to remote storage once it is one week old, and deleted from remote storage once it is five weeks old.
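If you want to verify where data currently resides, the system.parts table reports the disk holding each active data part (replace your_table with your own table name):

```sql
-- Shows which disk (local or remote) each active part of a table is on.
SELECT name, disk_name, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'your_table' AND active;
```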

What’s important to remember 

  • It is highly recommended that the bucket designated as remote storage is not used for any other purpose, as accidentally deleted or modified data may not be recoverable.
  • Deleting a cluster will not automatically delete data stored in remote storage. 

Questions 

Please contact [email protected] for any further inquiries. 

 

By Instaclustr Support