Configure a ClickHouse Cluster for Tiered Storage

Get your AWS S3 bucket ready 

To use tiered storage, you need an AWS S3 bucket that meets the following conditions:

  • It must be in the same region as the cluster.
  • Ideally, it should be in the same account that the cluster is provisioned in. If it is in a different account, it must have a bucket policy that allows access by the IAM role of the Cluster Data Centre (this role is created after the cluster has been provisioned). See the relevant documentation for details.
  • It must not have any of the following features enabled:
    • Versioning
    • Object locking
    • Lifecycle rules
    • Intelligent-tiering archiving
    • Server-side encryption using AWS Key Management Service (KMS) 

Here are the steps to set up an S3 bucket:

  1. Log in to your AWS account. Go to S3 > Create bucket, and select the same region as the intended cluster.
  2. Enter a name for the bucket.
  3. Leave Bucket Versioning disabled.
  4. For encryption, select Amazon S3 managed keys (SSE-S3), as this is the only option currently supported.
  5. In Advanced settings, make sure Object Lock is disabled.
  6. Create the bucket.
  7. Create a ClickHouse cluster with Tiered Storage enabled, following the guide in Creating a ClickHouse Cluster.

Basic usage

Use of remote storage is governed by two pre-configured storage policies:

  • ic_tiered: This policy initially stores data on the local (hot) volume, and starts moving data to remote storage once 60% of the local disk is used. When read operations are performed, data is pulled from remote storage and cached on the local disk. The cache can grow up to 20% of the size of the local data disk.
    Important: On clusters where Tiered Storage has been successfully enabled and configured with a valid S3 bucket, ic_tiered is made the default policy, so it does not need to be set explicitly at table creation time. However, if a table needs to be stored on local disk only, its storage_policy must be set to 'default'. Specifying 'default' is not required if the cluster does not have storage tiering enabled.
  • ic_remote_with_cache: This policy forces the entire table to be stored on remote storage. ClickHouse may still cache data on the local disk when read operations are performed, in the same way as described above. Depending on your read pattern and workload, this policy may result in increased read latency, so use it only when it suits your specific scenario.
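
To check which policies are available on a given cluster, the standard ClickHouse system table system.storage_policies can be queried. The query below is a read-only sketch. The volume and disk names it returns depend on how the cluster was provisioned and are not specified in this guide.

```sql
-- List every storage policy known to the server, with its volumes and disks.
-- On a cluster with tiered storage enabled, ic_tiered and ic_remote_with_cache
-- are expected to appear here alongside the built-in 'default' policy.
SELECT
    policy_name,
    volume_name,
    volume_priority,
    disks,
    move_factor
FROM system.storage_policies
ORDER BY policy_name, volume_priority;
```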

The following example shows the general structure of how these policies can be specified at table creation time: 
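
As a minimal sketch of that structure, the policy is passed through the storage_policy table setting. The table and column names here are hypothetical, and a plain MergeTree engine is assumed for brevity; a replicated cluster may need a different engine.

```sql
-- Hypothetical table kept entirely on remote storage (with local read cache).
CREATE TABLE events_remote
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
SETTINGS storage_policy = 'ic_remote_with_cache';

-- Hypothetical table kept on local disk only, on a cluster where
-- ic_tiered would otherwise be applied by default.
CREATE TABLE events_local
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
SETTINGS storage_policy = 'default';
```

Omitting the SETTINGS clause on a tiered cluster leaves the table on ic_tiered, as described in the note above.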

The tiered storage feature can also be used in combination with Table TTLs to enable movement of data from local to remote storage based on the age of data. Let’s take a look at the following table creation example:
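
A minimal sketch of such a table definition, using the standard ClickHouse TTL ... TO VOLUME and TTL ... DELETE clauses. The table and column names are hypothetical, and the volume name 'remote' is a placeholder assumption: the actual name of the remote volume in the ic_tiered policy should be looked up in system.storage_policies before using this.

```sql
-- Hypothetical table; 'remote' is an ASSUMED volume name, not confirmed
-- by this guide. Replace it with the remote volume of the ic_tiered policy.
CREATE TABLE events_ttl
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 1 WEEK TO VOLUME 'remote',  -- move after 1 week
    event_time + INTERVAL 5 WEEK DELETE               -- delete after 5 weeks
SETTINGS storage_policy = 'ic_tiered';
```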

With the above table definition in place, data is moved from the local volume to remote storage once it is 1 week old, and is then deleted from remote storage once it is 5 weeks old.
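
One way to observe where data currently resides is to query the standard system.parts table, which reports the disk holding each data part. In this sketch, events_ttl is a hypothetical table name; substitute your own.

```sql
-- Show the disk on which each active part of a table is stored.
-- 'events_ttl' is a hypothetical table name used for illustration.
SELECT
    partition,
    name,
    disk_name,
    formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'events_ttl'
  AND active
ORDER BY partition, name;
```

TTL moves are applied in the background, so parts may not change disk at the exact moment they cross the age threshold.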

What’s important to remember 

  • It is highly recommended that the bucket designated as remote storage is not used for any other purpose, as accidentally deleted/mutated data may not be recoverable/revertible.
  • Deleting a cluster will not automatically delete data stored in remote storage. 

Questions 

Please contact Instaclustr Support for any further inquiries.

By Instaclustr Support