What is ClickHouse?

ClickHouse is a high-performance, open-source columnar database management system optimized for online analytical processing (OLAP). It handles large volumes of data at high speed, making it well suited for real-time analytics and big data environments. Unlike traditional row-oriented databases, ClickHouse stores and processes data by column rather than by row, which improves query speed for analytical workloads.

Available under the Apache-2.0 license, it has over 35K stars and 1500 contributors on GitHub. The repository can be found at https://github.com/ClickHouse/ClickHouse.

ClickHouse is commonly used in production and enterprise environments and provides several robust options for managing backup and restore of user databases. We’ll cover the most important options and how to implement them.

Overview of ClickHouse backup methods

ClickHouse provides multiple methods for creating and managing backups, each offering flexibility in terms of storage location, compression, encryption, and incremental backups. Here’s an overview of the main backup methods available:

  1. Local disk backup: Backups can be stored on a local disk by configuring a dedicated disk location in ClickHouse’s configuration file. The BACKUP command allows full or incremental backups of tables or entire databases, saving them to a specified local path. This is achieved by adding a configuration file that specifies the path and permissions for the backup location. Backups can then be created using BACKUP TABLE or BACKUP DATABASE commands with the specified disk.
  2. S3 and Azure Blob Storage: ClickHouse supports remote backups to S3-compatible storage and Azure Blob Storage, which is useful for distributed environments and ensures that data is safely stored off-site. Configuring these backups involves specifying endpoint URLs, access keys, and other required credentials in the BACKUP command. ClickHouse can also perform incremental backups on remote storage by referencing a base backup, which is beneficial for large datasets.
  3. Incremental backups: Incremental backups store only the changes since the last backup, reducing storage costs and backup time for large datasets. This is done by specifying the base backup file when initiating a new backup. However, both the base and incremental backups are required during a restore.
  4. Compressed and encrypted backups: ClickHouse supports custom compression levels and methods, such as lzma and gzip, which can reduce the backup size on disk. Password protection is also available for disk backups, providing an additional layer of security.
  5. Partition-level backups: ClickHouse enables users to back up or restore selected table partitions instead of entire tables, allowing more control over data recovery. This is beneficial in scenarios where only parts of the data need to be restored.
  6. File system snapshots and third-party tools: ClickHouse can also leverage filesystem snapshots (e.g., ZFS) for creating backups or use third-party tools like clickhouse-backup. These alternatives offer various levels of integration with the underlying storage system, with features for managing snapshots outside of ClickHouse’s native commands.

Each backup type can be tuned with additional settings, such as synchronous or asynchronous operation and limits on concurrent backups. The examples below illustrate several of these commands.
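For illustration, here are hedged examples of the native BACKUP and RESTORE statements described above. They assume a backup disk named 'backups' has already been declared in the server configuration (as in option 1), and use placeholder database, table, bucket, partition, and credential values; verify the exact syntax against the ClickHouse documentation for your version.

    -- Full backup of a database to a disk named 'backups' defined in the server configuration
    BACKUP DATABASE my_db TO Disk('backups', 'my_db_full.zip');

    -- Incremental backup that stores only changes since the base backup
    BACKUP DATABASE my_db TO Disk('backups', 'my_db_incr.zip')
        SETTINGS base_backup = Disk('backups', 'my_db_full.zip');

    -- Backup of a single table to S3-compatible storage (endpoint and credentials are placeholders)
    BACKUP TABLE my_db.events TO S3('https://my-bucket.s3.amazonaws.com/backups/events', '<access_key>', '<secret_key>');

    -- Compressed, password-protected backup on disk
    BACKUP TABLE my_db.events TO Disk('backups', 'events.zip')
        SETTINGS compression_method = 'lzma', compression_level = 3, password = 'secret';

    -- Run a backup asynchronously so the statement returns immediately
    BACKUP DATABASE my_db TO Disk('backups', 'my_db_async.zip') ASYNC;

    -- Restore only a selected partition of a table (placeholder partition expression)
    RESTORE TABLE my_db.events PARTITION '2024-01-01' FROM Disk('backups', 'my_db_full.zip');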

Quick tutorial: How to set up ClickHouse backups

Setting up backups in ClickHouse can be done using the clickhouse-backup utility, which provides a way to create local and remote backups while managing storage effectively. This tutorial walks through installing, configuring, and scheduling backups with clickhouse-backup.

Step 1: Install clickhouse-backup utility

To install the clickhouse-backup utility, run the following Bash script. This script downloads the specified version of clickhouse-backup, extracts it, and moves it to /usr/bin for easy access.

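(A minimal sketch of such a script is shown below. The version number and the binary's path inside the release archive are assumptions; check the clickhouse-backup releases page on GitHub for current values.)

    #!/bin/bash
    # Sketch of a clickhouse-backup install script.
    set -euo pipefail

    # Hypothetical version pin; see https://github.com/Altinity/clickhouse-backup/releases
    BACKUP_VERSION="2.6.3"

    # Download and extract the release tarball.
    wget "https://github.com/Altinity/clickhouse-backup/releases/download/v${BACKUP_VERSION}/clickhouse-backup-linux-amd64.tar.gz"
    tar -zxvf clickhouse-backup-linux-amd64.tar.gz

    # Move the extracted binary into /usr/bin (the path inside the archive may differ by release).
    sudo mv "$(find . -name clickhouse-backup -type f | head -n 1)" /usr/bin/clickhouse-backup
    sudo chmod +x /usr/bin/clickhouse-backup

    # Confirm the installation.
    clickhouse-backup --version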

Step 2: Configure clickhouse-backup

Once installed, configure clickhouse-backup by editing the configuration file /etc/clickhouse-backup/config.yml. Here’s an example configuration:
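(The values below are placeholders — adjust host names, credentials, paths, and retention counts, and verify field names against the clickhouse-backup documentation for your version. SFTP is assumed as the remote storage.)

    general:
      remote_storage: sftp            # s3, azblob, gcs, ftp, or none are other common options
      backups_to_keep_local: 7        # how many local backups to retain
      backups_to_keep_remote: 31      # how many remote backups to retain

    clickhouse:
      host: localhost
      port: 9000
      username: default
      password: ""
      skip_tables:                    # tables excluded from every backup
        - system.*
        - INFORMATION_SCHEMA.*
        - information_schema.*

    sftp:
      address: backup.example.com     # placeholder remote host
      port: 22
      username: backup_user
      password: "change-me"
      path: /backups/clickhouse
      compression_format: tar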

If your SFTP server authenticates with a PEM key, you can use the following configuration for the SFTP section (only that portion is shown).
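(The key path and field name below are assumptions to verify against your clickhouse-backup version.)

    sftp:
      address: backup.example.com
      port: 22
      username: backup_user
      key: /etc/clickhouse-backup/sftp_key.pem   # private key used instead of a password
      path: /backups/clickhouse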

This configuration specifies details for both ClickHouse and the remote storage, such as credentials, backup location, and tables to exclude.

Step 3: Create a backup script

The following script enables full and incremental backups to either local or remote storage, depending on the arguments passed. Save this script as /etc/default/clickhouse-backup-run.sh:
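(The argument handling and incremental logic below are a sketch; adapt the clickhouse-backup commands and the way the latest remote backup is selected to your version of the utility.)

    #!/bin/bash
    # Sketch of /etc/default/clickhouse-backup-run.sh
    # Usage: clickhouse-backup-run.sh <full|increment> <local|remote>
    set -euo pipefail

    BACKUP_TYPE="${1:-full}"      # full or increment
    BACKUP_TARGET="${2:-local}"   # local or remote
    BACKUP_NAME="$(hostname)-$(date +%Y-%m-%d-%H-%M-%S)"

    if [ "$BACKUP_TARGET" = "local" ]; then
        # Local backups are always full; retention is handled by backups_to_keep_local.
        clickhouse-backup create "$BACKUP_NAME"
    elif [ "$BACKUP_TYPE" = "full" ]; then
        # Create the backup and upload it to the configured remote storage.
        clickhouse-backup create_remote "$BACKUP_NAME"
    else
        # Incremental backup: upload only the parts changed since the latest remote backup.
        LAST_REMOTE="$(clickhouse-backup list remote | tail -n 1 | awk '{print $1}')"
        clickhouse-backup create_remote --diff-from-remote "$LAST_REMOTE" "$BACKUP_NAME"
    fi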

This script uses parameters to control whether the backup is full or incremental and whether it is stored locally or remotely. It checks the backup type and initiates the clickhouse-backup utility with the appropriate command.

Step 4: Schedule backups with cron

To automate backups, schedule the script with cron. The example below sets up daily local backups on every replica, plus weekly full and daily incremental remote backups on one replica per shard:

  1. Daily local backup (for each replica): Create a cron job in /etc/cron.d/clickhouse-backup:

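    (The time of day below is a placeholder.)

    # /etc/cron.d/clickhouse-backup
    # Full local backup every day at 01:30, on every replica
    30 1 * * * root /bin/bash /etc/default/clickhouse-backup-run.sh full local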

  2. Weekly full backup and daily incremental backup to remote storage (for one replica per shard): Configure the following cron jobs:
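    (The schedule below is a placeholder; install these entries only on one replica per shard.)

    # Full remote backup every Sunday at 03:00
    0 3 * * 0 root /bin/bash /etc/default/clickhouse-backup-run.sh full remote
    # Incremental remote backup Monday through Saturday at 03:00
    0 3 * * 1-6 root /bin/bash /etc/default/clickhouse-backup-run.sh increment remote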

     

These cron jobs handle regular backups with minimal manual intervention, ensuring data safety and easy recovery.

Note: You can also set up cron jobs with the crontab -e command, which opens the crontab in a text editor such as nano or vim.

Step 5: Secure and manage backup storage

To secure backups, consider moving them from accessible directories to a protected storage area. This script moves backups from an FTP-accessible directory to a secured directory:
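(Both paths below are placeholders; adjust them to your environment.)

    #!/bin/bash
    # Move completed backups out of the FTP-accessible directory into a protected one.
    set -euo pipefail

    SRC_DIR="/var/ftp/clickhouse"                # directory reachable over FTP/SFTP
    DEST_DIR="/var/backups/.clickhouse-secure"   # hidden, access-restricted directory

    mkdir -p "$DEST_DIR"
    chmod 700 "$DEST_DIR"

    # Move every backup (file or directory) at the top level of the source directory.
    find "$SRC_DIR" -mindepth 1 -maxdepth 1 -exec mv -t "$DEST_DIR" {} +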

Step 6: Clean up old backups

This cleanup script keeps only the latest 20 backups in the hidden directory:
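(The directory path is a placeholder; the retention count matches the 20 backups mentioned above.)

    #!/bin/bash
    # Keep only the 20 most recent backups in the protected directory; delete the rest.
    set -euo pipefail

    BACKUP_DIR="/var/backups/.clickhouse-secure"   # placeholder path
    KEEP=20

    cd "$BACKUP_DIR"
    # List entries newest-first, skip the first $KEEP, and remove everything older.
    ls -1t | tail -n +$((KEEP + 1)) | xargs -r rm -rf --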

With this setup, the ClickHouse backups are automated, secured, and regularly cleaned up, ensuring efficient use of storage space and reliable disaster recovery options.

Related content: Read our guide to ClickHouse tutorial

Tips from the expert


Suresh Vasanthakumar

Site Reliability Engineer

Suresh is a seasoned database engineer with over a decade of experience in designing, deploying and optimizing high-performance distributed systems. Specializing in in-memory data stores, Suresh has deep expertise in managing Redis and Valkey clusters for enterprise-scale applications.

In my experience, here are tips that can help you better manage and optimize ClickHouse backups:

  1. Use differential backups for complex data requirements: Consider differential backups in addition to full and incremental ones, particularly for complex, large datasets. Differential backups capture data since the last full backup, balancing storage efficiency with shorter recovery times compared to relying solely on incremental backups.
  2. Implement redundancy in backup storage: Distribute backups across multiple storage locations or providers. For example, you could store backups on both S3 and Azure Blob, or on different geographical regions within the same provider, increasing resilience in case of provider or regional outages.
  3. Automate snapshot backups for faster rollbacks: Use file system snapshots (like those offered by ZFS or LVM) alongside ClickHouse’s native backup tools. This allows for almost instantaneous rollbacks, making it useful for environments requiring frequent backups or quick rollbacks, such as in staging or test environments.
  4. Leverage partition-level backups to isolate critical data: For frequently accessed or regulatory-sensitive data, use partition-level backups. This approach allows you to isolate and restore only the most critical portions, reducing recovery time for specific datasets and enhancing data compliance.
  5. Configure pre- and post-backup scripts for data consistency: Implement scripts that prepare the system for backups and validate their completion. For example, pause ingestion processes or run final checkpoints before backups, then verify data consistency post-backup to ensure clean and reliable snapshots (see the sketch after this list).
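As an illustration of tip 5, here is a hedged sketch of a wrapper that pauses background merges around a clickhouse-backup run and then verifies that the backup exists. The backup name and the specific pre/post checks are illustrative; adapt them to your ingestion pipeline.

    #!/bin/bash
    # Sketch: pre/post hooks around a clickhouse-backup run.
    set -euo pipefail

    BACKUP_NAME="hooked-$(date +%Y-%m-%d)"

    # Pre-backup: flush in-memory log buffers and pause background merges.
    clickhouse-client --query "SYSTEM FLUSH LOGS"
    clickhouse-client --query "SYSTEM STOP MERGES"
    # Make sure merges are resumed even if the backup fails.
    trap 'clickhouse-client --query "SYSTEM START MERGES"' EXIT

    clickhouse-backup create "$BACKUP_NAME"

    # Post-backup: confirm the backup is actually listed locally.
    clickhouse-backup list local | grep -q "$BACKUP_NAME" || { echo "backup $BACKUP_NAME missing" >&2; exit 1; }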

Best practices for ClickHouse backups

Here are some best practices to consider when implementing backups in ClickHouse.

Regular backup scheduling

A consistent backup schedule is essential to protect data against unexpected loss. The frequency of backups should align with the data’s volatility and business requirements. For example, in environments where data changes frequently, daily full backups during low-traffic periods can minimize system impact while ensuring data is consistently protected.

Tools like clickhouse-backup can automate this process, reducing the risk of human error and ensuring backups are performed reliably. Regular scheduling also supports compliance with data retention policies and regulatory requirements. By maintaining a predictable backup routine, organizations can ensure data is available for restoration within acceptable timeframes.

Incremental backups

To optimize storage usage and reduce the time required for backup operations, implement incremental backups that capture only the changes since the last full backup. This approach is particularly beneficial for large datasets, as it minimizes the amount of data processed during each backup cycle.

However, it’s important to ensure that both base and incremental backups are available and properly managed, as they are required together for a complete restore. Implementing incremental backups requires planning and verification to manage dependencies between backup sets. A clear retention policy for incremental backups helps prevent excessive accumulation of backup files.

Secure backup storage

Storing backups in secure, off-site locations helps protect against hardware failures, data center incidents, or other disasters. ClickHouse supports remote backups to S3-compatible storage and Azure Blob Storage, enabling off-site storage. Implementing encryption for backups adds an additional layer of protection, keeping sensitive data secure if backup media is compromised.

In addition to encryption, access controls should be enforced to restrict backup access to authorized personnel only. Regular audits of backup storage environments can help identify and mitigate potential security vulnerabilities. Diversifying storage locations across geographic regions can provide additional resilience against regional outages or disasters.

Backup verification

Regularly testing backup restoration processes helps confirm data integrity and the reliability of the backup strategy. It ensures that backups are functional and can be restored promptly when needed, reducing downtime and potential data loss. Establishing a routine for backup verification helps identify and address issues proactively.

Automated testing of backup restorations can simplify the verification process and provide timely feedback on backup health. Documenting restoration procedures and maintaining up-to-date recovery plans are also essential components of a comprehensive backup strategy.
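As a sketch of what such automation might look like on a dedicated test host, assuming the clickhouse-backup utility and a placeholder table name; the restore flags and the way the latest backup is selected should be adapted to your version:

    #!/bin/bash
    # Sketch of a periodic restore drill, intended for a dedicated test host.
    set -euo pipefail

    # Pick the most recent local backup (downloaded or copied to this host beforehand).
    LATEST="$(clickhouse-backup list local | tail -n 1 | awk '{print $1}')"

    # Restore it, dropping any existing copies of the restored tables first.
    clickhouse-backup restore --rm "$LATEST"

    # Basic sanity check against a known table (placeholder name).
    clickhouse-client --query "SELECT count() FROM my_db.events"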

Exclude non-essential data

To conserve storage space and streamline the backup process, exclude system tables and other non-critical data from backups. Configuring the backup tool to omit tables like system.* and INFORMATION_SCHEMA.* ensures that only essential data is backed up, reducing the size and complexity of backup files.
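With clickhouse-backup, for example, this is typically expressed through the skip_tables list in config.yml; the patterns below are illustrative:

    clickhouse:
      skip_tables:
        - system.*
        - INFORMATION_SCHEMA.*
        - information_schema.*
        - _temporary_and_external_tables.*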

Regularly reviewing and updating the list of excluded data is important to adapt to changing data environments. As new tables or databases are introduced, assessing their criticality ensures that backup policies remain aligned with business priorities.

Monitor backup processes

Implementing monitoring for backup operations allows failures or issues to be detected and addressed promptly. Setting up alerts for backup completion statuses and errors helps maintain the reliability of the backup system. Monitoring tools can provide insights into backup performance, duration, and success rates, enabling continuous improvement of the backup process.

Regular analysis of monitoring data can reveal trends or recurring issues that may require attention. For example, increasing backup durations might indicate growing data volumes or performance bottlenecks.
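For native BACKUP and RESTORE operations, the system.backups table is a natural source for such monitoring; a hedged example query follows (available columns vary by ClickHouse version). The clickhouse-backup utility can similarly be wrapped so that a non-zero exit code triggers an alert.

    -- Recent backup and restore operations with their status and any error message
    SELECT name, status, error, start_time, end_time
    FROM system.backups
    ORDER BY start_time DESC
    LIMIT 10;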

Reliable data backups and restoration with Instaclustr for ClickHouse

Instaclustr for ClickHouse offers an outstanding solution for organizations seeking a powerful, fully-managed ClickHouse experience. A standout feature of the service is the robust backup system, designed to protect critical data and ensure peace of mind, no matter the scale of operations.

Instaclustr’s automated backup capabilities include incremental backups that capture data precisely when it is needed. These seamless backups are designed to minimize operational impact, allowing teams to focus on leveraging ClickHouse’s lightning-fast analytic queries without disruption. Whether analyzing large datasets or running real-time reports, Instaclustr’s backup solutions guarantee data is securely stored and easily retrievable when required.

Additionally, Instaclustr ensures effortless restoration, whether rolling back to a specific point in time or recovering from the unexpected. Instaclustr’s approach not only reinforces business continuity but also aligns with best practices for data protection and compliance. Plus, with the constant support of Instaclustr’s expert team, organizations always have guidance to optimize their ClickHouse environments.

By streamlining backup processes and ensuring top-tier reliability, Instaclustr for ClickHouse empowers businesses with data resilience to make confident decisions supported by secure, accessible, and well-managed data solutions.

For more information: