What is ClickHouse

ClickHouse is a column-oriented database management system for online analytical processing (OLAP) queries. Originally developed at Yandex, it delivers high performance by processing data in large blocks.

Unlike traditional row-oriented databases, ClickHouse stores data by columns, optimizing read times for complex queries. A common use case for ClickHouse is large-scale data analytics, due to its ability to handle massive data volumes rapidly.

ClickHouse’s open-source nature allows for flexibility in development environments. The tool supports SQL queries, enabling easy integration with existing systems. It is useful in situations requiring real-time analytics and reporting, often in industries such as finance and retail.

ClickHouse is licensed under the Apache 2.0 open-source license. It has received over 36,000 GitHub stars and has more than 1500 contributors. You can get ClickHouse at the official GitHub repo.

Key features of ClickHouse

The following capabilities make ClickHouse useful for large-scale data analytics:

  • Columnar storage: Stores data by columns rather than rows. This structure optimizes read operations for analytical queries, as only the required columns are read from disk, reducing I/O overhead and increasing query speed.
  • Real-time data ingestion: Supports real-time data ingestion, allowing it to handle high-velocity data streams. This makes it suitable for environments requiring up-to-the-minute analytics, such as monitoring systems or financial markets.
  • Distributed and scalable architecture: Scales horizontally across multiple nodes, distributing data and queries across a cluster. This distributed nature enables it to handle petabytes of data while maintaining high performance.
  • Data compression: Uses compression algorithms, including LZ4 and ZSTD, to minimize storage space and reduce the amount of data that needs to be read from disk during queries. This saves disk space and improves query performance.
  • Vectorized query execution: Processes data in batches rather than row by row. This method takes advantage of modern CPU architectures, leading to faster query execution, especially for complex aggregations and transformations.
  • SQL support with extensions: Supports standard SQL, while also extending SQL with features like array joins, nested data structures, and window functions, providing more flexibility in query formulation.
  • Ecosystem and integrations: Has a rich ecosystem of tools and integrations, including support for Kafka for real-time data ingestion, integration with Grafana for visualization, and connectors for various data formats like JSON, Parquet, and ORC.

Related content: Read our guide to ClickHouse cluster (coming soon)

Tutorial 1: ClickHouse quick start

This quick start guide will help you set up ClickHouse on your local machine and perform basic operations like creating a table, inserting data, and running queries. These instructions are adapted from the ClickHouse documentation.
 

Installing ClickHouse

ClickHouse can run on Linux, FreeBSD, macOS, and Windows via WSL.

  • To download ClickHouse, use the following curl command, which will detect your OS and download the appropriate binary:

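Per the ClickHouse quick start, the install script served at clickhouse.com detects your OS and architecture:

```shell
# Download the appropriate ClickHouse binary for this machine
curl https://clickhouse.com/ | sh
```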

  • Once the binary is downloaded, you can start the ClickHouse server by executing:
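Following the quick start, the server is started from the directory containing the downloaded binary:

```shell
# Start the ClickHouse server in the foreground
./clickhouse server
```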
  • To interact with the ClickHouse server, you need to use the clickhouse-client. Open a new terminal window, navigate to the directory containing the ClickHouse binary, and run:

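As in the quick start, the bundled client connects to the local server:

```shell
# Connect to the locally running server with the built-in client
./clickhouse client
```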

  • If the client successfully connects to the server, you’ll see a smiling face, confirming the connection.

 

Setting up a table

In ClickHouse, creating a table is similar to other SQL databases, with one key addition: the ENGINE clause. This clause defines how the data will be stored and managed. Here’s an example of creating a simple table:

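A sketch adapted from the ClickHouse quick start; the table and column names are illustrative:

```sql
CREATE TABLE my_first_table
(
    user_id UInt32,
    message String,
    timestamp DateTime,
    metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp);
```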

In this example, the MergeTree engine is used, which is optimized for handling large volumes of data. The PRIMARY KEY specifies the columns used to sort the data within the table, optimizing query performance.
 

Inserting data into the table

Insert data into the ClickHouse table using the standard INSERT INTO command. Note that each insertion creates a new part in the storage, so it’s more efficient to insert data in bulk. Here’s an example:

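A sketch adapted from the quick start, assuming the my_first_table example schema:

```sql
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!', now(), -1.0),
    (102, 'Insert a lot of rows per batch', yesterday(), 1.41421),
    (102, 'Sort your data based on your commonly-used queries', today(), 2.718),
    (101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159);
```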

This command inserts multiple rows into the table. For optimal performance, especially with the MergeTree engine, it’s best to insert tens of thousands or even millions of rows at a time.

You can retrieve all records from the table and order them by the timestamp column:
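Assuming the my_first_table name used in this guide:

```sql
SELECT *
FROM my_first_table
ORDER BY timestamp;
```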

The results will be displayed in a table format in the client.

Once you’re familiar with the basics, you can begin inserting your own data. ClickHouse provides a variety of integrations and table functions for importing data from different sources, including S3, MySQL, and PostgreSQL.

For example, you can use the s3 table function to read data directly from an S3 bucket:

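A minimal sketch; the bucket URL and file are hypothetical:

```sql
-- Read Parquet data directly from S3 without creating a table first
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/data/events.parquet', 'Parquet')
LIMIT 10;
```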

To move this data into a ClickHouse table, you can use the following command:

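A sketch with the same hypothetical bucket, assuming the file’s columns match the target table’s schema:

```sql
-- Stream the S3 data straight into a ClickHouse table
INSERT INTO my_first_table
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/data/events.parquet', 'Parquet');
```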

Tips from the expert

Andrew Mills


Senior Solution Architect

Andrew Mills is an industry leader with extensive experience in open source data solutions and a proven track record in integrating and managing Apache Kafka and other event-driven architectures.

In my experience, here are tips that can help you better utilize ClickHouse for large-scale data processing and analytics:

  • Take advantage of ClickHouse’s asynchronous insert mode: If you’re dealing with high-velocity data ingestion, use the async_insert and async_insert_max_data_size settings to optimize throughput. These settings queue inserts asynchronously, reducing client-server latency.
  • Bulk inserts for better storage efficiency: Avoid frequent small inserts. Batch data into large blocks (100,000 rows or more) for insertion. This reduces the overhead of part merges and improves compression ratios, saving both storage space and write time.
  • Optimize MergeTree settings for your workload: Fine-tune parameters such as index_granularity, merge_with_ttl_timeout, and max_part_size_for_merge in the MergeTree engine. These settings can significantly improve performance for both read-heavy and write-heavy workloads.
  • Use materialized views for real-time aggregation: Materialized views can automate data pre-aggregation, which boosts query performance for real-time dashboards. They allow you to aggregate metrics as data is ingested, reducing the need for heavy on-the-fly calculations.
  • Leverage ClickHouse’s data skipping indices: Use primary and secondary indices (like minmax or bloom_filter indexes) to skip over large parts of the dataset during query execution. This can drastically reduce the amount of data that needs to be read and processed, especially in large tables.

Tutorial 2: Managing and analyzing a large dataset

In this section, we’ll explore how to manage and analyze large datasets using ClickHouse. We will use a real-world example by working with Amazon customer review data, which contains millions of records.
 

Creating and populating a table

  1. First, we’ll create a table to store the review data. This dataset includes details like the product, category, star rating, helpful and total votes, and whether the purchase was verified. Below is the SQL command to create the amazon_reviews table:

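A sketch of the table; column names and types are illustrative and should be adjusted to the actual dataset:

```sql
CREATE TABLE amazon_reviews
(
    review_date       Date,
    product_category  LowCardinality(String),
    product_id        String,
    product_title     String,
    star_rating       UInt8,
    helpful_votes     UInt32,
    total_votes       UInt32,
    verified_purchase Bool,
    review_headline   String,
    review_body       String
)
ENGINE = MergeTree
ORDER BY (product_category, review_date);
```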
    This table is designed to handle and store the data, leveraging the MergeTree engine, which is optimal for managing large datasets.

  2. Next, we’ll insert a large dataset into this table. The dataset consists of approximately two million rows of review data. Here’s how you can insert the data directly from files stored in an S3 bucket:

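A sketch with a hypothetical bucket; the glob pattern lets the s3() function read a set of Parquet files in one pass:

```sql
INSERT INTO amazon_reviews
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/amazon_reviews/*.snappy.parquet', 'Parquet');
```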
    This command reads data from the specified S3 URLs and inserts it into the amazon_reviews table. ClickHouse handles the large data volumes, ensuring the entire dataset is ingested quickly.

 

Analyzing data with ClickHouse

Once the data is loaded, you can begin performing various analytical queries:

  1. Start by computing the average star rating across all reviews:

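A sketch, assuming a star_rating column as in the table definition used in this tutorial:

```sql
SELECT round(avg(star_rating), 2) AS avg_rating
FROM amazon_reviews;
```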

  2. Next, let’s calculate the average number of helpful and total votes by category:

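A sketch, assuming product_category, helpful_votes, and total_votes columns:

```sql
SELECT product_category,
       avg(helpful_votes) AS avg_helpful,
       avg(total_votes)   AS avg_total
FROM amazon_reviews
GROUP BY product_category
ORDER BY avg_helpful DESC;
```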

  3. You can also perform more complex calculations, such as analyzing the helpful votes for verified and unverified purchases with a star rating of 4 or better:

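A sketch, assuming verified_purchase, helpful_votes, and star_rating columns:

```sql
SELECT verified_purchase,
       sum(helpful_votes) AS helpful,
       count() AS reviews
FROM amazon_reviews
WHERE star_rating >= 4
GROUP BY verified_purchase;
```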
    This query aggregates helpful votes and review counts for highly rated reviews, split by purchase-verification status.

 

Leveraging dictionaries and performing joins

ClickHouse supports the use of dictionaries, which are mappings of key-value pairs stored in memory, enabling faster lookups during queries:

  1. Create a dictionary based on a CSV file containing VERIFIED/UNVERIFIED key-value pairs:

    Note: the enum.csv file resides in the same folder as the data and has the following structure:

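A sketch; the dictionary name, attribute names, and the CSV contents shown in the comments are illustrative:

```sql
-- enum.csv (illustrative contents):
--   0,UNVERIFIED
--   1,VERIFIED
CREATE DICTIONARY verified_status
(
    id    UInt64,
    label String
)
PRIMARY KEY id
SOURCE(FILE(path 'enum.csv' format 'CSV'))
LAYOUT(FLAT())
LIFETIME(MIN 0 MAX 0);
```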
    This dictionary can then be used in queries to join with the amazon_reviews table or retrieve specific values.

  2. For example, to perform a join using the newly created dictionary:
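A sketch, assuming a dictionary named verified_status keyed on the verification flag and a table with verified_purchase and helpful_votes columns:

```sql
SELECT dictGet('verified_status', 'label', toUInt64(verified_purchase)) AS status,
       sum(helpful_votes) AS total_helpful
FROM amazon_reviews
GROUP BY status;
```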

    This query identifies the number of helpful votes by verification status by joining the table with a dictionary.

Efficiency and scalability amplified: The benefits of Instaclustr for ClickHouse

Instaclustr provides a range of benefits for ClickHouse, making it an excellent choice for organizations seeking efficient and scalable management of these deployments. With its managed services approach, Instaclustr simplifies the deployment, configuration, and maintenance of ClickHouse, enabling businesses to focus on their core applications and data-driven insights.

Some of these benefits are:

  • Infrastructure provisioning, configuration, and security, ensuring that organizations can leverage the power of this columnar database management system without the complexities of managing it internally. By offloading these operational tasks to Instaclustr, organizations can save valuable time and resources, allowing them to focus on utilizing ClickHouse to its full potential.
  • Seamless scalability to meet growing demands. With automated scaling capabilities, ClickHouse databases can expand or contract based on workload requirements, ensuring optimal resource utilization and cost efficiency. Instaclustr’s platform actively monitors the health of the ClickHouse cluster and automatically handles scaling processes, allowing organizations to accommodate spikes in traffic and scale their applications effectively.
  • High availability and fault tolerance for ClickHouse databases. By employing replication and data distribution techniques, Instaclustr ensures that data is stored redundantly across multiple nodes in the cluster, providing resilience against hardware failures and enabling continuous availability of data. Instaclustr’s platform actively monitors the health of the ClickHouse cluster and automatically handles failover and recovery processes, minimizing downtime and maximizing data availability for ClickHouse deployments.

Furthermore, Instaclustr’s expertise and support are invaluable for ClickHouse databases. Our team of experts has deep knowledge and experience in managing and optimizing ClickHouse deployments. We stay up-to-date with the latest advancements in ClickHouse technologies, ensuring that the platform is compatible with the latest versions and providing customers with access to the latest features and improvements. Instaclustr’s 24/7 support ensures that organizations have the assistance they need to address any ClickHouse-related challenges promptly.

For more information: