What is ClickHouse

ClickHouse is a column-oriented database management system for online analytical processing (OLAP) queries. Originally developed at Yandex, it delivers high performance by processing data in large blocks.

Unlike traditional row-oriented databases, ClickHouse stores data by columns, optimizing read times for complex queries. A common use case for ClickHouse is large-scale data analytics, due to its ability to handle massive data volumes rapidly.

ClickHouse’s open-source nature allows for flexibility in development environments. The tool supports SQL queries, enabling easy integration with existing systems. It is useful in situations requiring real-time analytics and reporting, often in industries such as finance and retail.

ClickHouse is licensed under the Apache 2.0 open-source license. It has received over 36,000 GitHub stars and has more than 1500 contributors. You can get ClickHouse at the official GitHub repo.

Key features of ClickHouse

The following capabilities make ClickHouse useful for large-scale data analytics:

  • Columnar storage: Stores data by columns rather than rows. This structure optimizes read operations for analytical queries, as only the required columns are read from disk, reducing I/O overhead and increasing query speed.
  • Real-time data ingestion: Supports real-time data ingestion, allowing it to handle high-velocity data streams. This makes it suitable for environments requiring up-to-the-minute analytics, such as monitoring systems or financial markets.
  • Distributed and scalable architecture: Scales horizontally across multiple nodes, distributing data and queries across a cluster. This distributed nature enables it to handle petabytes of data while maintaining high performance.
  • Data compression: Uses compression algorithms, including LZ4 and ZSTD, to minimize storage space and reduce the amount of data that needs to be read from disk during queries. This saves disk space and improves query performance.
  • Vectorized query execution: Processes data in batches rather than row by row. This method takes advantage of modern CPU architectures, leading to faster query execution, especially for complex aggregations and transformations.
  • SQL support with extensions: Supports standard SQL, while also extending SQL with features like array joins, nested data structures, and window functions, providing more flexibility in query formulation.
  • Ecosystem and integrations: Has a rich ecosystem of tools and integrations, including support for Kafka for real-time data ingestion, integration with Grafana for visualization, and connectors for various data formats like JSON, Parquet, and ORC.

Related content: Read our guide to ClickHouse cluster (coming soon)

Tutorial 1: ClickHouse quick start

This quick start guide will help you set up ClickHouse on your local machine and perform basic operations like creating a table, inserting data, and running queries. These instructions are adapted from the ClickHouse documentation.
 

Installing ClickHouse

ClickHouse can run on Linux, FreeBSD, macOS, and Windows via WSL.

  • To download ClickHouse, use the following curl command, which will detect your OS and download the appropriate binary:

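Per the ClickHouse quick start, the install script served at clickhouse.com detects your OS and architecture:

```shell
# Download the appropriate ClickHouse binary for this machine
curl https://clickhouse.com/ | sh
```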

  • Once the binary is downloaded, you can start the ClickHouse server by executing:
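Following the quick start, the server is started from the directory containing the downloaded binary:

```shell
# Start the ClickHouse server in the foreground
./clickhouse server
```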
  • To interact with the ClickHouse server, you need to use the clickhouse-client. Open a new terminal window, navigate to the directory containing the ClickHouse binary, and run:

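As in the quick start, the bundled client connects to the local server:

```shell
# Connect to the locally running server with the built-in client
./clickhouse client
```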

  • If the client successfully connects to the server, you’ll see a smiling face, confirming the connection.

 

Setting up a table

In ClickHouse, creating a table is similar to other SQL databases, with one key addition: the ENGINE clause. This clause defines how the data will be stored and managed. Here’s an example of creating a simple table:

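A sketch adapted from the ClickHouse quick start; the table and column names are illustrative:

```sql
CREATE TABLE my_first_table
(
    user_id UInt32,
    message String,
    timestamp DateTime,
    metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp);
```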

In this example, the MergeTree engine is used, which is optimized for handling large volumes of data. The PRIMARY KEY specifies the columns used to sort the data within the table, optimizing query performance.
 

Inserting data into the table

Insert data into the ClickHouse table using the standard INSERT INTO command. Note that each insertion creates a new part in the storage, so it’s more efficient to insert data in bulk. Here’s an example:

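A sketch adapted from the quick start, assuming the my_first_table example schema:

```sql
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!', now(), -1.0),
    (102, 'Insert a lot of rows per batch', yesterday(), 1.41421),
    (102, 'Sort your data based on your commonly-used queries', today(), 2.718),
    (101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159);
```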

This command inserts multiple rows into the table. For optimal performance, especially with the MergeTree engine, it’s best to insert tens of thousands or even millions of rows at a time.

You can retrieve all records from the table and order them by the timestamp column:
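Assuming the my_first_table name used in this guide:

```sql
SELECT *
FROM my_first_table
ORDER BY timestamp;
```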

The results will be displayed in a table format in the client.

Once you’re familiar with the basics, you can begin inserting your own data. ClickHouse provides a variety of integrations and table functions for importing data from different sources, including S3, MySQL, and PostgreSQL.

For example, you can use the s3 table function to read data directly from an S3 bucket:

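A minimal sketch; the bucket URL and file are hypothetical:

```sql
-- Read Parquet data directly from S3 without creating a table first
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/data/events.parquet', 'Parquet')
LIMIT 10;
```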

To move this data into a ClickHouse table, you can use the following command:

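A sketch with the same hypothetical bucket, assuming the file’s columns match the target table’s schema:

```sql
-- Stream the S3 data straight into a ClickHouse table
INSERT INTO my_first_table
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/data/events.parquet', 'Parquet');
```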

Tips from the expert

Andrew Mills


Senior Solution Architect

Andrew Mills is an industry leader with extensive experience in open source data solutions and a proven track record in integrating and managing Apache Kafka and other event-driven architectures.

In my experience, here are tips that can help you better utilize ClickHouse for large-scale data processing and analytics:

  • Take advantage of ClickHouse’s asynchronous insert mode: If you’re dealing with high-velocity data ingestion, use the async_insert and async_insert_max_data_size settings to optimize throughput. These settings queue inserts asynchronously, reducing client-server latency.
  • Bulk inserts for better storage efficiency: Avoid frequent small inserts. Batch data into large blocks (100,000 rows or more) for insertion. This reduces the overhead of part merges and improves compression ratios, saving both storage space and write time.
  • Optimize MergeTree settings for your workload: Fine-tune parameters such as index_granularity, merge_with_ttl_timeout, and max_part_size_for_merge in the MergeTree engine. These settings can significantly improve performance for both read-heavy and write-heavy workloads.
  • Use materialized views for real-time aggregation: Materialized views can automate data pre-aggregation, which boosts query performance for real-time dashboards. They allow you to aggregate metrics as data is ingested, reducing the need for heavy on-the-fly calculations.
  • Leverage ClickHouse’s data skipping indices: Use primary and secondary indices (like minmax or bloom_filter indexes) to skip over large parts of the dataset during query execution. This can drastically reduce the amount of data that needs to be read and processed, especially in large tables.

Tutorial 2: Managing and analyzing a large dataset

In this section, we’ll explore how to manage and analyze large datasets using ClickHouse. We will use a real-world example by working with Amazon customer review data, which contains millions of records.
 

Creating and populating a table

  1. First, we’ll create a table to store the review data. This dataset includes details like the product, category, star rating, helpful and total votes, and whether the purchase was verified. Below is the SQL command to create the amazon_reviews table:

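A sketch of the table; column names and types are illustrative and should be adjusted to the actual dataset:

```sql
CREATE TABLE amazon_reviews
(
    review_date       Date,
    product_category  LowCardinality(String),
    product_id        String,
    product_title     String,
    star_rating       UInt8,
    helpful_votes     UInt32,
    total_votes       UInt32,
    verified_purchase Bool,
    review_headline   String,
    review_body       String
)
ENGINE = MergeTree
ORDER BY (product_category, review_date);
```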
    This table is designed to handle and store the data, leveraging the MergeTree engine, which is optimal for managing large datasets.

  2. Next, we’ll insert a large dataset into this table. The dataset consists of approximately two million rows of review data. Here’s how you can insert the data directly from files stored in an S3 bucket:

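A sketch with a hypothetical bucket; the glob pattern lets the s3() function read a set of Parquet files in one pass:

```sql
INSERT INTO amazon_reviews
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/amazon_reviews/*.snappy.parquet', 'Parquet');
```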
    This command reads data from the specified S3 URLs and inserts it into the amazon_reviews table. ClickHouse handles the large data volumes, ensuring the entire dataset is ingested quickly.

 

Analyzing data with ClickHouse

Once the data is loaded, you can begin performing various analytical queries:

  1. Start by computing the average star rating across all reviews:

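A sketch, assuming a star_rating column as in the table definition used in this tutorial:

```sql
SELECT round(avg(star_rating), 2) AS avg_rating
FROM amazon_reviews;
```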

  2. Next, let’s calculate the average number of helpful and total votes by category:

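A sketch, assuming product_category, helpful_votes, and total_votes columns:

```sql
SELECT product_category,
       avg(helpful_votes) AS avg_helpful,
       avg(total_votes)   AS avg_total
FROM amazon_reviews
GROUP BY product_category
ORDER BY avg_helpful DESC;
```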

  3. You can also perform more complex calculations, such as analyzing the helpful votes for verified and unverified purchases with a star rating of 4 or better:

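A sketch, assuming verified_purchase, helpful_votes, and star_rating columns:

```sql
SELECT verified_purchase,
       sum(helpful_votes) AS helpful,
       count() AS reviews
FROM amazon_reviews
WHERE star_rating >= 4
GROUP BY verified_purchase;
```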
    This query aggregates helpful votes and review counts for highly rated reviews, split by purchase-verification status.

 

Leveraging dictionaries and performing joins

ClickHouse supports the use of dictionaries, which are mappings of key-value pairs stored in memory, enabling faster lookups during queries:

  1. Create a dictionary based on a CSV file containing VERIFIED/UNVERIFIED key-value pairs:

    Note: the enum.csv file resides in the same folder as the data and has the following structure:

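A sketch; the dictionary name, attribute names, and the CSV contents shown in the comments are illustrative:

```sql
-- enum.csv (illustrative contents):
--   0,UNVERIFIED
--   1,VERIFIED
CREATE DICTIONARY verified_status
(
    id    UInt64,
    label String
)
PRIMARY KEY id
SOURCE(FILE(path 'enum.csv' format 'CSV'))
LAYOUT(FLAT())
LIFETIME(MIN 0 MAX 0);
```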
    This dictionary can then be used in queries to join with the amazon_reviews table or retrieve specific values.

  2. For example, to perform a join using the newly created dictionary:
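A sketch, assuming a dictionary named verified_status keyed on the verification flag and a table with verified_purchase and helpful_votes columns:

```sql
SELECT dictGet('verified_status', 'label', toUInt64(verified_purchase)) AS status,
       sum(helpful_votes) AS total_helpful
FROM amazon_reviews
GROUP BY status;
```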

    This query identifies the number of helpful votes by verification status by joining the table with a dictionary.

Efficiency and scalability amplified: The benefits of Instaclustr for ClickHouse

Instaclustr provides a range of benefits for ClickHouse, making it an excellent choice for organizations seeking efficient and scalable management of these deployments. With its managed services approach, Instaclustr simplifies the deployment, configuration, and maintenance of ClickHouse, enabling businesses to focus on their core applications and data-driven insights.

Some of these benefits are:

  • Infrastructure provisioning, configuration, and security, ensuring that organizations can leverage the power of this columnar database management system without the complexities of managing it internally. By offloading these operational tasks to Instaclustr, organizations can save valuable time and resources, allowing them to focus on utilizing ClickHouse to its full potential.
  • Seamless scalability to meet growing demands. With automated scaling capabilities, ClickHouse databases can expand or contract based on workload requirements, ensuring optimal resource utilization and cost efficiency. Instaclustr’s platform actively monitors the health of the ClickHouse cluster and automatically handles scaling processes, allowing organizations to accommodate spikes in traffic and scale their applications effectively.
  • High availability and fault tolerance for ClickHouse databases. By employing replication and data distribution techniques, Instaclustr ensures that data is stored redundantly across multiple nodes in the cluster, providing resilience against hardware failures and enabling continuous availability of data. Instaclustr’s platform actively monitors the health of the ClickHouse cluster and automatically handles failover and recovery processes, minimizing downtime and maximizing data availability for ClickHouse deployments.

Furthermore, Instaclustr’s expertise and support are invaluable for ClickHouse databases. Our team of experts has deep knowledge and experience in managing and optimizing ClickHouse deployments. We stay up-to-date with the latest advancements in ClickHouse technologies, ensuring that the platform is compatible with the latest versions and providing customers with access to the latest features and improvements. Instaclustr’s 24/7 support ensures that organizations have the assistance they need to address any ClickHouse-related challenges promptly.

For more information: