AWS S3 Integration

ClickHouse clusters can be integrated with AWS S3 to enable access to specific AWS S3 regions or all regions.

Once a region is integrated, the cluster can use the S3 table functions and engines to read data from and write data to AWS S3.

Clusters on the NetApp Instaclustr managed platform are secured through egress firewall rules to protect against data exfiltration. Integrating with AWS S3 adds a whitelist rule to the firewall enabling access. Consider the security risk before enabling an AWS S3 integration.

How To Enable

The following steps explain how to integrate a ClickHouse cluster with AWS S3.

  1. First, select the “Integrations” option in the console. The page will show existing integrations. Clusters provisioned in AWS are integrated with their provisioned region by default, to support backups; the integration with the backup region cannot be deleted.
  2. Select “Add New Integration” to configure a new AWS S3 integration.
  3. For type select “AWS S3” then specify the region to integrate with, or all regions. Selecting “All Regions” will configure access to all AWS S3 regions.
  4. Finally press “Add” to configure the integration.
  5. The Integrations table now shows the newly configured integration. An integration can be deleted by pressing the “Delete” button, disabling access to the region.

How To Use ClickHouse S3 and S3Queue Table Engines

ClickHouse’s S3 and S3Queue table engines provide robust mechanisms for working with large datasets stored in S3. By leveraging these engines, you can efficiently manage and query your data directly from ClickHouse. Brief usage examples are included below.

For detailed information, refer to the official ClickHouse documentation.

S3 Table Engine

The S3 table engine allows you to create tables that read from and write to S3.

Creating an S3 Table

To create a table using the S3 engine, you need to specify the S3 URL, access credentials, and the format of the data. Here is an example:
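The following is a minimal sketch; the bucket URL, credentials, and column names are placeholders:

```sql
-- Hypothetical bucket in us-east-1; replace the URL, credentials, and schema.
CREATE TABLE s3_table
(
    id UInt32,
    name String,
    value Float64
)
ENGINE = S3('https://my-bucket.s3.us-east-1.amazonaws.com/data/data.csv',
            'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'CSV');
```

Note that wildcard paths (e.g. `*.csv`) are supported for reading only; a table that is also written to should point at a single object.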

Loading Data

Load data into the S3 table by inserting data directly:
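For example, assuming a hypothetical S3-backed table named `s3_table` with columns `id`, `name`, and `value`:

```sql
-- Each INSERT writes the rows out to the object backing the table.
INSERT INTO s3_table VALUES (1, 'alpha', 3.14), (2, 'beta', 2.72);
```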

Querying Data

Query data from the S3 table as you would with any other table:
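A sketch, again assuming a hypothetical S3-backed table named `s3_table`:

```sql
-- Reads are streamed from S3 transparently.
SELECT name, value
FROM s3_table
WHERE id = 1;
```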

S3Queue Table Engine

The S3Queue table engine in ClickHouse is designed for streaming data import from S3-compatible storage, allowing continuous processing of new files as they are uploaded to an S3 bucket. This tutorial will guide you through setting up a data pipeline using S3Queue.

Creating an S3Queue Table

First, create a table using the S3Queue engine:
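A minimal sketch, with a placeholder bucket path and credentials. The `mode` setting is required by recent ClickHouse versions; `'unordered'` processes files in any order:

```sql
-- Hypothetical path; new files matching the wildcard are consumed as they arrive.
CREATE TABLE s3queue_source
(
    id UInt32,
    event String
)
ENGINE = S3Queue('https://my-bucket.s3.us-east-1.amazonaws.com/events/*.csv',
                 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'CSV')
SETTINGS mode = 'unordered';
```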

This table will serve as the source for data streamed from S3.

Create a Destination Table

Next, create a destination table using a MergeTree engine:
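For example, mirroring the same hypothetical two-column schema:

```sql
-- Durable destination table; MergeTree stores the ingested rows locally.
CREATE TABLE events
(
    id UInt32,
    event String
)
ENGINE = MergeTree
ORDER BY id;
```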

This table will store the processed data from the S3Queue table.

Create the Materialized View

Set up a materialized view to automatically process data from the S3Queue table and insert it into the destination table:
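A sketch, assuming hypothetical source and destination tables named `s3queue_source` and `events`:

```sql
-- The view consumes rows from the S3Queue table and writes them to events.
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT id, event
FROM s3queue_source;
```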

This materialized view will start collecting data in the background as soon as it’s created.

Query the Destination Table

You can now query the destination table to access the processed data:
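For example, against a hypothetical destination table named `events`:

```sql
-- Verify rows are arriving, then inspect them.
SELECT count() FROM events;

SELECT *
FROM events
ORDER BY id
LIMIT 10;
```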

This query will return the data that has been streamed from S3 and processed through the S3Queue engine.

By following these steps, you’ve set up a continuous data ingestion pipeline from S3 to ClickHouse using the S3Queue engine. As new files are added to the specified S3 path, they will be automatically processed and inserted into your destination table.

Authentication and Using the NOSIGN Option

When working with S3 data sources, authentication is necessary to access non-public buckets. You have two main options for accessing S3:

Public Data Source with NOSIGN

If you are accessing publicly shared datasets, you can use the NOSIGN option, which allows you to bypass authentication. This is useful because ClickHouse otherwise tries to fetch credentials from various sources, which can sometimes cause issues with public buckets, leading to a 403 (Forbidden) error. The NOSIGN option forces the client to ignore all credentials and not sign the requests. Here is an example:
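A sketch using the s3 table function, with the NOSIGN keyword in place of credentials; the bucket URL is a placeholder for a public dataset:

```sql
-- NOSIGN replaces the access key and secret key arguments; requests are unsigned.
SELECT *
FROM s3('https://public-bucket.s3.us-east-1.amazonaws.com/data/*.csv', NOSIGN, 'CSV')
LIMIT 10;
```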

Private Data Source with IAM Role

If you are accessing private S3 buckets, you can add the cluster’s IAM role to the bucket to grant access. This method leverages AWS IAM roles to manage permissions securely. All Instaclustr Managed clusters are configured with an IAM role which can be used.

Grant Bucket Access:

Update your S3 bucket policy to include permissions for the IAM role. Here is an example policy:
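A sketch of such a policy, with placeholder account ID, role name, and bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/my-cluster-role"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-private-bucket",
        "arn:aws:s3:::my-private-bucket/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN itself, while `s3:GetObject` and `s3:PutObject` apply to the objects within it, which is why both resource forms are listed.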

Create S3 Table

You can now create and query S3 tables without specifying credentials directly in the SQL command:
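A sketch, assuming a hypothetical private bucket; when the credential arguments are omitted, ClickHouse falls back to environment and instance credentials, including the attached IAM role:

```sql
-- No access key or secret key: access is granted via the cluster's IAM role.
CREATE TABLE s3_private
(
    id UInt32,
    payload String
)
ENGINE = S3('https://my-private-bucket.s3.us-east-1.amazonaws.com/data/data.csv', 'CSV');
```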

By using IAM roles, you enhance security and simplify management of access credentials.

By Instaclustr Support