In the last installment of the pipeline blog series, we explored writing streaming JSON data into PostgreSQL using Kafka Connect. In this blog, it’s time to find out if our fears of encountering mutant monsters in the gloom of an abandoned uranium mine were warranted or not.

1. What Is Apache Superset?

What is Apache Superset? Well, it could be a set of (possibly mutant) superheroes (supposedly, many superheroes got their powers from mutations, and biological mutations are actually normal and not a byproduct of radiation)?

(Source: Shutterstock)

Or perhaps it could be related to a mathematical superset? B is a superset of A if B contains all the elements of A (B ⊇ A). For example, if A is the set of all superheroes that use weapons, and B is the set of all superheroes, then B is the superset of A:

However, my guess is that the mathematical idea inspired Apache Superset as the project page proclaims that:

“Apache Superset is a modern data exploration and visualization platform that is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.”

Here are some interesting points I discovered about Superset:

  • It is based on SQL and can connect to any SQL database that is supported by SQL Alchemy. So at least in theory, this gives good interoperability with a large number of data sources (but also excludes some common NoSQL databases, unfortunately).
  • But what if you aren’t an SQL guru? It supports a zero-code visualization builder, but also a more detailed SQL Editor.
  • Superset supports more than 50 built-in charts offering plenty of choices for how you visualize your data.
  • The main Superset concepts are databases, data sources, charts, and dashboards.
  • It’s written in Javascript and Python.
  • The focus is on “BI” (Business Intelligence) rather than machine metrics (c.f. Prometheus/Grafana) or scientific data visualization (c.f. https://matplotlib.org/ – which helped produce the 1st photograph of a black hole).
  • Superset started out in Airbnb, and graduated to being a top-level Apache project this year (2021), so it’s timely to be evaluating it now.

In this blog, I wanted to get Apache Superset working with PostgreSQL. Some of the challenges I expected were:

  1. How easy is it to deploy Apache Superset?
  2. Can Apache Superset be connected to PostgreSQL (specifically, a managed Instaclustr PostgreSQL service, I had access to a preview service)?
  3. Is it possible to chart JSONB column data types in Apache Superset?
  4. Does Apache Superset have appropriate chart types to display the basic NOAA tidal data and also map the locations?

2. Deploying Apache Superset

So far in the pipeline series I’ve been used to the ease of spinning up managed services on the Instaclustr platform (e.g. Elasticsearch and Kibana). However, we currently don’t have Superset on our platform, so I was going to have to install and run it on my laptop (a Mac).  I’m also used to the ease of deploying Java-based applications (get a jar, run jar). However, Superset is written in Python, which I have no prior experience with. I, therefore, tried two approaches to deploying it.

Approach 1

I initially tried to install Superset from scratch using “pip install apache-superset”. However, this failed with too many errors.

Approach 2

My second approach used Docker, which eventually (it was very slow to build) worked. I just used the built-in Mac Docker environment which worked fine—but don’t forget to increase the available memory from the default of 2GB (which soon fails) to at least 6GB. It’s also very resource-hungry while running, using 50% of the available CPU.

Superset runs in a browser at  https://localhost:8088

Don’t forget to change the default user/password from admin/admin to something else.

So, given our initial false start, and by using Docker, it was relatively easy to deploy and run Superset.

3. Connecting Superset to PostgreSQL

In Superset, if you go into “Data->Databases” you find that there’s already a PostgreSQL database and drivers installed and running in the Docker version. It turns out that Superset uses PostgreSQL as the default metadata store (and for the examples).

But I wanted to use an external PostgreSQL database, and that was easy to configure on Superset following this documentation.

The PostgreSQL URL format to connect to a fully managed Instaclustr enterprise-grade service for PostgreSQL looks like this (with the pguser, pgpassword and pgip address obtained from the Instaclustr Console):

postgresql://pguser:pgpassword@pgip:5432/postgresl

Superset has the concepts of databases and datasets. After connecting to the Instaclustr managed PostgreSQL database and testing the connection you “Add” it, and then a new database will appear. After configuring a database you can add a dataset, by selecting a database, schema, and table.

So, the second challenge was also easy, and we are now connected to the external PostgreSQL database.