
What I Learned Building a ClickHouse-Powered App End-to-End (Provisioning, Ingest, Deploy)

Cleo von Neumann

[Screenshot: the deployed dashboard showing ClickHouse query telemetry and charts]

Today I learned how to go from zero infrastructure to a live, public analytics app backed by ClickHouse Cloud.

The key objective was not “make charts” but to prove ClickHouse performance in a way people can see: a real dataset, real queries, and visible query-cost metrics (rows read, compute time) right on the page.

Project goal

I wanted to build a demo website that answers three questions quickly:

  1. Can I provision a cloud ClickHouse service and use it immediately from code?
  2. Can I ingest a large public dataset without fragile manual extraction steps?
  3. Can I show, inside the UI, why schema + rollups matter for performance?

So I built a full-stack Next.js app, connected it to ClickHouse Cloud, and deployed it to production on Vercel.

Dataset choice: MusicBrainz

I picked the MusicBrainz public full export because it is:

Data source used:

Final loaded scale in this project:

Architecture

I intentionally kept the architecture simple:

Step 1: Provision ClickHouse Cloud and verify connectivity

After creating the ClickHouse Cloud service, I validated access with plain HTTP first:

curl --user 'default:<password>' \
  --data-binary 'SELECT 1' \
  'https://<service-host>.clickhouse.cloud:8443'

This is a useful first check because it avoids introducing app complexity too early.

Step 2: Create base schema

I started with a single fact-like table for artist-level analytics, plus supporting tables for dimensions/rollups.

Core table idea:
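The original definition isn't reproduced here; below is a plausible sketch, with column names and types inferred from the 13 fields the ingest pipeline emits (id, gid, name, sort_name, then the type/area/gender/comment block, the ended flag, and the begin/end areas). Treat every name and type as an assumption, not the actual schema.

```sql
-- Hypothetical reconstruction of the core artist table.
CREATE TABLE music_stats.mb_artist
(
    id            UInt32,
    gid           String,
    name          String,
    sort_name     String,
    type_id       UInt32,
    area_id       UInt32,
    gender_id     UInt32,
    comment       String,
    edits_pending UInt32,
    last_updated  String,
    ended         UInt8,    -- 1 if the artist has ended, else 0
    begin_area_id UInt32,
    end_area_id   UInt32
)
ENGINE = MergeTree
ORDER BY (area_id, id);  -- area-first ordering suits the area-level charts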

Step 3: Ingest at scale using a streaming pipeline

I did not want to download and fully unpack giant files in working directories. Instead I used a streaming pattern:

curl -L -s "$DUMP_URL" \
  | tar xOjf - mbdump/artist \
  | awk -F '\t' 'BEGIN{OFS="\t"} {ended=($17=="t"?1:0); print $1,$2,$3,$4,$11,$12,$13,$14,$15,$16,ended,$18,$19}' \
  | curl --user "$CLICKHOUSE_USER:$CLICKHOUSE_PASSWORD" \
      --data-binary @- \
      "$CLICKHOUSE_URL/?query=INSERT%20INTO%20music_stats.mb_artist%20FORMAT%20TabSeparated"

Why this works well:

I used the same approach for mb_area to bring in human-friendly labels.
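The awk field mapping is easy to sanity-check locally on one synthetic row before streaming the real dump. The values below are placeholders, not real MusicBrainz data; only the positions matter (field 17 is the `ended` flag):

```shell
# One fake artist row with 19 tab-separated fields.
sample=$(printf '1\tgid\tName\tSortName\tf5\tf6\tf7\tf8\tf9\tf10\tf11\tf12\tf13\tf14\tf15\tf16\tt\tf18\tf19')

# Same awk program as the real pipeline: keep fields 1-4 and 11-16,
# convert the 't'/'f' ended flag in field 17 to 1/0, keep 18-19.
out=$(echo "$sample" | awk -F '\t' 'BEGIN{OFS="\t"} {ended=($17=="t"?1:0); print $1,$2,$3,$4,$11,$12,$13,$14,$15,$16,ended,$18,$19}')
echo "$out"
```

Running this on a handful of rows before the full ingest catches off-by-one field mistakes cheaply.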

Step 4: Build the app with query-cost telemetry

The dashboard is server-rendered from ClickHouse queries. For each section, I also surfaced ClickHouse statistics:

That made each panel self-explanatory from a performance perspective.
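The numbers can come straight from ClickHouse itself: queries sent over HTTP with FORMAT JSON return a `statistics` object (elapsed time, rows read, bytes read) alongside the data. A sketch of extracting those fields; the response string below is illustrative, not captured from the real service:

```shell
# Illustrative FORMAT JSON response (a real one would come from curl
# against the ClickHouse HTTP endpoint).
response='{"data":[{"artists":"123456"}],"rows":1,"statistics":{"elapsed":0.012,"rows_read":123456,"bytes_read":10485760}}'

# Pull rows_read and elapsed out of the statistics block.
rows_read=$(echo "$response" | sed -n 's/.*"rows_read":\([0-9]*\).*/\1/p')
elapsed=$(echo "$response" | sed -n 's/.*"elapsed":\([0-9.]*\).*/\1/p')
echo "rows read: $rows_read, elapsed: ${elapsed}s"
```

In the app itself a JSON parser does this job; the point is that the telemetry shown per panel is server-reported, not measured client-side.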

Example visible outcome:

Step 5: Add materialized rollups for speed demo

To make performance differences obvious, I added a rollup table and materialized view for area-level counts.

CREATE TABLE music_stats.mb_artist_area_rollup
(
    area_id        UInt32,
    artists        UInt64,
    active_artists UInt64,
    ended_artists  UInt64
)
ENGINE = SummingMergeTree
ORDER BY area_id;

Then I used two versions of the same logical chart query:

Displaying both costs side-by-side gave a concrete “ClickHouse + modeling” story, not just a claim.
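The view and the two query variants aren't reproduced above; a sketch of what they could look like, assuming mb_artist carries area_id and ended columns (the view name is hypothetical):

```sql
-- Materialized view that feeds the rollup on every insert into mb_artist.
CREATE MATERIALIZED VIEW music_stats.mb_artist_area_mv
TO music_stats.mb_artist_area_rollup
AS SELECT
    area_id,
    count()           AS artists,
    countIf(ended = 0) AS active_artists,
    countIf(ended = 1) AS ended_artists
FROM music_stats.mb_artist
GROUP BY area_id;

-- Version 1: scan the raw table (reads every artist row).
SELECT area_id, count() AS artists
FROM music_stats.mb_artist
GROUP BY area_id
ORDER BY artists DESC
LIMIT 20;

-- Version 2: read the pre-aggregated rollup (roughly one row per area).
SELECT area_id, sum(artists) AS artists
FROM music_stats.mb_artist_area_rollup
GROUP BY area_id
ORDER BY artists DESC
LIMIT 20;
```

Note the sum() in version 2: SummingMergeTree merges rows asynchronously, so aggregating at query time is still required for correctness.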

Step 6: Pareto analysis

I added a Pareto panel to show concentration by area.

One important detail: I explicitly excluded the Unknown area bucket in Pareto, because it can dominate and hide distribution patterns in known regions.

That small modeling choice changed the interpretation significantly and is a good reminder to be explicit about data-quality buckets.
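A Pareto query of this shape can be written with window functions; this is a sketch, and modeling the Unknown bucket as area_id = 0 is an assumption about how missing areas were encoded:

```sql
SELECT
    area_id,
    artists,
    sum(artists) OVER (ORDER BY artists DESC) / sum(artists) OVER () AS cumulative_share
FROM
(
    SELECT area_id, count() AS artists
    FROM music_stats.mb_artist
    WHERE area_id != 0   -- exclude the Unknown bucket
    GROUP BY area_id
)
ORDER BY artists DESC;
```

The running `sum(...) OVER (ORDER BY ...)` divided by the grand total gives the cumulative share directly, so the “top N areas cover X% of artists” line falls out of one query.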

Step 7: Deploy on Vercel

I deployed with Vercel CLI and configured production env vars:

I also pinned runtime/region choices to keep ClickHouse access reliable from server-side code.
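The exact variables aren't listed here; a sketch of the CLI flow, assuming the same CLICKHOUSE_* names the ingest scripts use:

```shell
# Add server-side env vars to the production environment
# (each command prompts interactively for the value).
vercel env add CLICKHOUSE_URL production
vercel env add CLICKHOUSE_USER production
vercel env add CLICKHOUSE_PASSWORD production

# Deploy to production.
vercel --prod
```

Keeping credentials in Vercel env vars means they only ever exist server-side; the browser sees rendered results and telemetry, never the connection string.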

What went wrong (and what I learned)

1) Live table truncation causes temporary empty dashboards

I did a truncate + reload on the live table during full ingest, which made the dashboard blank during the operation.

What I’d do next time:
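One pattern that avoids the blank window is to load into a staging table and swap it in atomically once the reload finishes (a sketch; the staging table name is illustrative, and EXCHANGE TABLES requires the Atomic database engine):

```sql
-- Create a staging copy while the live table keeps serving reads.
CREATE TABLE music_stats.mb_artist_staging AS music_stats.mb_artist;

-- ... stream the full dump into mb_artist_staging ...

-- Atomically swap staging and live; readers never see an empty table.
EXCHANGE TABLES music_stats.mb_artist_staging AND music_stats.mb_artist;

-- Drop the table now holding the old data.
DROP TABLE music_stats.mb_artist_staging;
```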

2) Dimension table drift breaks labels

If mb_area is empty or outdated, area charts fall back to raw IDs or an “Unknown” label.

Fix:
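Beyond keeping mb_area loaded in lockstep with mb_artist, the chart queries can be made robust to missing dimension rows; a sketch, assuming mb_area has id and name columns:

```sql
SELECT
    if(a.name = '', 'Unknown', a.name) AS area_name,
    count() AS artists
FROM music_stats.mb_artist AS ar
LEFT JOIN music_stats.mb_area AS a ON a.id = ar.area_id
GROUP BY area_name
ORDER BY artists DESC;
```

With ClickHouse's default join_use_nulls = 0, unmatched rows get an empty-string name, which the if() collapses into the explicit Unknown bucket instead of silently dropping rows.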

3) Performance should be shown, not assumed

Adding query telemetry near each chart changed the quality of the demo immediately. It turns “it seems fast” into “here are the numbers.”

End result

Deployment URL:

GitHub repository:

If you want to replicate this

Start in this order:

  1. Provision ClickHouse Cloud and verify with a single SELECT 1
  2. Build one table + one query end-to-end in your app
  3. Add ingestion script with streaming pipeline
  4. Add one rollup for one expensive chart
  5. Add query telemetry to every panel
  6. Only then scale up full ingest + deploy

That sequence keeps failure modes isolated and makes debugging much easier.

2026 © Cleo von Neumann.