Impala TPC-DS Run Comparison

Interactive comparison of three benchmark suite runs, with per-query and parallel timings. Use the controls to switch modes, filter queries, and inspect the SQL and sample result rows.

AWS DataHub (S3-backed)

Cloud environment running Cloudera DataHubs on AWS with data stored in S3.

STACKIT Local Disks (Ozone)

STACKIT environment using local disk-backed storage (Ozone profile).

STACKIT S3

STACKIT environment using STACKIT's S3 object storage profile.

Infrastructure and Methodology

The three benchmark variants are designed to compare storage and platform behavior while keeping Impala execution limits as comparable as possible.

Impala Compute Profile on STACKIT

  • Impala runs in CDW on Kubernetes to decouple executor placement from data locality.
  • 10 executors, each with a 45 GB memory limit, running in 6 vCPU pods.
  • Executor pods run on ECS worker nodes.
  • For Ozone runs, benchmark data is hosted on Ozone running on the base cluster's 4 worker nodes.
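The aggregate capacity implied by the compute profile above can be worked out with a short sketch. It assumes one pod per executor and reads "6 vCPU pods" as 6 vCPUs per executor pod; the class and field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorProfile:
    """Illustrative container for the executor sizing listed above."""
    executors: int       # number of Impala executors
    mem_limit_gb: int    # memory limit per executor, in GB
    vcpus_per_pod: int   # assumption: one pod per executor

    def total_memory_gb(self) -> int:
        # Sum of the per-executor memory limits across the cluster
        return self.executors * self.mem_limit_gb

    def total_vcpus(self) -> int:
        # Sum of the per-pod vCPU allocations across the cluster
        return self.executors * self.vcpus_per_pod


# Values taken from the STACKIT compute profile above
stackit = ExecutorProfile(executors=10, mem_limit_gb=45, vcpus_per_pod=6)

if __name__ == "__main__":
    print(stackit.total_memory_gb(), "GB aggregate executor memory limit")
    print(stackit.total_vcpus(), "vCPUs across executor pods")
```

Under these assumptions the executors together have a 450 GB memory limit and 60 vCPUs.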

Storage Variants

  • Ozone run: Ozone with 3x replication and a filesystem-optimized layout.
  • Block storage profile: block devices in the ~100 MB/s throughput class, approximating enterprise HDDs.
  • STACKIT S3 run: same compute setup as Ozone, data placed on STACKIT S3.

AWS DataHub (Data Mart)

  • 10 Impala executors with 45 GB memory limit per executor.
  • Data stored in S3.
  • Used as cross-platform comparison against STACKIT runs.

Common Impala Tuning

  • NUM_SCANNER_THREADS=4 to normalize scan-side CPU pressure across platforms.
  • MT_DOP=4 to keep intra-node parallelism aligned.
  • Table statistics are precomputed before the runs so that catalog/coordinator metadata is already warm when queries execute.
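As a minimal sketch of how the tuning above could be expressed, the following builds the Impala statements involved: COMPUTE STATS for each table, and SET statements for the two query options. The helper names are hypothetical, the table names are two standard TPC-DS tables chosen as examples, and how the benchmark harness actually submits these statements is not described in this report.

```python
# Query options listed under "Common Impala Tuning"
QUERY_OPTIONS = {"NUM_SCANNER_THREADS": 4, "MT_DOP": 4}


def compute_stats_statements(tables):
    """Return one COMPUTE STATS statement per table (run once, before the suite)."""
    return [f"COMPUTE STATS {table};" for table in tables]


def set_option_statements(options=QUERY_OPTIONS):
    """Return one SET statement per query option (run at the start of each session)."""
    return [f"SET {name}={value};" for name, value in options.items()]


if __name__ == "__main__":
    # Example table names only; the full suite covers all TPC-DS tables.
    for stmt in compute_stats_statements(["store_sales", "date_dim"]):
        print(stmt)
    for stmt in set_option_statements():
        print(stmt)
```

Keeping statistics collection separate from the session options reflects that statistics are computed once ahead of time, while query options apply per session.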

Per-query runtime comparison

Hover a bar for rows/sample output. Click a query to load SQL and detailed run info below.

Selected query details

Click a query bar to inspect SQL text, row counts, and up to 10 sample result rows.

Aggregated runtime by environment

Simple total runtime comparison across environments for the selected mode.
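The aggregation can be sketched as a sum of runtimes grouped by environment. This assumes the total is the plain sum of per-query runtimes for the selected mode; the parallel mode may instead be measured as wall-clock time, which this report does not specify. The function name, the tuple layout, and the sample values are placeholders for illustration, not measured results.

```python
from collections import defaultdict


def total_runtime_by_environment(timings):
    """Sum runtimes per environment.

    timings: iterable of (environment, query_id, seconds) tuples.
    Returns a dict mapping environment name to total seconds.
    """
    totals = defaultdict(float)
    for environment, _query_id, seconds in timings:
        totals[environment] += seconds
    return dict(totals)


# Placeholder values only; these are NOT benchmark measurements.
sample_timings = [
    ("env_a", "q1", 2.0),
    ("env_a", "q2", 3.5),
    ("env_b", "q1", 1.0),
    ("env_b", "q2", 4.0),
]

if __name__ == "__main__":
    for environment, total in total_runtime_by_environment(sample_timings).items():
        print(f"{environment}: {total:.1f} s")
```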