Included Benchmarks
PBench ships with ready-to-run benchmarks in the benchmarks/ directory. Each benchmark includes SQL queries, stage configuration files for various scale factors and storage formats, and (where applicable) data generation scripts.
TPC-H is a decision-support benchmark with 22 queries over a relational schema of 8 tables (lineitem, orders, part, supplier, etc.).
Directory: benchmarks/tpch/
22 SQL queries in queries/ (query_01.sql – query_22.sql).
| File | Description |
|---|---|
| tpch.json | All 22 queries with expected row counts (SF1000) |
| File | Scale Factor | Catalog / Schema |
|---|---|---|
| sf1.json | 1 GB | TPCH connector, schema sf1 |
| sf10.json | 10 GB | Hive, tpch_sf10_parquet |
| sf100.json | 100 GB | Hive, tpch_sf100_parquet |
| sf1k.json | 1 TB | Hive, tpch_sf1000_parquet |
| sf10k.json | 10 TB | Hive, tpch_sf10000_parquet |
| sf100k.json | 100 TB | Hive, tpch_sf100000_parquet |
| File | Format | Notes |
|---|---|---|
| sf1k_ice.json | Iceberg | optimizer_use_histograms: true |
| sf1k_ice_par.json | Iceberg (partitioned) | |
| sf1k_delta_symlink.json | Delta (symlink) | |
| sf1k_delta_symlink_par.json | Delta (symlink, partitioned) | |
| sf100-trino.json | Iceberg (Trino) | Includes all 22 queries inline, save_json: true |
42 stream files in streams/ (stream_01.json – stream_42.json). Each stream runs all 22 queries in a different order with start_on_new_client: true, enabling concurrent throughput testing.
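Conceptually, each stream file just fixes a per-stream permutation of the 22 queries. A minimal sketch of how such a permutation could be generated (the JSON field names `queries` and `start_on_new_client` are assumptions for illustration, not pbench's actual stage-file schema):

```python
# Sketch: generate a permuted query order like a stream_NN.json file.
# The stage-file shape ("queries" + "start_on_new_client") is assumed
# for illustration; pbench's real schema may differ.
import json
import random


def make_stream(stream_id: int, n_queries: int = 22) -> dict:
    order = [f"query_{i:02d}" for i in range(1, n_queries + 1)]
    # Seed with the stream id so each stream gets a stable, distinct order.
    random.Random(stream_id).shuffle(order)
    return {"queries": order, "start_on_new_client": True}


stream = make_stream(7)
print(json.dumps(stream, indent=2))
```

Seeding the shuffle with the stream id keeps each stream's order reproducible across runs while still differing between streams.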
```shell
# Power test: 1 cold + 2 warm runs of all 22 queries at SF1000
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpch/tpch.json benchmarks/tpch/sf1k.json \
    benchmarks/java_oss.json benchmarks/c1w2.json

# Throughput test: 4 concurrent streams
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpch/streams/stream_{01,02,03,04}.json benchmarks/tpch/sf1k.json \
    benchmarks/java_oss.json benchmarks/c1w2.json
```

TPC-DS is a decision-support benchmark with 99 queries over a retail sales schema of 24 tables.
Directory: benchmarks/tpc-ds/
99 SQL queries plus 5 ordered variants in queries/. The ordered variants (query_36_ordered, query_65_ordered, query_71_ordered, query_73_ordered, query_77_ordered) add deterministic ORDER BY clauses for result comparison.
| File | Description |
|---|---|
| ds_power.json | All 99 queries with expected row counts at multiple scale factors |
| ds_full.json | All 104 queries (99 + 5 ordered variants) |
| ds_atomic.json | 44 queries testing individual SQL operations (joins, aggregations, set operations) |
| ds_subset.json | Subset of queries |
| ds_rand5.json | Randomly execute 5 queries from the pool |
| ds_rand15m.json | Random execution for 15 minutes |
| ds_rand50.json | Randomly execute 50 queries |
| File | Scale Factor | Catalog / Schema |
|---|---|---|
| sf1.json | 1 GB | TPC-DS connector, schema sf1 |
| sf10.json | 10 GB | Hive, tpcds_sf10_parquet_varchar |
| sf100.json | 100 GB | Hive, tpcds_sf100_parquet_v2 |
| sf1k.json | 1 TB | Hive, tpcds_sf1000_parquet_v2 |
| sf10k.json | 10 TB | Hive, tpcds_sf10000_parquet |
| sf30k.json | 30 TB | Hive, tpcds_sf30000_parquet |
| sf100k.json | 100 TB | Hive, tpcds_sf100000_parquet |
| File | Format | Notes |
|---|---|---|
| sf1k_ice.json | Iceberg | |
| sf1k_ice_par.json | Iceberg (partitioned) | |
| sf1k_ice_uncompressed.json | Iceberg (uncompressed) | |
| sf10k_ice.json | Iceberg | |
| sf10k_ice_par.json | Iceberg (partitioned) | |
| sf30k_ice.json | Iceberg | |
| sf1k_par.json | Parquet (partitioned) | |
| sf10k_par.json | Parquet (partitioned) | |
| sf100k_par.json | Parquet (partitioned) | |
| sf10k_dwrf.json | DWRF (ORC) | |
23 stream files in streams/ (stream_01.json – stream_23.json). Each stream runs all 99 queries in a different order for concurrent throughput testing.
```shell
# Power test at SF10000
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpc-ds/ds_power.json benchmarks/tpc-ds/sf10k.json \
    benchmarks/native_oss.json benchmarks/c1w2.json

# Random 50 queries at SF1000
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpc-ds/ds_rand50.json benchmarks/tpc-ds/sf1k.json \
    benchmarks/native_oss.json benchmarks/c1w2.json
```

ClickBench is an OLAP benchmark with 43 queries over a single wide hits table of web analytics data.
Directory: benchmarks/clickbench/
43 SQL queries in queries/ (query_01.sql – query_43.sql).
| File | Description |
|---|---|
| clickbench.json | All 43 queries with expected row counts; sets offset_clause_enabled: true |

Schema: clickbench_parquet.
```shell
pbench run -s http://localhost:8080 -o results \
    benchmarks/clickbench/clickbench.json \
    benchmarks/java_oss.json benchmarks/c1w2.json
```

The Join Order Benchmark (JOB) uses the IMDB dataset to evaluate join ordering and cardinality estimation. It has 113 queries with complex multi-way joins.
Directory: benchmarks/imdb/
113 SQL queries in queries/, named by group and variant (1a.sql, 1b.sql, ..., 33c.sql).
| File | Description |
|---|---|
| imdb.json | All 113 queries; schema imdb |
```shell
pbench run -s http://localhost:8080 -o results \
    benchmarks/imdb/imdb.json \
    benchmarks/java_oss.json benchmarks/c1w2.json
```

Top-level JSON files in benchmarks/ configure engine settings and execution parameters. They are designed to be composed with benchmark stage files via multiple -f arguments or positional arguments.
These set catalog, session parameters, and pushdown settings for different Presto/Trino variants:
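As a rough illustration only, an engine stage file might pair a catalog with session properties along these lines (every key name below is an assumption, not pbench's actual schema; consult the real files in benchmarks/ for the true shape):

```json
{
  "catalog": "hive",
  "session_properties": {
    "join_distribution_type": "AUTOMATIC"
  }
}
```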
| File | Engine | Catalog |
|---|---|---|
| java_oss.json | Java OSS | Hive |
| native_oss.json | Native OSS | Hive |
| java_blueray.json | Java BlueRay | Hive |
| native_blueray.json | Native BlueRay | Hive |
| java_trino.json | Trino | Hive |
| java_oss_glue.json | Java OSS | Glue |
| native_oss_glue.json | Native OSS | Glue |
| java_blueray_glue.json | Java BlueRay | Glue |
| native_blueray_glue.json | Native BlueRay | Glue |
| File | Description |
|---|---|
| c1w2.json | 1 cold run, 2 warm runs |
| abort_on_error.json | Stop on first query failure |
| save_output.json | Save query result output |
| save_json.json | Save query info JSON |
| save_colmd.json | Save column metadata |
Stage files are merged left-to-right, so later files override earlier ones. A typical invocation combines a benchmark, a scale factor, an engine config, and execution settings:
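The override behavior can be pictured as a left-to-right dictionary merge. A minimal sketch of those semantics, assuming a shallow key-wise merge (pbench's actual merge may recurse into nested objects, and the keys below are illustrative, not the real stage-file schema):

```python
# Sketch of left-to-right stage-file merging: later files win on
# conflicting keys. Shallow merge assumed for illustration.
from functools import reduce


def merge_stages(*stages: dict) -> dict:
    return reduce(lambda acc, stage: {**acc, **stage}, stages, {})


benchmark = {"schema": "sf1", "runs": 1}            # illustrative keys
scale = {"schema": "tpcds_sf1000_parquet_v2"}       # overrides schema
execution = {"runs": 3}                             # overrides runs

merged = merge_stages(benchmark, scale, execution)
print(merged)
```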
```shell
# Compose: benchmark + queries, scale factor / schema,
# engine + session params, execution settings
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpc-ds/ds_power.json \
    benchmarks/tpc-ds/sf1k.json \
    benchmarks/native_oss.json \
    benchmarks/c1w2.json
```

benchmarks/scripts/ contains Python utilities for cache management and database connectivity:
| Script | Description |
|---|---|
| presto_utils.py | Presto/Trino HTTPS connection and query helpers |
| mysql_utils.py | MySQL connection and query helpers |
| system_utils.py | SSH remote command execution (Paramiko) |
| cache_cleaning_coordinator.py | Clear Hive/Iceberg metadata caches on the coordinator |
| cache_cleaning_workers.py | Clear SSD, page, and memory caches on workers via SSH |