Included Benchmarks
PBench ships with ready-to-run benchmarks in the benchmarks/ directory. Each benchmark includes SQL queries, stage configuration files for various scale factors and storage formats, and (where applicable) data generation scripts.
TPC-H is a decision-support benchmark with 22 queries over a relational schema of 8 tables (lineitem, orders, part, supplier, etc.).
Directory: benchmarks/tpch/
22 SQL queries in queries/ (query_01.sql – query_22.sql).
| File | Description |
|---|---|
| tpch.json | All 22 queries with expected row counts (SF1000) |
| File | Scale Factor | Catalog / Schema |
|---|---|---|
| sf1.json | 1 GB | TPCH connector, schema sf1 |
| sf10.json | 10 GB | Hive, tpch_sf10_parquet |
| sf100.json | 100 GB | Hive, tpch_sf100_parquet |
| sf1k.json | 1 TB | Hive, tpch_sf1000_parquet |
| sf10k.json | 10 TB | Hive, tpch_sf10000_parquet |
| sf100k.json | 100 TB | Hive, tpch_sf100000_parquet |
| File | Format | Notes |
|---|---|---|
| sf1k_ice.json | Iceberg | optimizer_use_histograms: true |
| sf1k_ice_par.json | Iceberg (partitioned) | |
| sf1k_delta_symlink.json | Delta (symlink) | |
| sf1k_delta_symlink_par.json | Delta (symlink, partitioned) | |
| sf100-trino.json | Iceberg (Trino) | Includes all 22 queries inline, save_json: true |
42 stream files in streams/ (stream_01.json – stream_42.json). Each stream runs all 22 queries in a different order with start_on_new_client: true, enabling concurrent throughput testing.
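Conceptually, each stream file just fixes a per-stream permutation of the 22 queries. A minimal sketch of how such a permutation could be generated (the JSON field names `queries` and `start_on_new_client` are assumptions for illustration, not pbench's actual stage-file schema):

```python
# Sketch: generate a permuted query order like a stream_NN.json file.
# The stage-file shape ("queries" + "start_on_new_client") is assumed
# for illustration; pbench's real schema may differ.
import json
import random


def make_stream(stream_id: int, n_queries: int = 22) -> dict:
    order = [f"query_{i:02d}" for i in range(1, n_queries + 1)]
    # Seed with the stream id so each stream gets a stable, distinct order.
    random.Random(stream_id).shuffle(order)
    return {"queries": order, "start_on_new_client": True}


stream = make_stream(7)
print(json.dumps(stream, indent=2))
```

Seeding the shuffle with the stream id keeps each stream's order reproducible across runs while still differing between streams.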
```shell
# Power test: 1 cold + 2 warm runs of all 22 queries at SF1000
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpch/tpch.json benchmarks/tpch/sf1k.json \
    benchmarks/java_oss.json benchmarks/c1w2.json

# Throughput test: 4 concurrent streams
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpch/streams/stream_{01,02,03,04}.json benchmarks/tpch/sf1k.json \
    benchmarks/java_oss.json benchmarks/c1w2.json
```

TPC-DS is a decision-support benchmark with 99 queries over a retail sales schema of 24 tables.
Directory: benchmarks/tpc-ds/
99 SQL queries plus 5 ordered variants in queries/. The ordered variants (query_36_ordered, query_65_ordered, query_71_ordered, query_73_ordered, query_77_ordered) add deterministic ORDER BY clauses for result comparison.
| File | Description |
|---|---|
| ds_power.json | All 99 queries with expected row counts at multiple scale factors |
| ds_full.json | All 104 queries (99 + 5 ordered variants) |
| ds_atomic.json | 44 queries testing individual SQL operations (joins, aggregations, set operations) |
| ds_subset.json | Subset of queries |
| ds_rand5.json | Randomly execute 5 queries from the pool |
| ds_rand15m.json | Random execution for 15 minutes |
| ds_rand50.json | Randomly execute 50 queries |
| File | Scale Factor | Catalog / Schema |
|---|---|---|
| sf1.json | 1 GB | TPC-DS connector, schema sf1 |
| sf10.json | 10 GB | Hive, tpcds_sf10_parquet_varchar |
| sf100.json | 100 GB | Hive, tpcds_sf100_parquet_v2 |
| sf1k.json | 1 TB | Hive, tpcds_sf1000_parquet_v2 |
| sf10k.json | 10 TB | Hive, tpcds_sf10000_parquet |
| sf30k.json | 30 TB | Hive, tpcds_sf30000_parquet |
| sf100k.json | 100 TB | Hive, tpcds_sf100000_parquet |
| File | Format | Notes |
|---|---|---|
| sf1k_ice.json | Iceberg | |
| sf1k_ice_par.json | Iceberg (partitioned) | |
| sf1k_ice_uncompressed.json | Iceberg (uncompressed) | |
| sf10k_ice.json | Iceberg | |
| sf10k_ice_par.json | Iceberg (partitioned) | |
| sf30k_ice.json | Iceberg | |
| sf1k_par.json | Parquet (partitioned) | |
| sf10k_par.json | Parquet (partitioned) | |
| sf100k_par.json | Parquet (partitioned) | |
| sf10k_dwrf.json | DWRF (ORC) | |
23 stream files in streams/ (stream_01.json – stream_23.json). Each stream runs all 99 queries in a different order for concurrent throughput testing.
```shell
# Power test at SF10000
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpc-ds/ds_power.json benchmarks/tpc-ds/sf10k.json \
    benchmarks/native_oss.json benchmarks/c1w2.json

# Random 50 queries at SF1000
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpc-ds/ds_rand50.json benchmarks/tpc-ds/sf1k.json \
    benchmarks/native_oss.json benchmarks/c1w2.json
```

ClickBench is an OLAP benchmark with 43 queries over a single wide hits table of web analytics data.
Directory: benchmarks/clickbench/
43 SQL queries in queries/ (query_01.sql – query_43.sql).
| File | Description |
|---|---|
| clickbench.json | All 43 queries with expected row counts; sets offset_clause_enabled: true |

Schema: clickbench_parquet.
```shell
pbench run -s http://localhost:8080 -o results \
    benchmarks/clickbench/clickbench.json \
    benchmarks/java_oss.json benchmarks/c1w2.json
```

The Join Order Benchmark (JOB) uses the IMDB dataset to evaluate join ordering and cardinality estimation. It has 113 queries with complex multi-way joins.
Directory: benchmarks/imdb/
113 SQL queries in queries/, named by group and variant (1a.sql, 1b.sql, ..., 33c.sql).
| File | Description |
|---|---|
| imdb.json | All 113 queries; schema imdb |
```shell
pbench run -s http://localhost:8080 -o results \
    benchmarks/imdb/imdb.json \
    benchmarks/java_oss.json benchmarks/c1w2.json
```

Top-level JSON files in benchmarks/ configure engine settings and execution parameters. They are designed to be composed with benchmark stage files via multiple -f arguments or positional arguments.
These set catalog, session parameters, and pushdown settings for different Presto/Trino variants:
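As a rough illustration only, an engine stage file might pair a catalog with session properties along these lines (every key name below is an assumption, not pbench's actual schema; consult the real files in benchmarks/ for the true shape):

```json
{
  "catalog": "hive",
  "session_properties": {
    "join_distribution_type": "AUTOMATIC"
  }
}
```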
| File | Engine | Catalog |
|---|---|---|
| java_oss.json | Java OSS | Hive |
| native_oss.json | Native OSS | Hive |
| java_blueray.json | Java BlueRay | Hive |
| native_blueray.json | Native BlueRay | Hive |
| java_trino.json | Trino | Hive |
| java_oss_glue.json | Java OSS | Glue |
| native_oss_glue.json | Native OSS | Glue |
| java_blueray_glue.json | Java BlueRay | Glue |
| native_blueray_glue.json | Native BlueRay | Glue |
| File | Description |
|---|---|
| c1w2.json | 1 cold run, 2 warm runs |
| abort_on_error.json | Stop on first query failure |
| save_output.json | Save query result output |
| save_json.json | Save query info JSON |
| save_colmd.json | Save column metadata |
Stage files are merged left-to-right, so later files override earlier ones. A typical invocation combines a benchmark, a scale factor, an engine config, and execution settings:
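The override behavior can be pictured as a left-to-right dictionary merge. A minimal sketch of those semantics, assuming a shallow key-wise merge (pbench's actual merge may recurse into nested objects, and the keys below are illustrative, not the real stage-file schema):

```python
# Sketch of left-to-right stage-file merging: later files win on
# conflicting keys. Shallow merge assumed for illustration.
from functools import reduce


def merge_stages(*stages: dict) -> dict:
    return reduce(lambda acc, stage: {**acc, **stage}, stages, {})


benchmark = {"schema": "sf1", "runs": 1}            # illustrative keys
scale = {"schema": "tpcds_sf1000_parquet_v2"}       # overrides schema
execution = {"runs": 3}                             # overrides runs

merged = merge_stages(benchmark, scale, execution)
print(merged)
```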
```shell
# Compose: benchmark + queries, scale factor / schema,
# engine + session params, execution settings
pbench run -s http://localhost:8080 -o results \
    benchmarks/tpc-ds/ds_power.json \
    benchmarks/tpc-ds/sf1k.json \
    benchmarks/native_oss.json \
    benchmarks/c1w2.json
```

benchmarks/scripts/ contains Python utilities for cache management and database connectivity:
| Script | Description |
|---|---|
| presto_utils.py | Presto/Trino HTTPS connection and query helpers |
| mysql_utils.py | MySQL connection and query helpers |
| system_utils.py | SSH remote command execution (Paramiko) |
| cache_cleaning_coordinator.py | Clear Hive/Iceberg metadata caches on the coordinator |
| cache_cleaning_workers.py | Clear SSD, page, and memory caches on workers via SSH |