This repository was archived by the owner on Mar 3, 2026. It is now read-only.
benchmarks: Flesh out scripts for the SQL benchmark #762
Merged
# Benchmarks of Kaskada vs. SQL

Use `generate.py` to create test data:

```bash
python generate.py --users 500 --items 500 --purchases 10000 --page_views 5000 --reviews 2500
```

This will generate the Parquet files (purchases, page views, and reviews) in the current directory.

## Kaskada (Fenl)

Run the statements in `kaskada.ipynb` using the latest Fenl-supporting Python client.
The RPCs should return the "query time" within each result table.

## DuckDB

Run the queries from `queries_duckdb.sql` one at a time.
With the `enable_profiling` pragma, each query should report its execution time.

## DataFusion

The SQL statements aren't ready yet -- they run, but we haven't figured out how to write the results to Parquet to measure end-to-end time.
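As a minimal sketch of what the "Aggregation / History" benchmark query computes -- a running `SUM(amount) OVER (PARTITION BY user ORDER BY time)` -- here is a stdlib-only Python equivalent. The sample rows are hypothetical, not data from the benchmark:

```python
from collections import defaultdict

def agg_history(purchases):
    """Cumulative spend per user over time, mirroring
    SUM(amount) OVER (PARTITION BY user ORDER BY time).

    purchases: iterable of (user, time, amount) tuples.
    Returns rows of (user, time, running_total) in time order.
    """
    totals = defaultdict(float)
    out = []
    # Process events in time order, as the ORDER BY time window does.
    for user, time, amount in sorted(purchases, key=lambda p: p[1]):
        totals[user] += amount
        out.append((user, time, totals[user]))
    return out
```

This is only a semantic reference for checking benchmark outputs, not an efficient implementation.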
```sql
-- Run with `datafusion-cli --data-path <directory>`.
-- Then send these commands.

-- Load the data.
-- These external tables weren't usable, so inline definitions are used instead.
--
-- CREATE EXTERNAL TABLE Purchases
-- STORED AS parquet
-- LOCATION 'purchases.parquet';
--
-- CREATE EXTERNAL TABLE Reviews
-- STORED AS parquet
-- LOCATION 'reviews.parquet';
--
-- CREATE EXTERNAL TABLE PageViews
-- STORED AS parquet
-- LOCATION 'page_views.parquet';

-- Aggregation / History
COPY (SELECT
  user,
  time,
  SUM(amount) OVER (
    PARTITION BY user
    ORDER BY time
  )
FROM 'purchases.parquet') TO 'output/agg_history_df.parquet';

-- Aggregation / Snapshot
COPY (SELECT
  user,
  SUM(amount)
FROM Purchases
GROUP BY user) TO 'output/agg_snapshot_duckdb.parquet';

-- Time-Windowed Aggregation / History
COPY (SELECT
  user,
  time,
  SUM(amount) OVER (
    PARTITION BY
      user,
      time_bucket(INTERVAL '1 month', time)
    ORDER BY time
  )
FROM Purchases
ORDER BY time) TO 'output/windowed_history_duckdb.parquet';

-- Time-Windowed Aggregation / Snapshot
COPY (SELECT
  user,
  SUM(amount)
FROM Purchases
WHERE time_bucket(INTERVAL '1 month', time) >= time_bucket(INTERVAL '1 month', DATE '2022-05-03')
GROUP BY user) TO 'output/windowed_snapshot_duckdb.parquet';

-- Data-Defined Windowed Aggregation / History
COPY (WITH activity AS (
  (SELECT user, time, 1 AS is_page_view FROM PageViews)
  UNION
  (SELECT user, time, 0 AS is_page_view FROM Purchases)
), purchase_counts AS (
  SELECT
    user, time, is_page_view,
    SUM(CASE WHEN is_page_view = 0 THEN 1 ELSE 0 END)
      OVER (PARTITION BY user ORDER BY time) AS purchase_count
  FROM activity
), page_views_since_purchase AS (
  SELECT
    user, time,
    SUM(CASE WHEN is_page_view = 1 THEN 1 ELSE 0 END)
      OVER (PARTITION BY user, purchase_count ORDER BY time) AS views
  FROM purchase_counts
)
SELECT user, time,
  AVG(views) OVER (PARTITION BY user ORDER BY time)
    AS avg_views_since_purchase
FROM page_views_since_purchase
ORDER BY time) TO 'output/data_defined_history_duckdb.parquet';

-- Temporal Join / Snapshot [Spline]
--
-- Not reported -- the ASOF join below is more efficient.
COPY (WITH review_avg AS (
  SELECT item, time,
    AVG(rating) OVER (PARTITION BY item ORDER BY time) AS avg_score
  FROM Reviews
), review_times AS (
  SELECT item, review_avg.time AS time, review_avg.time AS r_time,
    CAST(NULL AS TIMESTAMP) AS p_time
  FROM review_avg
), purchase_times AS (
  SELECT item, Purchases.time AS time, Purchases.time AS p_time,
    CAST(NULL AS TIMESTAMP) AS r_time
  FROM Purchases
), all_times AS (
  (SELECT * FROM review_times) UNION (SELECT * FROM purchase_times)
), spline AS (
  SELECT item, time, MAX(r_time) OVER w AS last_r_time
  FROM all_times
  WINDOW w AS (PARTITION BY item ORDER BY time)
)
SELECT user, Purchases.time, avg_score
FROM Purchases
LEFT JOIN spline
  ON Purchases.time = spline.time AND Purchases.item = spline.item
LEFT JOIN review_avg
  ON spline.last_r_time = review_avg.time
    AND Purchases.item = review_avg.item) TO 'output/temporal_join_spline_snapshot_duckdb.parquet';

-- Temporal Join / Snapshot [ASOF Join]
COPY (WITH review_avg AS (
  SELECT item, time,
    AVG(rating) OVER (PARTITION BY item ORDER BY time) AS avg_score
  FROM Reviews
)
SELECT p.user, p.time, r.avg_score
FROM review_avg r ASOF RIGHT JOIN Purchases p
  ON p.item = r.item AND r.time >= p.time) TO 'output/temporal_join_asof_snapshot_duckdb.parquet';
```
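For reference, the final ASOF join can be sketched in stdlib-only Python, assuming the intended semantics are "for each purchase, the latest review average at or before the purchase time" (the conventional temporal-join reading). The tuples here are hypothetical sample data, not the generated benchmark files:

```python
import bisect

def asof_join(purchases, review_avgs):
    """For each purchase (item, time), attach the avg_score from the
    latest review_avgs row (item, time, avg_score) with time <= the
    purchase time, or None when no earlier review exists.
    """
    # Index review averages per item as parallel, time-sorted lists.
    by_item = {}
    for item, t, score in sorted(review_avgs, key=lambda r: r[1]):
        times, scores = by_item.setdefault(item, ([], []))
        times.append(t)
        scores.append(score)
    out = []
    for item, t in purchases:
        times, scores = by_item.get(item, ([], []))
        # Rightmost review time <= purchase time.
        i = bisect.bisect_right(times, t) - 1
        out.append((item, t, scores[i] if i >= 0 else None))
    return out
```

This mirrors the ASOF-join lookup as a binary search per purchase, which is why it tends to beat the spline construction above.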