Awesome-OLAP-Paper

Introduction

A curated list of papers on Online Analytical Processing (OLAP) database systems, theory, frameworks, resources, tools, and other awesomeness, for database researchers and engineers.

Contributing

The repository is under construction. Pull requests are welcome; please follow the entry format used in this list:

paperName (with PDF link) (alias) [MeetingName Year] GitHub link if the code is open source (optional)

Acknowledgements

Thanks to all authors of the papers and repositories cited here :D

Table of Contents

Awesome-OLAP-Paper

Query-Aware Database Generation

QAGen: Generating Query-Aware Test Databases [SIGMOD 07]
Generating Targeted Queries for Database Testing [SIGMOD 08]
Generating Databases for Query Workloads [VLDB 10]
Data Generation using Declarative Constraints [SIGMOD 11]
MyBenchmark: generating databases for query workloads [VLDB 14]
Scalable and Dynamic Regeneration of Big Data Volumes [EDBT 18]
Touchstone: Generating Enormous Query-Aware Test Databases [ATC 18]
Synthesizing Linked Data Under Cardinality and Integrity Constraints [SIGMOD 21]
Projection-Compliant Database Generation [VLDB 22]
SAM: Database Generation from Query Workloads with Supervised Autoregressive Models [SIGMOD 22]
Mirage: Generating Enormous Databases for Complex Workloads [ICDE 24]
Query Aware Database Generation for Match Operators [DASFAA 24]
Controllable Tabular Data Synthesis Using Diffusion Models [SIGMOD 24]
A Query-Aware Enormous Database Generator For System Performance Evaluation [SIGMOD 25]

Privacy

PrivSyn: Differentially Private Data Synthesis [USENIX Security 21]
Synthesizing Linked Data Under Cardinality and Integrity Constraints [SIGMOD 21]
Data Synthesis via Differentially Private Markov Random Fields [VLDB 21]
PrivLava: Synthesizing Relational Data with Foreign Keys under Differential Privacy [SIGMOD 23]
Privacy-Enhanced Database Synthesis for Benchmark Publishing [VLDB 25]

Survey

Synthetic Data Generation for Enterprise DBMS [ICDE 23]

Query Schedule

Query Optimization

Sampling-Based Query Re-Optimization [SIGMOD 16]
Leveraging Re-costing for Online Optimization of Parameterized Queries with Guarantees [SIGMOD 17]
Adaptive Optimization of Very Large Join Queries [SIGMOD 18]
Efficient Massively Parallel Join Optimization for Large Queries [SIGMOD 22]
Leveraging Query Logs and Machine Learning for Parametric Query Optimization [VLDB 22]
Rethink Query Optimization in HTAP Databases [SIGMOD 24]
SPQO: Learning to Safely Reuse Cached Plans for Dynamic Workloads [DASFAA 24]
Optimizing Nested Recursive Queries [SIGMOD 24]
Efficient Enumeration of Recursive Plans in Transformation-based Query Optimizers [VLDB 24]
Presto's History-based Query Optimizer [VLDB 24]
RankPQO: Learning-to-Rank for Parametric Query Optimization [VLDB 25]

Robust Query Optimization

Robust query processing through progressive optimization [SIGMOD 04]
Robust Query Optimization Methods With Respect to Estimation Errors: A Survey [SIGMOD 15]
Efficient Query Re-optimization with Judicious Subquery Selections [SIGMOD 23]
ROME: Robust Query Optimization via Parallel Multi-Plan Execution [SIGMOD 24]

Query Rewrite

QueryBooster: Improving SQL Performance Using Middleware Services for Human-Centered Query Rewriting [VLDB 23]
SlabCity: Whole-Query Optimization using Program Synthesis [VLDB 23]
GEqO: ML-Accelerated Semantic Equivalence Detection [SIGMOD 24]
Proving Query Equivalence Using Linear Integer Arithmetic [SIGMOD 24]
QED: A Powerful Query Equivalence Decider for SQL [VLDB 24]
VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints [OOPSLA 24]
PoneglyphDB: Efficient Non-interactive Zero-Knowledge Proofs for Arbitrary SQL-Query Verification [SIGMOD 25]
Query Weak Equivalence and Its Verification in Analytical Databases [ICDE 25]
Proving Cypher Query Equivalence [ICDE 25]

Cardinality Estimation

Histogram

Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries [None 87]
Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results [ACM Transactions on Database Systems 93]
Selectivity Estimation Without the Attribute Value Independence Assumption (MHIST) [SIGMOD 97]
On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications [ICDT 99]
Approximating multi-dimensional aggregate range queries over real attributes (GENHIST) [SIGMOD 00]
Independence is good: Dependency-based histogram synopses for high-dimensional data (DBHist) [SIGMOD 01]
STHoles: a multidimensional workload-aware histogram [SIGMOD 01]
Selectivity Estimation using Probabilistic Models [SIGMOD 01]
A multi-dimensional histogram for selectivity estimation and fast approximate query answering [CASCON 03]
The history of histograms (abridged) [VLDB 03]
SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads [VLDB 03]
Selectivity estimators for multidimensional range queries over real attributes (GENHIST) [VLDB 03]
ISOMER: Consistent histogram construction using query feedback [ICDE 06]
Join Over Histograms [Alberto Dell'Era 07]
Consistent Histograms In The Presence of Distinct Value Counts [VLDB 08]
Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions [VLDB 11]
Efficiently adapting graphical models for selectivity estimation [VLDB 13]
Improving Accuracy and Robustness of Self-Tuning Histograms by Subspace Clustering [TKDE 15]
TKHist: Cardinality Estimation for Join Queries via Histograms with Dominant Attribute Correlation Finding [arXiv 25]

Sampling

Two-Level Sampling for Join Size Estimation [SIGMOD 17]
Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing [SIGMOD 21]

Learn Data Distribution Function

Cost Model

View

Foreign Keys Open the Door for Faster Incremental View Maintenance [SIGMOD 23]

Survey

Index

SQL Server Column Store Indexes [SIGMOD 11]
The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases [ICDE 13]
Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation [SIGMOD 18]
CUBIT: Concurrent Updatable Bitmap Indexing [VLDB 25]
B-Trees Are Back: Engineering Fast and Pageable Node Layouts [SIGMOD 25]

Query Execution

MonetDB/X100: Hyper-Pipelining Query Execution [CIDR 05]
DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing [DaMoN 08]
Materialization Strategies in the Vertica Analytic Database: Lessons Learned [ICDE 13]
Adaptive Query Processing in the Looking Glass [CIDR 15]
Rethinking SIMD Vectorization for In-Memory Databases [SIGMOD 15]
Efficient Processing of Window Functions in Analytical SQL Queries [VLDB 15]
Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? [SIGMOD 17]
Looking Ahead Makes Query Plans Robust [VLDB 17]
Building Advanced SQL Analytics From Low-Level Plan Operators [SIGMOD 21]
SkinnerMT: Parallelizing for Efficiency and Robustness in Adaptive Query Processing on Multicore Platforms [VLDB 22]
ChainedFilter: Combining Membership Filters by Chain Rule [SIGMOD 24]
Saving Money for Analytical Workloads in the Cloud [VLDB 24]
Adaptive and Robust Query Execution for Lakehouses at Scale [VLDB 24]
DuckDB-SGX2: The Good, The Bad and The Ugly within Confidential Analytical Query Processing [DaMoN 24]
The Key to Effective UDF Optimization: Before Inlining, First Perform Outlining [VLDB 25]
High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance [SIGMOD 25]
Data Chunk Compaction in Vectorized Execution [SIGMOD 25]
FAAQP: Fast and Accurate Approximate Query Processing based on Bitmap-augmented Sum-Product Network [SIGMOD 25]
OLTP in the Cloud: Architectures, Tradeoffs, and Cost [VLDB 25]

Data Dependency Search

Discovering Functional Dependencies through Hitting Set Enumeration [SIGMOD 24]

Query Compilation

How to Architect a Query Compiler [SIGMOD 16]
Adaptive Execution of Compiled Queries [ICDE 18]

Bug Detection

Functional Bug

Logical Bug

Search-Based Test Data Generation for SQL Queries [ICSE 18]
Finding Bugs in Database Systems via Query Partitioning [OOPSLA 20]
Detecting Optimization Bugs in Database Engines via Non-Optimizing Reference Engine Construction [FSE 20]
Testing Database Engines via Query Plan Guidance [ICSE 23]
GDsmith: Detecting Bugs in Cypher Graph Database Engines [ISSTA 23]
Snowcat: Efficient Kernel Concurrency Testing using a Learned Coverage Predictor [SOSP 23]
Detecting Isolation Bugs via Transaction Oracle Construction [ICSE 23]
Detecting Logic Bugs of Join Optimizations in DBMS [SIGMOD 23 Best Paper]
Fonte: Finding Bug Inducing Commits from Failures [ICSE 23]
Detecting Metadata-Related Logic Bugs in Database Systems via Raw Database Construction [VLDB 24]
CONI: Detecting Database Connector Bugs via State-Aware Test Case Generation [ICSE 24]
WINGFUZZ: Implementing Continuous Fuzzing for DBMSs [ATC 24]
Keep It Simple: Testing Databases via Differential Query Plans [SIGMOD 24]
Plume: Efficient and Complete Black-Box Checking of Weak Isolation Levels [OOPSLA 24]
DBStorm: Generating Various Effective Workloads for Testing Isolation Levels [ISSTA 24]
SQLaser: Detecting DBMS Logic Bugs with Clause-Guided Fuzzing [arXiv 24]
Understanding and Detecting SQL Function Bugs [EuroSys 25]
Understanding and Reusing Test Suites Across Database Systems [SIGMOD 25]
Detecting Logic Bugs in Database Engines via Equivalent Expression Transformation [OSDI 24]
THANOS: DBMS Bug Detection via Storage Engine Rotation Based Differential Testing [ICSE 25]
Semantic Conformance Testing of Relational DBMS [VLDB 25]
Automatic Database Configuration Debugging using Retrieval-Augmented Language Models [SIGMOD 25]
Finding Logic Bugs in Spatial Database Engines via Affine Equivalent Input [SIGMOD 25]
Constant Optimization Driven Database System Testing [SIGMOD 25]
Blackbox Fuzzing of Distributed Systems with Multi-Dimensional Inputs and Symmetry-Based Feedback Pruning [NDSS 25]
Finding Logic Bugs in Graph-processing Systems via Graph-cutting [SIGMOD 25]
Model Checking Guided Incremental Testing for Distributed Systems [ISSTA 25]
Scaling Automated Database System Testing [arXiv 25]
Testing Database Systems with Large Language Model Synthesized Fragments [arXiv 25]
Detecting Schema-Related Logic Bugs in Relational DBMSs via Equivalent Database Construction [VLDB 25]
Simple Testing Can Expose Most Critical Transaction Bugs: Understanding and Detecting Write-Specific Serializability Violations in Database Systems [VLDB 25]
Detecting Isolation Anomalies in Relational DBMSs [ISSTA 25]
Vbox: Efficient Black-Box Serializability Verification [arXiv 25]
Fucci: Database Transaction Fuzzing via Random Conflict Construction and Multilevel Constraint Solving [VLDB 25]
DDLUMOS: Understanding and Detecting Atomic DDL Bugs in DBMSs [ATC 25]
Detecting Logic Bugs in DBMSs via Equivalent Data Construction [SIGMOD 25]
SRS: Detecting Logic Bugs of Join Implementation in DBMSs via Set Relation Synthesis [SIGMOD 25]
ARG: Testing Query Rewriters via Abstract Rule Guided Fuzzing [ASE 25]
Anomaly Pattern-guided Transaction Bug Testing in Relational Databases [SIGMOD 26]

Crash Bug

Performance Bug

Survey

A Comprehensive Survey on Database Management System Fuzzing: Techniques, Taxonomy and Experimental Comparison [arXiv 23]
Survey on Database Management System Fuzzing Techniques [Journal of Software 24]

Static Analysis

Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach [PACMPL 24]

Causal Inference

Code Location

Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces [arXiv 25]

Reduction

SQLess: Dialect-Agnostic SQL Query Simplification [ISSTA 24]

Storage

What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines [VLDB 23]
An Empirical Evaluation of Columnar Storage Formats [VLDB 24]
Leco: Lightweight compression via learning serial correlations [SIGMOD 24]
Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine [SIGMOD 24]
NULLS! Revisiting Null Representation in Modern Columnar Formats [DaMoN 24]
Boosting OLTP Performance with Per-Page Logging on NVDIMM [SIGMOD 25]
Data formats in analytical DBMSs: performance trade-offs and future directions [VLDBJ 25]
Data chunk compaction in vectorized execution [SIGMOD 25]
Lance: Efficient Random Access in Columnar Storage through Adaptive Structural Encodings [arXiv 25]
Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility [VLDB 25]
F3: The Open-Source Data File Format for the Future [SIGMOD 26]

LSM-Tree

Dissecting, Designing, and Optimizing LSM-based Data Stores [SIGMOD 22 Tutorial]
Magma: A High Data Density Storage Engine Used in Couchbase [VLDB 22]
CaaS-LSM: Compaction-as-a-Service for LSM-based Key-Value Stores in Storage Disaggregated Infrastructure [SIGMOD 24]
CAMAL: Optimizing LSM-trees via Active Learning [SIGMOD 25]
Disco: A Compact Index for LSM-trees [SIGMOD 25]
Randomized Sketches for Quantile in LSM-tree based Store [SIGMOD 25]
Rethinking The Compaction Policies in LSM-trees [SIGMOD 25]
DFlush: DPU-Offloaded Flush for Disaggregated LSM-based Key-Value Stores [SIGMOD 25]
Rethinking LSM-tree based Key-Value Stores: A Survey [arXiv 25]
How to Grow an LSM-tree? Towards Bridging the Gap Between Theory and Practice [SIGMOD 25]

Kd-Tree

Parallel kd-tree with Batch Updates [SIGMOD 25]

Proxy

Tigger: A Database Proxy That Bounces With User-Bypass [VLDB 23]

Data Transfer

Fast and Scalable Data Transfer Across Data Systems [SIGMOD 25]

Data Loading

ConnectorX: Accelerating Data Loading From Databases to Dataframes [VLDB 22]

Database Kernel

Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics [CIDR 21]
Disaggregated Database Systems [VLDB 23 Tutorial]
GPU Database Systems Characterization and Optimization [VLDB 24]
The Art of Latency Hiding in Modern Database Engines [VLDB 24]
DoppelGanger++: Towards Fast Dependency Graph Generation for Database Replay [SIGMOD 24]
Rapid Data Ingestion through DB-OS Co-design [SIGMOD 25]
Practical DB-OS Co-Design with Privileged Kernel Bypass [SIGMOD 25]

Cloud

Survey

OLTP in the cloud: architectures, tradeoffs, and cost [VLDBJ 25]

Optimization

Principles and Methodologies for Serial Performance Optimization [OSDI 25]

Transactions

Survey

Others

Performance Optimization

Principles and Methodologies for Serial Performance Optimization [OSDI 25]

MVCC

Scalable Garbage Collection for In-Memory MVCC Systems [VLDB 13]
Rethinking serializable multiversion concurrency control [VLDB 15]
An Empirical Evaluation of In-Memory Multi-Version Concurrency Control [VLDB 17]
Accelerating Analytical Processing in MVCC using Fine-Granular High-Frequency Virtual Snapshotting [SIGMOD 18]
Long-lived Transactions Made Less Harmful [SIGMOD 20]
Rethink the Scan in MVCC Databases [SIGMOD 21]
Diva: Making MVCC Systems HTAP-Friendly [SIGMOD 22]
Memory-Optimized Multi-Version Concurrency Control for Disk-Based Database Systems [VLDB 22]
Scalable and Robust Snapshot Isolation for High-Performance Storage Engines [VLDB 23]
One-shot Garbage Collection for In-memory OLTP through Temporality-aware Version Storage [SIGMOD 23]

HTAP

System Architecture

Linear Consistency

Sequential Consistency

Session Consistency

Survey

HTAP Databases: What is New and What is Next [SIGMOD 22]
Data Sharing Model and Optimization Strategies in HTAP Database Systems [Journal of Software 23]
HTAP Databases: A Survey [TKDE 24]
A survey on hybrid transactional and analytical processing [VLDB Journal 24]
Survey on Benchmarking Ability of HTAP Benchmarks [Journal of Software 24]

Kernel Optimization

Result Replay

Benchmark

Survey

Surprise Benchmarking: The Why, What, and How [DBTest 24]

AI

TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems [VLDB 23]

OLTP

Dike: A Benchmark Suite for Distributed Transactional Databases [SIGMOD 23]
DBPA: A Benchmark for Transactional Database Performance Anomalies [SIGMOD 23]

OLAP

TPC-DS, Taking Decision Support Benchmarking to the Next Level [SIGMOD 02]
Generating Thousands of Benchmark Queries in Seconds [VLDB 04]
The Making of TPC-DS [VLDB 06]
Why You Should Run TPC-DS: A Workload Analysis [VLDB 07]
Introducing Skew into the TPC-H Benchmark [21]

HTAP

Cloud

Cloud Analytics Benchmark [VLDB 23]
PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking [VLDB 25]
CloudyBench: A Testbed for A Comprehensive Evaluation of Cloud-Native Databases [ICDE 25]
Redbench: Workload Synthesis From Cloud Traces [VLDB 26]

Others

Time Series

An Experimental Evaluation of Anomaly Detection in Time Series [VLDB 24]

Multi-Model

Multi-Modal

Beyond Relational: Semantic-Aware Multi-Modal Analytics with LLM-Native Query Optimization [arXiv 25]

Survey & Benchmark

Vector Database

Survey

Algorithm

Distributed Systems

Consistency in Non-Transactional Distributed Storage Systems [arXiv 15]
NOC-NOC: Towards Performance-optimal Distributed Transactions [SIGMOD 24]
Native Distributed Databases: Problems, Challenges and Opportunities [VLDB 24 Tutorial]

OLTP

Survey

A survey on transactional stream processing [VLDBJ 23]
Are Database System Researchers Making Correct Assumptions about Transaction Workloads? [SIGMOD 25]

Wind-Gone/awesome-olap-paper