CompactDS

This repository contains the code for building the datastore and obtaining retrieval results from it, as described in Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks.

Refer to compactds-eval for running evaluations using the retrieval results.

Citation

@article{lyu2025compactds,
  title={Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks},
  author={Xinxi Lyu and Michael Duan and Rulin Shao and Pang Wei Koh and Sewon Min},
  journal={arXiv preprint arXiv:2507.01297},
  year={2025}
}

Announcement

07/01/25: We officially release the index and the code for CompactDS.

Installation

To create a conda environment named scaling with Python 3.11:

conda env create -f environment.yaml
conda activate scaling
huggingface-cli login --token <your_hf_token> # skip if you use custom data

Quick Start

Set up CompactDS

Download the index we built for CompactDS:

bash scripts/download_compactds.sh --output_path datastores/compactds

Run Retrieval

Download the search queries we included in the paper for the five datasets: MMLU, MMLU Pro, AGI Eval, GPQA, and Minerva Math.

python scripts/download_queries.py --output_path queries

To obtain the top 1000 documents for each MMLU Pro query from CompactDS (the last two overrides are optional):

python -m src.main_ric \
    --config-name CompactDS \
    tasks.eval.search=true \
    datastore.embedding.embedding_dir=datastores/compactds/embeddings \
    datastore.embedding.passages_dir=datastores/compactds/passages \
    evaluation.data.eval_data=queries/mmlu_pro.jsonl \
    tasks.eval.task_name=lm-eval \
    evaluation.search.n_docs=1000

To run exact search to rerank the top 1000 documents (the last two overrides are optional):

python -m src.main_ric \
    --config-name CompactDS \
    tasks.eval.exact_rerank=true \
    datastore.embedding.embedding_dir=datastores/compactds/embeddings \
    datastore.embedding.passages_dir=datastores/compactds/passages \
    evaluation.data.eval_data=queries/mmlu_pro.jsonl \
    tasks.eval.task_name=lm-eval \
    evaluation.search.n_docs=1000
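
Conceptually, exact reranking rescores the approximately retrieved candidates with uncompressed vectors and sorts them again. The following is a minimal sketch of that idea, not the repository's implementation: the function name rerank_exact and the toy vectors are hypothetical, and inner product is assumed as the similarity.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact (uncompressed) inner product between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def rerank_exact(
    query_vec: Sequence[float],
    candidate_ids: List[str],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Rescore candidates from an approximate search and keep the best top_k."""
    scored = [
        (doc_id, inner_product(query_vec, doc_vecs[doc_id]))
        for doc_id in candidate_ids
    ]
    # Sort by exact score, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: three candidates returned by an approximate search.
doc_vecs = {"d1": [0.1, 0.9], "d2": [0.8, 0.2], "d3": [0.5, 0.5]}
reranked = rerank_exact([1.0, 0.0], ["d1", "d2", "d3"], doc_vecs, top_k=2)
print(reranked)
```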

Custom Index Building

Step 0: Configuration and Command Format

All parameters are defined in the ric/conf/*.yaml files. At runtime, specify the name of the config file with --config-name.
Instead of modifying the config files directly, you can also override individual parameters on the command line (e.g., evaluation.data.eval_data=queries/mmlu:mc::retrieval_q.jsonl).

The command for each of the following steps therefore has the form:

python -m src.main_ric \
    --config-name <config_name> \
    xxx.yyy=zzz \
    ...

Refer to the existing config files for the default settings for our datastores.
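
The --config-name flag and the dotted key=value overrides resemble a Hydra-style command line; this is an assumption based on the command format, so check the src.main_ric module for the actual mechanism. The sketch below only illustrates how a dotted override addresses a nested config value. The function apply_override and the sample defaults are hypothetical, and values are kept as plain strings for simplicity.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate sections that are missing from the defaults.
        node = node.setdefault(name, {})
    node[leaf] = value


# Hypothetical defaults, standing in for values loaded from a YAML config.
config = {
    "evaluation": {"search": {"n_docs": "100"}},
    "tasks": {"eval": {"search": "false"}},
}
for arg in ["tasks.eval.search=true", "evaluation.search.n_docs=1000"]:
    apply_override(config, arg)
print(config)
```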

Step 1: Vector Building

To build a datastore, the raw text from each data source must be chunked into passages and embedded into a vector space.

Prepare the raw data

The raw data files need to be jsonl files (optionally compressed), each with the following format:

{"text": xxx..., "other_key": ...., ...}
{"text": xxx..., "other_key": ...., ...}
{"text": xxx..., "other_key": ...., ...}

All of these jsonl files need to be placed in a single directory (e.g., raw_data/pes2o).
Alternatively, to reproduce CompactDS, download the raw data:

python scripts/download_raw_data.py \
    --output_path raw_data \
    --subfolder_path pes2o  # Remove for downloading the full CompactDS
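
For custom data, the jsonl format described above can be produced with a short script. This is a minimal sketch under assumptions: the directory raw_data/my_source, the file name, and the extra source key are hypothetical, and gzip is assumed as the compression because the text above only says the files can be compressed.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterable


def write_raw_jsonl(records: Iterable[Dict], path: Path) -> int:
    """Write one JSON object per line to a gzip-compressed jsonl file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            if "text" not in record:
                raise ValueError("every record needs a 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical documents; "source" is an example of an extra key.
docs = [
    {"text": "First document body.", "source": "example"},
    {"text": "Second document body.", "source": "example"},
]
out_path = Path("raw_data/my_source/part-000.jsonl.gz")
n_written = write_raw_jsonl(docs, out_path)
print(f"wrote {n_written} records to {out_path}")
```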

Build vectors

To build vectors for a single data source (e.g., PeS2o):

python -m src.main_ric \
    --config-name pes2o \
    tasks.datastore.embedding=true \
    datastore.raw_data_path=raw_data/pes2o \
    datastore.embedding.output_dir=datastores/pes2o

For multiple data sources, build the vectors for each of them separately. To build vectors for all 10 downloaded CompactDS data sources from raw_data and save the results in datastores, run bash scripts/build_all_vectors.py raw_data datastores.

Important Parameters for Customization

model.datastore_encoder, model.datastore_tokenizer, query_encoder, query_tokenizer: the models / tokenizers used for text embedding. These parameters should all be the same in most cases.
datastore.domain: the customized name of the datastore.
datastore.raw_data_path: the path to the directory that contains the raw data.
datastore.chunk_size: the number of words to embed into a vector.
datastore.embedding.datastore.no_fp16: disable fp16 (half precision) for embedding if set to True.
datastore.embedding.per_gpu_batch_size: batch size for embedding.
datastore.embedding.output_dir: path to the output dir.
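
To make datastore.chunk_size concrete, the sketch below splits text into passages of a fixed number of words. It assumes whitespace-separated words and non-overlapping chunks; the repository's own chunking may differ, and chunk_words is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive passages of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


passages = chunk_words("one two three four five six seven", chunk_size=3)
print(passages)  # ['one two three', 'four five six', 'seven']
```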

Step 2: Build the Index

We use Faiss to build the index. To make very large datastores deployable within conventional RAM limits, we use IVFPQ (Inverted File Product Quantization) indices in our paper.

To build the index for single-source vectors (e.g., PeS2o)

Run:

python -m src.main_ric \
    --config-name pes2o \
    tasks.datastore.index=true \
    datastore.embedding.embedding_dir=datastores/pes2o \
    datastore.embedding.passages_dir=datastores/pes2o/passages

To build the index from multiple-source vectors (e.g., full CompactDS)

The vectors and passages need to be aggregated into the same directories, which can be done by creating symbolic links to the vectors and passages of the individual data sources.
To reproduce CompactDS, create symbolic links for all 10 data sources under datastores in datastores/compactds:

bash create_symlink_vectors.sh datastores datastores/compactds
bash create_symlink_passages.sh datastores datastores/compactds

Then build the index:

python -m src.main_ric \
    --config-name CompactDS \
    tasks.datastore.index=true \
    datastore.embedding.embedding_dir=datastores/compactds \
    datastore.embedding.passages_dir=datastores/compactds/passages

Important Parameters for Customization

datastore.embedding.embedding_dir: path to the vector files.
datastore.embedding.passages_dir: path to the directory that contains the raw passage files.
datastore.index.index_type: index type. We use IVFPQ for our paper. Alternatively, setting it to Flat will build an index for exact search without approximation.
datastore.index.ncentroids: number of clusters. In theory, more clusters lengthen index building and shorten search. The recommended value is $4\sqrt{n_{\text{vectors}}}$ to $8\sqrt{n_{\text{vectors}}}$.
datastore.index.n_subquantizers: number of subquantizers. Theoretically, it is positively correlated with precision and (linearly) with the resulting index size.
datastore.index.sample_train_size: number of vectors sampled for training the index. The recommended value is 1% to 10% of the total number of vectors.
datastore.index.n_bits: number of bits per subquantizer for compression.
datastore.index.save_intermediate_index: will save an intermediate index after adding the vectors from each domain if set to True.
datastore.index.deprioritized_domains: list of domains that will be added last during index building.
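
The rules of thumb above can be turned into numbers for a given datastore size. The sketch below is an illustration only: suggest_index_params is a hypothetical helper, the example sizes are made up, and the size estimate covers just the product-quantization codes (n_subquantizers * n_bits / 8 bytes per vector), not the ids or centroids stored alongside them.

```python
import math
from typing import Dict


def suggest_index_params(
    n_vectors: int, n_subquantizers: int, n_bits: int
) -> Dict[str, float]:
    """Derive rule-of-thumb IVFPQ settings from the parameter notes above."""
    root = math.sqrt(n_vectors)
    code_bytes = n_subquantizers * n_bits / 8  # compressed size of one vector
    return {
        "ncentroids_min": round(4 * root),
        "ncentroids_max": round(8 * root),
        "sample_train_size_min": round(0.01 * n_vectors),
        "sample_train_size_max": round(0.10 * n_vectors),
        "code_bytes_per_vector": code_bytes,
        # Codes only; inverted-list ids and centroids add overhead.
        "approx_code_gib": n_vectors * code_bytes / 2**30,
    }


# Hypothetical datastore of 1M vectors with 64 subquantizers of 8 bits each.
params = suggest_index_params(n_vectors=1_000_000, n_subquantizers=64, n_bits=8)
for name, value in params.items():
    print(f"{name}: {value}")
```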

Custom Queries Search

To search with custom queries when tasks.eval.task_name=lm-eval, put the search queries in a jsonl file (e.g., your_queries.jsonl) with the following format:

{"query": xxx..., "other_key": ...., ...}
{"query": xxx..., "other_key": ...., ...}
{"query": xxx..., "other_key": ...., ...}

where each JSON object should contain a field query whose value is a query. Any other fields will be preserved in the result file.

Alternatively, change tasks.eval.task_name to support a different file format. See load_eval_data() in src/data.py for details.
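
A query file in this shape can be written with a few lines of Python. This is a minimal sketch: the questions and the extra id field are hypothetical, and the query key follows the example format shown above.

```python
import json
from pathlib import Path

# Hypothetical queries; "id" is an example of an extra field.
queries = [
    {"query": "What is the boiling point of water at sea level?", "id": "q1"},
    {"query": "Who proved Fermat's Last Theorem?", "id": "q2"},
]

query_path = Path("your_queries.jsonl")
with query_path.open("w", encoding="utf-8") as f:
    for item in queries:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Read the file back to confirm that every line is a valid JSON object.
with query_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), "queries written to", query_path)
```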

To perform the search, run:

python -m src.main_ric \
    --config-name CompactDS \
    tasks.eval.search=true \
    tasks.eval.task_name=lm-eval \
    evaluation.data.eval_data=your_queries.jsonl \
    evaluation.search.n_docs=1000

Important Parameters

evaluation.data.eval_data: the path to the query file.
tasks.eval.task_name: used to specify the function to load the query file in src.data.load_eval_data().
evaluation.search.n_docs: number of relevant documents to retrieve for each query.
evaluation.search.probe: number of probes, i.e., clusters searched per query. Theoretically, it is positively correlated with precision and search time.
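
The trade-off behind evaluation.search.probe can be seen in a toy inverted-file search. The sketch below is not Faiss and not the repository's code: it uses random 2D points, randomly sampled points as stand-in cluster centers, and hypothetical helper names. It shows that probing more clusters scans more points, and that probing all of them reduces to exact search.

```python
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def sq_dist(a: Vector, b: Vector) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def build_ivf(points: Sequence[Vector], centroids: Sequence[Vector]) -> List[List[int]]:
    """Assign every point to the inverted list of its nearest centroid."""
    lists: List[List[int]] = [[] for _ in centroids]
    for idx, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(
    query: Vector,
    points: Sequence[Vector],
    centroids: Sequence[Vector],
    lists: List[List[int]],
    probe: int,
) -> Tuple[int, int]:
    """Scan only the `probe` closest lists; return (best index, points scanned)."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [idx for c in order[:probe] for idx in lists[c]]
    # -1 signals that the probed lists were empty.
    best = min(candidates, key=lambda i: sq_dist(query, points[i]), default=-1)
    return best, len(candidates)


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(2000)]
centroids = rng.sample(points, 16)  # stand-in for trained cluster centers
lists = build_ivf(points, centroids)
query = (0.5, 0.5)
exact = min(range(len(points)), key=lambda i: sq_dist(query, points[i]))
for probe in (1, 4, 16):
    found, scanned = ivf_search(query, points, centroids, lists, probe)
    print(f"probe={probe}: scanned {scanned} points, exact match: {found == exact}")
```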
