Debug, benchmark, and monitor your RAG retrieval layer. `EXPLAIN ANALYZE` for production RAG.
Quickstart • Commands • Why RagTune • Concepts • FAQ
| I want to... | Command |
|---|---|
| Debug a single query | `ragtune explain "my query" --collection prod` |
| Run batch evaluation | `ragtune simulate --collection prod --queries queries.json` |
| Get confidence intervals | `ragtune simulate --queries queries.json --bootstrap 20` |
| Set up CI/CD quality gates | `ragtune simulate --ci --min-recall 0.85` |
| Detect regressions | `ragtune simulate --baseline runs/latest.json --fail-on-regression` |
| Compare embedders | `ragtune compare --embedders ollama,openai --docs ./docs` |
| Evaluate external chunkers | `ragtune ingest ./chunks/ --collection test --pre-chunked` |
| Find missed answer content | `ragtune simulate --queries needles.json` (with needle annotations) |
| Quick health check | `ragtune audit --collection prod --queries queries.json` |
```bash
# 1. Start vector store
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# 2. Ingest documents
ragtune ingest ./docs --collection my-docs --embedder ollama

# 3. Debug retrieval
ragtune explain "How do I reset my password?" --collection my-docs
```

No API keys needed with Ollama (runs locally).
Already chunked your documents with an external tool? Use `--pre-chunked` to ingest them as-is (one file per chunk, no re-splitting):
```bash
# Ingest pre-chunked data (each file = one embedding unit)
ragtune ingest ./poma-chunksets/ --collection poma-test --embedder ollama --pre-chunked

# Compare against naive chunking
ragtune ingest ./raw-docs/ --collection naive-test --embedder ollama --chunk-size 512

# Benchmark both
ragtune simulate --collection poma-test --queries queries.json --bootstrap 20
ragtune simulate --collection naive-test --queries queries.json --bootstrap 20
```

Skip Docker entirely. Use your existing database:
```bash
ragtune ingest ./docs --collection my-docs --embedder ollama \
  --store pgvector --pgvector-url postgres://user:pass@localhost/mydb

ragtune explain "How do I reset my password?" --collection my-docs \
  --store pgvector --pgvector-url postgres://user:pass@localhost/mydb
```

```bash
# Save queries as you debug
ragtune explain "How do I reset my password?" --collection my-docs --save
ragtune explain "What are the rate limits?" --collection my-docs --save

# Run evaluation once you have 20+ queries
ragtune simulate --collection my-docs --queries golden-queries.json
```

Each `--save` adds the query to `golden-queries.json`.
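The saved file should look roughly like the queries format shown in the needle example below (a sketch, not the guaranteed schema):

```bash
cat golden-queries.json
# {
#   "queries": [
#     {"id": "q1", "text": "How do I reset my password?",
#      "relevant_docs": ["docs/auth/password-reset.md"]},
#     ...
#   ]
# }
```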
Query: "How do I reset my password?"
[1] Score: 0.8934 | Source: docs/auth/password-reset.md
Text: To reset your password: 1. Click "Forgot Password"...
[2] Score: 0.8521 | Source: docs/auth/account-security.md
Text: Account Security ## Password Management...
DIAGNOSTICS
Score range: 0.7234 - 0.8934 (spread: 0.1700)
✓ Strong top match (>0.85): likely high-quality retrieval
```
Running 50 queries...

Recall@5: 0.82   MRR: 0.76   Coverage: 0.94
Latency: p50=45ms p95=120ms

FAILURES: 3 queries with Recall@5 = 0
✗ "How do I configure SSO?"
  Expected: [sso-guide.md], Retrieved: [api-keys.md...]
💡 Run ragtune explain "<query>" to debug
```
Recall@K tells you whether the right document was retrieved. But retrieval can surface the right document while still missing the specific paragraph that actually answers the question, especially with structured or legal text where the relevant content is scattered across sections.
NeedleCoverage@K checks whether specific text spans ("needles") required to answer a query are present in the retrieved chunks. Just add needles to your queries file:
```json
{
  "queries": [{
    "id": "gdpr_fines",
    "text": "What fines can be imposed under the GDPR?",
    "relevant_docs": ["gdpr.txt"],
    "needles": [
      {"text": "up to 20 000 000 EUR", "source": "Art 83(5)"},
      {"text": "up to 10 000 000 EUR", "source": "Art 83(4)"}
    ]
  }]
}
```

```bash
ragtune simulate --collection prod --queries needles.json --embedder ollama
```

```
Recall@5:           1.000   # Right doc? Yes, always.
NeedleCoverage@5:   0.280   # Right content? Only 28% of the time.
```
No new flags needed — if your queries have needles, the metric appears automatically. Queries without needles work exactly as before.
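To inspect a failing needle by hand, pair `explain` with a plain substring check (a rough manual approximation; the tool's internal matching may differ):

```bash
# Does any retrieved chunk contain the needle text verbatim?
ragtune explain "What fines can be imposed under the GDPR?" --collection prod \
  | grep -F "up to 20 000 000 EUR"
```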
| Command | Purpose |
|---|---|
| `ingest` | Load documents into vector store |
| `explain` | Debug retrieval for a single query |
| `simulate` | Batch benchmark with metrics, needle coverage, and CI mode |
| `compare` | Compare embedders or chunk sizes |
| `audit` | Quick health check (pass/fail) |
| `report` | Generate markdown reports |
| `import-queries` | Import queries from CSV/JSON |
See CLI Reference for all flags and options.
```yaml
# .github/workflows/rag-quality.yml
- name: RAG Quality Gate
  run: |
    ragtune ingest ./docs --collection ci-test --embedder ollama
    ragtune simulate --collection ci-test --queries tests/golden-queries.json \
      --ci --min-recall 0.85 --min-coverage 0.90 --max-latency-p95 500
```

Exit code 1 if thresholds fail. See `examples/github-actions.yml` for complete setup.
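The same gate works anywhere you can read an exit code, for example as a pre-push hook (a sketch relying only on the documented exit-code behavior):

```bash
ragtune simulate --collection ci-test --queries tests/golden-queries.json \
  --ci --min-recall 0.85 || { echo "RAG quality gate failed" >&2; exit 1; }
```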
Compare against a baseline to catch regressions before they reach production:
```bash
# Compare current run against baseline
ragtune simulate --collection prod --queries golden.json \
  --baseline runs/baseline.json --fail-on-regression
```

Output shows deltas for each metric:
```
BASELINE COMPARISON
Comparing against: 2026-01-15T12:00:00Z
─────────────────────────────────────────────────────────────
Recall@5:     0.900 → 0.850   ↓ 5.6%    (REGRESSED)
MRR:          0.800 → 0.820   ↑ 2.5%    (improved)
Coverage:     0.950 → 0.950   = 0.0%    (unchanged)
Latency p95:  100ms → 120ms   ↑ 20.0%   (REGRESSED)
─────────────────────────────────────────────────────────────
❌ REGRESSION DETECTED
The following metrics decreased: [Recall@5, Latency p95]
```
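To establish a baseline in the first place, run a simulation you trust and keep its run file (a sketch assuming `simulate` writes its most recent run to `runs/latest.json`, the file referenced in the command table above):

```bash
ragtune simulate --collection prod --queries golden.json
cp runs/latest.json runs/baseline.json   # promote the known-good run
```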
RAG retrieval is a configuration problem: chunk size, embedding model, index type, top-k. Most teams tune by intuition. RagTune provides the measurement layer to make these decisions empirically, using standard IR metrics (Recall@k, MRR, NDCG) on your actual data.
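In practice, a chunk-size decision reduces to two ingests and two benchmark runs, using only flags shown elsewhere in this README (per the command table, `compare` can automate similar sweeps):

```bash
# Same docs, same embedder, two chunk sizes
ragtune ingest ./docs --collection chunks-256 --embedder ollama --chunk-size 256
ragtune ingest ./docs --collection chunks-512 --embedder ollama --chunk-size 512

# Benchmark both with confidence intervals, then pick the winner
ragtune simulate --collection chunks-256 --queries golden-queries.json --bootstrap 20
ragtune simulate --collection chunks-512 --queries golden-queries.json --bootstrap 20
```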
| What Matters | Impact |
|---|---|
| Domain-appropriate chunking | 7%+ recall difference |
| Embedding model choice | 5% difference |
| Continuous monitoring | Catches data drift before users do |
RagTune focuses on retrieval debugging, monitoring, and benchmarking, not end-to-end answer evaluation.
| | RagTune | Ragas / DeepEval | misbahsy/RAGTune |
|---|---|---|---|
| Focus | Retrieval layer | Full pipeline | Full pipeline |
| LLM calls | None required | Required | Required |
| Interface | CLI (CI/CD-native) | Python library | Streamlit UI |
| Speed | Fast (embedding only) | Slow (LLM inference) | Slow |
| CI/CD | First-class | Manual setup | None |
Use RagTune when: debugging retrieval, enforcing CI/CD quality gates, comparing embedders, or running deterministic benchmarks.
Use other tools when: evaluating LLM answer quality, or when you need `answer_relevancy`-style metrics.
Retrieval failures are silent. No error, no exception. Just gradually worse answers.
- Users complaining about "wrong answers" but you can't reproduce it
- No idea if that embedding change made things better or worse
- Retrieval was "good" in dev, failing in production
- You added documents but answers got worse
- Can't tell if the LLM is hallucinating or retrieval is broken
If any of these sound familiar:
ragtune explain "the query that's failing" --collection prod# Homebrew (macOS/Linux)
brew install metawake/tap/ragtune
# Go Install
go install github.com/metawake/ragtune/cmd/ragtune@latest
# Or download binary from GitHub ReleasesPrerequisites: Docker (for Qdrant), Ollama or API key for embeddings.
| Embedder | Setup | Best For |
|---|---|---|
| `ollama` | Local, no API key | Development, privacy |
| `openai` | `OPENAI_API_KEY` | General purpose |
| `voyage` | `VOYAGE_API_KEY` | Legal, code (domain-tuned) |
| `cohere` | `COHERE_API_KEY` | Multilingual |
| `tei` | Docker container | High throughput |
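Switching embedders is just the `--embedder` flag plus the matching environment variable from the table above, e.g.:

```bash
export OPENAI_API_KEY=sk-...   # placeholder key
ragtune ingest ./docs --collection my-docs --embedder openai
```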
| Store | Setup |
|---|---|
| Qdrant (default) | `docker run -p 6333:6333 qdrant/qdrant` |
| pgvector | `--store pgvector --pgvector-url postgres://...` |
| Weaviate | `--store weaviate --weaviate-host localhost:8080` |
| Chroma | `--store chroma --chroma-url http://localhost:8000` |
| Pinecone | `--store pinecone --pinecone-host HOST` |
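Store flags compose with any command, just like the pgvector example earlier, e.g. against a local Chroma instance:

```bash
ragtune explain "How do I reset my password?" --collection my-docs \
  --store chroma --chroma-url http://localhost:8000
```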
| Dataset | Documents | Purpose |
|---|---|---|
| `data/` | 9 | Quick testing |
| `benchmarks/hotpotqa-1k/` | 398 | General knowledge |
| `benchmarks/casehold-500/` | 500 | Legal domain |
| `benchmarks/synthetic-50k/` | 50,000 | Scale testing |
```bash
# Try it
ragtune ingest ./benchmarks/hotpotqa-1k/corpus --collection demo --embedder ollama
ragtune simulate --collection demo --queries ./benchmarks/hotpotqa-1k/queries.json
```

| Guide | Description |
|---|---|
| Concepts | RAG basics, metrics explained |
| CLI Reference | All commands and flags |
| Quickstart | Step-by-step setup guide |
| Benchmarking Guide | Scale testing, runtimes |
| Deployment Patterns | CI/CD, production |
| FAQ | Common questions |
| Troubleshooting | Common issues and fixes |
Contributions welcome. Please open an issue first to discuss significant changes.
MIT
