Debug, benchmark, and monitor your RAG retrieval layer. `EXPLAIN ANALYZE` for production RAG.
Quickstart • Commands • Why RagTune • Concepts • FAQ
| I want to... | Command |
|---|---|
| Debug a single query | `ragtune explain "my query" --collection prod` |
| Run batch evaluation | `ragtune simulate --collection prod --queries queries.json` |
| Get confidence intervals | `ragtune simulate --queries queries.json --bootstrap 20` |
| Set up CI/CD quality gates | `ragtune simulate --ci --min-recall 0.85` |
| Detect regressions | `ragtune simulate --baseline runs/latest.json --fail-on-regression` |
| Compare embedders | `ragtune compare --embedders ollama,openai --docs ./docs` |
| Evaluate external chunkers | `ragtune ingest ./chunks/ --collection test --pre-chunked` |
| Find missed answer content | `ragtune simulate --queries needles.json` (with needle annotations) |
| Quick health check | `ragtune audit --collection prod --queries queries.json` |
```bash
# 1. Start vector store
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

# 2. Ingest documents
ragtune ingest ./docs --collection my-docs --embedder ollama

# 3. Debug retrieval
ragtune explain "How do I reset my password?" --collection my-docs
```

No API keys needed with Ollama (runs locally).
Already chunked your documents with an external tool? Use `--pre-chunked` to ingest them as-is (one file per chunk, no re-splitting):
```bash
# Ingest pre-chunked data (each file = one embedding unit)
ragtune ingest ./poma-chunksets/ --collection poma-test --embedder ollama --pre-chunked

# Compare against naive chunking
ragtune ingest ./raw-docs/ --collection naive-test --embedder ollama --chunk-size 512

# Benchmark both
ragtune simulate --collection poma-test --queries queries.json --bootstrap 20
ragtune simulate --collection naive-test --queries queries.json --bootstrap 20
```

Skip Docker entirely. Use your existing database:
```bash
ragtune ingest ./docs --collection my-docs --embedder ollama \
  --store pgvector --pgvector-url postgres://user:pass@localhost/mydb

ragtune explain "How do I reset my password?" --collection my-docs \
  --store pgvector --pgvector-url postgres://user:pass@localhost/mydb
```

```bash
# Save queries as you debug
ragtune explain "How do I reset my password?" --collection my-docs --save
ragtune explain "What are the rate limits?" --collection my-docs --save

# Run evaluation once you have 20+ queries
ragtune simulate --collection my-docs --queries golden-queries.json
```

Each `--save` adds the query to `golden-queries.json`.
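The saved file should look roughly like the queries format shown in the needle example below (a sketch, not the guaranteed schema):

```bash
cat golden-queries.json
# {
#   "queries": [
#     {"id": "q1", "text": "How do I reset my password?",
#      "relevant_docs": ["docs/auth/password-reset.md"]},
#     ...
#   ]
# }
```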
Query: "How do I reset my password?"
[1] Score: 0.8934 | Source: docs/auth/password-reset.md
Text: To reset your password: 1. Click "Forgot Password"...
[2] Score: 0.8521 | Source: docs/auth/account-security.md
Text: Account Security ## Password Management...
DIAGNOSTICS
Score range: 0.7234 - 0.8934 (spread: 0.1700)
✓ Strong top match (>0.85): likely high-quality retrieval
```
Running 50 queries...

Recall@5: 0.82   MRR: 0.76   Coverage: 0.94
Latency: p50=45ms p95=120ms

FAILURES: 3 queries with Recall@5 = 0
✗ "How do I configure SSO?"
  Expected: [sso-guide.md], Retrieved: [api-keys.md...]
💡 Run ragtune explain "<query>" to debug
```
Recall@K tells you whether the right document was retrieved. But retrieval can surface the right document while still missing the specific paragraph that actually answers the question, especially with structured or legal text where the relevant content is scattered across sections.
NeedleCoverage@K checks whether specific text spans ("needles") required to answer a query are present in the retrieved chunks. Just add needles to your queries file:
```json
{
  "queries": [{
    "id": "gdpr_fines",
    "text": "What fines can be imposed under the GDPR?",
    "relevant_docs": ["gdpr.txt"],
    "needles": [
      {"text": "up to 20 000 000 EUR", "source": "Art 83(5)"},
      {"text": "up to 10 000 000 EUR", "source": "Art 83(4)"}
    ]
  }]
}
```

```bash
ragtune simulate --collection prod --queries needles.json --embedder ollama
```

```
Recall@5:           1.000   # Right doc? Yes, always.
NeedleCoverage@5:   0.280   # Right content? Only 28% of the time.
```
No new flags needed — if your queries have needles, the metric appears automatically. Queries without needles work exactly as before.
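To inspect a failing needle by hand, pair `explain` with a plain substring check (a rough manual approximation; the tool's internal matching may differ):

```bash
# Does any retrieved chunk contain the needle text verbatim?
ragtune explain "What fines can be imposed under the GDPR?" --collection prod \
  | grep -F "up to 20 000 000 EUR"
```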
| Command | Purpose |
|---|---|
| `ingest` | Load documents into vector store |
| `explain` | Debug retrieval for a single query |
| `simulate` | Batch benchmark with metrics, needle coverage, and CI mode |
| `compare` | Compare embedders or chunk sizes |
| `audit` | Quick health check (pass/fail) |
| `report` | Generate markdown reports |
| `import-queries` | Import queries from CSV/JSON |
See CLI Reference for all flags and options.
```yaml
# .github/workflows/rag-quality.yml
- name: RAG Quality Gate
  run: |
    ragtune ingest ./docs --collection ci-test --embedder ollama
    ragtune simulate --collection ci-test --queries tests/golden-queries.json \
      --ci --min-recall 0.85 --min-coverage 0.90 --max-latency-p95 500
```

Exit code 1 if thresholds fail. See `examples/github-actions.yml` for complete setup.
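The same gate works anywhere you can read an exit code, for example as a pre-push hook (a sketch relying only on the documented exit-code behavior):

```bash
ragtune simulate --collection ci-test --queries tests/golden-queries.json \
  --ci --min-recall 0.85 || { echo "RAG quality gate failed" >&2; exit 1; }
```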
Compare against a baseline to catch regressions before they reach production:
```bash
# Compare current run against baseline
ragtune simulate --collection prod --queries golden.json \
  --baseline runs/baseline.json --fail-on-regression
```

Output shows deltas for each metric:
```
BASELINE COMPARISON
Comparing against: 2026-01-15T12:00:00Z
─────────────────────────────────────────────────────────────
Recall@5:     0.900 → 0.850   ↓ 5.6%    (REGRESSED)
MRR:          0.800 → 0.820   ↑ 2.5%    (improved)
Coverage:     0.950 → 0.950   = 0.0%    (unchanged)
Latency p95:  100ms → 120ms   ↑ 20.0%   (REGRESSED)
─────────────────────────────────────────────────────────────
❌ REGRESSION DETECTED
The following metrics decreased: [Recall@5, Latency p95]
```
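To establish a baseline in the first place, run a simulation you trust and keep its run file (a sketch assuming `simulate` writes its most recent run to `runs/latest.json`, the file referenced in the command table above):

```bash
ragtune simulate --collection prod --queries golden.json
cp runs/latest.json runs/baseline.json   # promote the known-good run
```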
RAG retrieval is a configuration problem: chunk size, embedding model, index type, top-k. Most teams tune by intuition. RagTune provides the measurement layer to make these decisions empirically, using standard IR metrics (Recall@k, MRR, NDCG) on your actual data.
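In practice, a chunk-size decision reduces to two ingests and two benchmark runs, using only flags shown elsewhere in this README (per the command table, `compare` can automate similar sweeps):

```bash
# Same docs, same embedder, two chunk sizes
ragtune ingest ./docs --collection chunks-256 --embedder ollama --chunk-size 256
ragtune ingest ./docs --collection chunks-512 --embedder ollama --chunk-size 512

# Benchmark both with confidence intervals, then pick the winner
ragtune simulate --collection chunks-256 --queries golden-queries.json --bootstrap 20
ragtune simulate --collection chunks-512 --queries golden-queries.json --bootstrap 20
```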
| What Matters | Impact |
|---|---|
| Domain-appropriate chunking | 7%+ recall difference |
| Embedding model choice | 5% difference |
| Continuous monitoring | Catches data drift before users do |
RagTune focuses on retrieval debugging, monitoring, and benchmarking, not end-to-end answer evaluation.
| | RagTune | Ragas / DeepEval | misbahsy/RAGTune |
|---|---|---|---|
| Focus | Retrieval layer | Full pipeline | Full pipeline |
| LLM calls | None required | Required | Required |
| Interface | CLI (CI/CD-native) | Python library | Streamlit UI |
| Speed | Fast (embedding only) | Slow (LLM inference) | Slow |
| CI/CD | First-class | Manual setup | None |
Use RagTune when: debugging retrieval, enforcing CI/CD quality gates, comparing embedders, or running deterministic benchmarks.
Use other tools when: evaluating LLM answer quality, or when you need `answer_relevancy`-style metrics.
Retrieval failures are silent. No error, no exception. Just gradually worse answers.
- Users complaining about "wrong answers" but you can't reproduce it
- No idea if that embedding change made things better or worse
- Retrieval was "good" in dev, failing in production
- You added documents but answers got worse
- Can't tell if the LLM is hallucinating or retrieval is broken
If any of these sound familiar:
ragtune explain "the query that's failing" --collection prod# Homebrew (macOS/Linux)
brew install metawake/tap/ragtune
# Go Install
go install github.com/metawake/ragtune/cmd/ragtune@latest
# Or download binary from GitHub ReleasesPrerequisites: Docker (for Qdrant), Ollama or API key for embeddings.
| Embedder | Setup | Best For |
|---|---|---|
| `ollama` | Local, no API key | Development, privacy |
| `openai` | `OPENAI_API_KEY` | General purpose |
| `voyage` | `VOYAGE_API_KEY` | Legal, code (domain-tuned) |
| `cohere` | `COHERE_API_KEY` | Multilingual |
| `tei` | Docker container | High throughput |
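Switching embedders is just the `--embedder` flag plus the matching environment variable from the table above, e.g.:

```bash
export OPENAI_API_KEY=sk-...   # placeholder key
ragtune ingest ./docs --collection my-docs --embedder openai
```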
| Store | Setup |
|---|---|
| Qdrant (default) | `docker run -p 6333:6333 qdrant/qdrant` |
| pgvector | `--store pgvector --pgvector-url postgres://...` |
| Weaviate | `--store weaviate --weaviate-host localhost:8080` |
| Chroma | `--store chroma --chroma-url http://localhost:8000` |
| Pinecone | `--store pinecone --pinecone-host HOST` |
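Store flags compose with any command, just like the pgvector example earlier, e.g. against a local Chroma instance:

```bash
ragtune explain "How do I reset my password?" --collection my-docs \
  --store chroma --chroma-url http://localhost:8000
```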
| Dataset | Documents | Purpose |
|---|---|---|
| `data/` | 9 | Quick testing |
| `benchmarks/hotpotqa-1k/` | 398 | General knowledge |
| `benchmarks/casehold-500/` | 500 | Legal domain |
| `benchmarks/synthetic-50k/` | 50,000 | Scale testing |
```bash
# Try it
ragtune ingest ./benchmarks/hotpotqa-1k/corpus --collection demo --embedder ollama
ragtune simulate --collection demo --queries ./benchmarks/hotpotqa-1k/queries.json
```

| Guide | Description |
|---|---|
| Concepts | RAG basics, metrics explained |
| CLI Reference | All commands and flags |
| Quickstart | Step-by-step setup guide |
| Benchmarking Guide | Scale testing, runtimes |
| Deployment Patterns | CI/CD, production |
| FAQ | Common questions |
| Troubleshooting | Common issues and fixes |
Contributions welcome. Please open an issue first to discuss significant changes.
MIT
