Each cookbook solves a real problem you'll face when building AI applications.
| # | Cookbook | Problem It Solves | API Keys? |
|---|---|---|---|
| 01 | Catch a Hallucinating Medical Chatbot | Your chatbot makes up dosages and contradicts source material | No |
| 02 | When Heuristics Aren't Enough: LLM-as-Judge | Local metrics miss paraphrases — use Gemini to judge accuracy | Yes (GOOGLE_API_KEY) |
| 03 | Is Your RAG Pipeline Lying to Users? | Figure out WHERE your RAG fails: retrieval or generation? | No (optional for augmented) |
| 04 | Protect Your LLM from Prompt Injection | Block jailbreaks, SQL injection, PII leaks, secret exposure | No |
| 05 | Stop Toxic Output Mid-Stream | Cut off LLM output the instant it turns toxic or off-topic | No |
| 06 | Auto-Configure Your Testing Pipeline | "What should we test?" — describe your app, get a pipeline | No |
| 07 | See Every LLM Call in Your Observability Stack | Trace calls with quality scores in Jaeger/Datadog/Grafana | No |
| 08 | Teach Your Judge from Past Mistakes | LLM judge keeps getting the same cases wrong — fix it with feedback | Yes (GOOGLE_API_KEY) |
| 09 | Judge Images and Audio with Your LLM | Verify AI image descriptions match the actual photo | Yes (GOOGLE_API_KEY) |

```bash
cd python

# Run any cookbook (no API keys needed for 01, 03-07)
uv run python -m examples.01_local_metrics

# For cookbooks that need an LLM (02, 08, 09)
export GOOGLE_API_KEY=your-key
uv run python -m examples.02_llm_as_judge
```

- Cookbook 01: Build a validation layer that catches hallucinations, wrong dosages, and contradictions, all locally in <1 second
- Cookbook 02: When local heuristics fail on paraphrases, use an LLM judge with `augment=True` for production-grade accuracy
- Cookbook 03: Diagnose RAG failures by measuring retrieval quality (recall, precision) separately from generation quality (faithfulness, groundedness)
- Cookbook 04: Build a <10ms security middleware that blocks jailbreaks, code injection, PII exposure, and secret leaks
- Cookbook 05: Monitor streaming LLM output token-by-token and kill the stream when safety thresholds are breached
- Cookbook 06: Auto-generate test pipelines from app descriptions, customize thresholds, export YAML for CI/CD
- Cookbook 07: Wire quality scores into your OTEL traces so you can search for bad responses in Jaeger/Datadog
- Cookbook 08: Store developer corrections in ChromaDB, retrieve them as few-shot examples, and teach your LLM judge not to repeat mistakes
- Cookbook 09: Pass image and audio URLs to the LLM judge to evaluate image descriptions, UI screenshots, and transcriptions with Gemini vision
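
To make the retrieval-side metrics named for Cookbook 03 concrete, here is a minimal standalone sketch of precision and recall over retrieved chunks. It does not use the cookbook's own API: the function name `retrieval_scores` and the chunk IDs are hypothetical, and the cookbook may define or weight these metrics differently.

```python
from typing import Iterable, Tuple


def retrieval_scores(
    retrieved_ids: Iterable[str], relevant_ids: Iterable[str]
) -> Tuple[float, float]:
    """Return (precision, recall) for a single query.

    Hypothetical helper for illustration only; not part of the cookbook library.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)  # relevant chunks that were actually retrieved

    # Precision: what fraction of the retrieved chunks are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant chunks were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Assumed example data: 3 chunks retrieved, 4 chunks labeled relevant.
    precision, recall = retrieval_scores(
        ["c1", "c2", "c7"], ["c1", "c2", "c3", "c4"]
    )
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scoring retrieval on its own like this is what lets you separate the two failure modes: low recall suggests the retriever never surfaced the needed context, while good retrieval scores paired with an unsupported answer point at the generation step.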