Name
.gitkeep
01_local_metrics.py
02_llm_as_judge.py
03_rag_evaluation.py
04_guardrails.py
05_streaming.py
06_autoeval.py
07_otel_tracing.py
09_multimodal_judge.py
README.md
feedback_loop_demo.py

fi-evals Cookbooks

Each cookbook solves a real problem you'll face when building AI applications.

| # | Cookbook | Problem It Solves | API Keys? |
|---|----------|-------------------|-----------|
| 01 | Catch a Hallucinating Medical Chatbot | Your chatbot makes up dosages and contradicts source material | No |
| 02 | When Heuristics Aren't Enough: LLM-as-Judge | Local metrics miss paraphrases; use Gemini to judge accuracy | Yes (`GOOGLE_API_KEY`) |
| 03 | Is Your RAG Pipeline Lying to Users? | Figure out WHERE your RAG fails: retrieval or generation? | No (optional for augmented) |
| 04 | Protect Your LLM from Prompt Injection | Block jailbreaks, SQL injection, PII leaks, secret exposure | No |
| 05 | Stop Toxic Output Mid-Stream | Cut off LLM output the instant it turns toxic or off-topic | No |
| 06 | Auto-Configure Your Testing Pipeline | "What should we test?" Describe your app, get a pipeline | No |
| 07 | See Every LLM Call in Your Observability Stack | Trace calls with quality scores in Jaeger/Datadog/Grafana | No |
| 08 | Teach Your Judge from Past Mistakes | LLM judge keeps getting the same cases wrong; fix it with feedback | Yes (`GOOGLE_API_KEY`) |
| 09 | Judge Images and Audio with Your LLM | Verify AI image descriptions match the actual photo | Yes (`GOOGLE_API_KEY`) |

Quick Start

cd python

# Run any cookbook (no API keys needed for 01, 03-07)
uv run python -m examples.01_local_metrics

# For cookbooks that need an LLM (02, 08, 09)
export GOOGLE_API_KEY=your-key
uv run python -m examples.02_llm_as_judge
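Before launching a cookbook that calls Gemini, it can help to confirm the key is actually set. The sketch below is a hypothetical preflight helper, not part of fi-evals: `KEY_REQUIRED` and `missing_key` are illustrative names, and the set of cookbook numbers is simply read off the "API Keys?" column in the table above.

```python
import os
from typing import Mapping

# Assumption: cookbooks marked "Yes (`GOOGLE_API_KEY`)" in the table above.
KEY_REQUIRED = frozenset({"02", "08", "09"})


def missing_key(cookbook: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True if this cookbook needs GOOGLE_API_KEY and it is unset or empty."""
    return cookbook in KEY_REQUIRED and not env.get("GOOGLE_API_KEY")


if __name__ == "__main__":
    for number in ("01", "02", "08", "09"):
        status = "set GOOGLE_API_KEY first" if missing_key(number) else "ready to run"
        print(f"Cookbook {number}: {status}")
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real shell environment.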

What You'll Learn

Cookbook 01: Build a validation layer that catches hallucinations, wrong dosages, and contradictions, all locally in under 1 second
Cookbook 02: When local heuristics fail on paraphrases, use an LLM judge with `augment=True` for production-grade accuracy
Cookbook 03: Diagnose RAG failures by measuring retrieval quality (recall, precision) separately from generation quality (faithfulness, groundedness)
Cookbook 04: Build a <10ms security middleware that blocks jailbreaks, code injection, PII exposure, and secret leaks
Cookbook 05: Monitor streaming LLM output token-by-token and kill the stream when safety thresholds are breached
Cookbook 06: Auto-generate test pipelines from app descriptions, customize thresholds, export YAML for CI/CD
Cookbook 07: Wire quality scores into your OTEL traces so you can search for bad responses in Jaeger/Datadog
Cookbook 08: Store developer corrections in ChromaDB, retrieve them as few-shot examples, and teach your LLM judge not to repeat mistakes
Cookbook 09: Pass image and audio URLs to the LLM judge to evaluate image descriptions, UI screenshots, and transcriptions with Gemini vision
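
To make the retrieval-versus-generation split from Cookbook 03 concrete, here is a minimal sketch in plain Python. It does not use the fi-evals API: `retrieval_precision_recall` and `diagnose` are hypothetical helper names, the 0.7 threshold is an arbitrary illustration, and the faithfulness score is assumed to come from whatever generation metric you already run.

```python
from __future__ import annotations

from typing import Iterable


def retrieval_precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> tuple[float, float]:
    """Return (precision, recall) for one query, comparing chunk IDs as sets."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved chunks that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant chunks that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def diagnose(recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Point at the stage most likely at fault (threshold is an assumption)."""
    if recall < threshold:
        return "retrieval"   # the answer never had the right context
    if faithfulness < threshold:
        return "generation"  # context was there, but the answer strayed from it
    return "ok"


if __name__ == "__main__":
    precision, recall = retrieval_precision_recall(
        retrieved=["a", "b", "c", "d"], relevant=["a", "b", "e"]
    )
    print(f"precision={precision:.2f} recall={recall:.2f}")
    print(diagnose(recall, faithfulness=0.9))
```

Checking recall first reflects the idea that a generator cannot be faithful to context it never received, so a retrieval miss is reported before any generation problem.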
