A small, realistic LLM evaluation lab you can clone and run locally.
This repo simulates the kind of internal eval pipeline a team would use to answer:
“Which model should we use for this task, and how is it changing over time?”
It focuses on reasoning-style questions (e.g. basic statistics and numeracy) and compares multiple providers (OpenAI, Anthropic, Google) end-to-end:
- Synthesize a dataset of questions using an LLM.
- Run multiple models on that dataset.
- Store everything in DuckDB under a `raw` schema.
- Transform with dbt into an `analytics` schema.
- Compute metrics (F1, cosine similarity, token usage, latency).
- Explore results with SQL + charts.
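As a concrete illustration of the metrics step, token-level F1 between a model answer and a reference can be computed roughly like this (a minimal sketch; `token_f1` and the whitespace tokenization are assumptions for illustration, not the repo's actual implementation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    whitespace tokens, counting overlapping tokens with multiplicity."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty scores 0.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```
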
- Provider-agnostic runners
  - `OpenAIRunner`, `ClaudeRunner`, and `GeminiRunner` implement a common `ModelRunner` protocol.
- LLM-generated evaluation datasets
  - `SynthesizerModel` uses structured output (Pydantic) to generate JSONL datasets with metadata.
- Warehouse-style storage
  - A DuckDB database with `raw` and `analytics` schemas.
  - dbt-duckdb for transforms, joins, and metrics.
- Task + metrics for reasoning
  - A reasoning/statistics Q&A task.
  - Token-level F1, cosine similarity, token usage, latency.
- Incremental analytics
  - dbt models materialized incrementally, so new runs append cleanly.
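The common `ModelRunner` protocol mentioned above might look like the following sketch, using `typing.Protocol` for structural typing (the method names and the `RunResult` fields here are assumptions for illustration, not the repo's actual interface):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RunResult:
    # Hypothetical result shape: the answer text plus the usage and
    # latency fields that the downstream metrics consume.
    answer: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

class ModelRunner(Protocol):
    """Shared interface a provider runner would implement."""
    model_name: str

    def run(self, question: str) -> RunResult: ...

class EchoRunner:
    """Toy runner for exercising the pipeline without API calls.
    Satisfies ModelRunner structurally (no inheritance needed)."""
    model_name = "echo"

    def run(self, question: str) -> RunResult:
        n = len(question.split())
        return RunResult(answer=question, prompt_tokens=n,
                         completion_tokens=n, latency_s=0.0)
```

Because `Protocol` uses structural subtyping, each provider runner only needs matching attributes and a `run` method; no shared base class is required.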

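For the cosine-similarity metric listed above, one simple formulation is bag-of-words cosine over the two answer strings (a sketch under that assumption; the repo may instead compare embedding vectors):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings:
    dot(va, vb) / (|va| * |vb|), with 0.0 when either vector is empty."""
    va = Counter(a.lower().split())
    vb = Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Unlike token-level F1, this score is order-insensitive and rewards repeated shared vocabulary, which makes it a useful complementary signal for free-form answers.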