A small, realistic LLM evaluation lab you can clone and run locally.
This repo simulates the kind of internal eval pipeline a team would use to answer:
“Which model should we use for this task, and how is it changing over time?”
It focuses on reasoning-style questions (e.g. basic statistics and numeracy) and compares multiple providers (OpenAI, Anthropic, Google) end-to-end:
- Synthesize a dataset of questions using an LLM.
- Run multiple models on that dataset.
- Store everything in DuckDB under a `raw` schema.
- Transform with dbt into an `analytics` schema.
- Compute metrics (F1, cosine similarity, token usage, latency).
- Explore results with SQL + charts.
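As a concrete illustration of the metrics step, token-level F1 between a model answer and a reference can be computed roughly like this (a minimal sketch; `token_f1` and the whitespace tokenization are assumptions for illustration, not the repo's actual implementation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    whitespace tokens, counting overlapping tokens with multiplicity."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty scores 0.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```
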
- Provider-agnostic runners
  - `OpenAIRunner`, `ClaudeRunner`, and `GeminiRunner` implement a common `ModelRunner` protocol.
- LLM-generated evaluation datasets
  - `SynthesizerModel` uses structured output (Pydantic) to generate JSONL datasets with metadata.
- Warehouse-style storage
  - A DuckDB database with `raw` and `analytics` schemas.
  - dbt-duckdb for transforms, joins, and metrics.
- Task + metrics for reasoning
  - A reasoning/statistics Q&A task.
  - Token-level F1, cosine similarity, token usage, latency.
- Incremental analytics
  - dbt models materialized incrementally, so new runs append cleanly.
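The common `ModelRunner` protocol mentioned above might look like the following sketch, using `typing.Protocol` for structural typing (the method names and the `RunResult` fields here are assumptions for illustration, not the repo's actual interface):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RunResult:
    # Hypothetical result shape: the answer text plus the usage and
    # latency fields that the downstream metrics consume.
    answer: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

class ModelRunner(Protocol):
    """Shared interface a provider runner would implement."""
    model_name: str

    def run(self, question: str) -> RunResult: ...

class EchoRunner:
    """Toy runner for exercising the pipeline without API calls.
    Satisfies ModelRunner structurally (no inheritance needed)."""
    model_name = "echo"

    def run(self, question: str) -> RunResult:
        n = len(question.split())
        return RunResult(answer=question, prompt_tokens=n,
                         completion_tokens=n, latency_s=0.0)
```

Because `Protocol` uses structural subtyping, each provider runner only needs matching attributes and a `run` method; no shared base class is required.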

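For the cosine-similarity metric listed above, one simple formulation is bag-of-words cosine over the two answer strings (a sketch under that assumption; the repo may instead compare embedding vectors):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings:
    dot(va, vb) / (|va| * |vb|), with 0.0 when either vector is empty."""
    va = Counter(a.lower().split())
    vb = Counter(b.lower().split())
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Unlike token-level F1, this score is order-insensitive and rewards repeated shared vocabulary, which makes it a useful complementary signal for free-form answers.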