dunkeln/llm-evals-lab

LLM Evals Lab

A small, realistic LLM evaluation lab you can clone and run locally.

This repo simulates the kind of internal eval pipeline a team would use to answer:

“Which model should we use for this task, and how is it changing over time?”

It focuses on reasoning-style questions (e.g. basic statistics / numeracy), and compares multiple providers (OpenAI, Anthropic, Google) end-to-end:

  1. Synthesize a dataset of questions using an LLM.
  2. Run multiple models on that dataset.
  3. Store everything in DuckDB in a raw schema.
  4. Transform with dbt into an analytics schema.
  5. Compute metrics (F1, cosine similarity, token usage, latency).
  6. Explore results with SQL + charts.
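As a rough illustration of the scoring step, token-level F1 can be computed as the harmonic mean of precision and recall over the multiset overlap of tokens. This is a minimal sketch assuming simple lowercase whitespace tokenization, not the repo's actual implementation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the
    multiset overlap of whitespace-split, lowercased tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the mean is 5", "mean is 5"))  # ≈ 0.857
```

Because both precision and recall are computed against the token multisets, verbose answers that contain the reference are penalized only on precision, which is the usual behavior for this metric.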

Features

  • Provider-agnostic runners: OpenAIRunner, ClaudeRunner, and GeminiRunner behind a common ModelRunner protocol.
  • LLM-generated evaluation datasets: SynthesizerModel uses structured output (Pydantic) to generate JSONL datasets with metadata.
  • Warehouse-style storage: a DuckDB database with raw and analytics schemas, with dbt-duckdb for transforms, joins, and metrics.
  • Reasoning task + metrics: a reasoning/statistics Q&A task scored with token-level F1, cosine similarity, token usage, and latency.
  • Incremental analytics: dbt models are materialized incrementally, so new runs append cleanly.
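The provider-agnostic runner design can be sketched with a structural `typing.Protocol`: any class with a matching `run` method satisfies `ModelRunner` without inheriting from it. The `Completion` and `EchoRunner` names below are illustrative stand-ins, not the repo's actual classes; only the `ModelRunner` / `OpenAIRunner` naming comes from the README:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

@dataclass
class Completion:
    """One model response plus the bookkeeping the metrics need."""
    text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float

@runtime_checkable
class ModelRunner(Protocol):
    """Common interface each provider runner would implement."""
    def run(self, prompt: str) -> Completion: ...

class EchoRunner:
    """Toy runner standing in for a real provider client."""
    def run(self, prompt: str) -> Completion:
        return Completion(text=prompt, input_tokens=0, output_tokens=0, latency_ms=0.0)

# Structural typing: EchoRunner never subclasses ModelRunner,
# yet isinstance succeeds because the shape matches.
assert isinstance(EchoRunner(), ModelRunner)
```

A protocol (rather than an abstract base class) keeps each provider runner free of a shared dependency: the eval loop can iterate over a list of heterogeneous runners and call `run` on each.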
