eval-claude

Run inspect_ai evals through the Claude Code CLI — use your Claude Pro/Max/Team subscription instead of per-token API billing.

Based on UKGovernmentBEIS/inspect_ai#2986, extracted as a standalone pip-installable package.

Live dashboard: lucharo.github.io/eval-claude — GPQA Diamond results updated twice daily across 5 Claude models.

Why?

Model providers sometimes ship regressions (e.g. Anthropic's summer incident, GPT-4's 2023 decline). This package lets you run standard benchmarks for virtually free using your existing Claude subscription — no API keys or per-token billing needed.

Install

uv add eval-claude

Or from source:

uv sync

This installs inspect_ai, inspect_evals (standard benchmark suite), and the claude-code/ provider.

Prerequisites

  • Claude Code CLI: npm install -g @anthropic-ai/claude-code
  • Authenticated: claude auth
  • Active Claude Pro/Max/Team subscription

Quick start

No need to clone — run directly with uvx:

uvx --from eval-claude inspect eval inspect_evals/arc_easy --model claude-code/haiku --limit 5 -M max_connections=5

Usage

# Basic
uv run inspect eval inspect_evals/arc_easy --model claude-code/sonnet --limit 10

# Parallel (10 concurrent CLI processes)
uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/opus -M max_connections=10

# Extended thinking
uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/sonnet -M thinking_level=ultrathink

# Let Claude Code pick the default model
uv run inspect eval task.py --model claude-code/default

Benchmark results

All benchmarks run on a ThinkPad with Claude Code CLI v2.0.76, February 2026.

ARC Easy (5 samples)

╭──────────────────────────────────────────────────────────────────────────────╮
│arc_easy (5 samples): claude-code/haiku                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
total time:                    0:00:25

choice
accuracy  1.000
stderr    0.000

GPQA Diamond (50 samples, all models)

for model in haiku sonnet opus; do
  uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/$model --limit 50 -M max_connections=10
done

Haiku 4.5 (61.5% ± 5.9%, 12:42)

╭──────────────────────────────────────────────────────────────────────────────╮
│gpqa_diamond (50 x 4 samples): claude-code/haiku                              │
╰──────────────────────────────────────────────────────────────────────────────╯
total time:                 0:12:42

choice
accuracy  0.615
stderr    0.059

Sonnet 4.5 (78.5% ± 4.8%, 13:20)

╭──────────────────────────────────────────────────────────────────────────────╮
│gpqa_diamond (50 x 4 samples): claude-code/sonnet                             │
╰──────────────────────────────────────────────────────────────────────────────╯
total time:                  0:13:20

choice
accuracy  0.785
stderr    0.048

Opus 4.5 (86.0% ± 4.5%, 14:13)

╭──────────────────────────────────────────────────────────────────────────────╮
│gpqa_diamond (50 x 4 samples): claude-code/opus                               │
╰──────────────────────────────────────────────────────────────────────────────╯
total time:                0:14:13

choice
accuracy  0.860
stderr    0.045

Results show the expected model ranking: Haiku < Sonnet < Opus.

Provider comparison: anthropic/ vs claude-code/

| Feature | anthropic/ | claude-code/ |
|---|---|---|
| Billing | Per-token API | Subscription (Pro/Max/Team) |
| Tool/function calling | Full support | Not supported |
| Vision/images | Yes | No |
| Streaming | Yes | No |
| Concurrent requests | Configurable | Via max_connections |
| Extended thinking | Fine-grained (up to 200k tokens) | Coarse-grained via thinking_level |
| Token usage | Real counts | Real counts |
| Cost tracking | Via API | From CLI JSON |

Model args

| Arg | Default | Description |
|---|---|---|
| skip_permissions | True | Skip permission prompts (--dangerously-skip-permissions) |
| timeout | 300 | CLI timeout in seconds |
| max_connections | 1 | Concurrent CLI processes |
| thinking_level | "none" | "none", "think" (~4k tokens), "megathink" (~10k), "ultrathink" (~32k) |
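To make the mapping from model args to CLI flags concrete, here is a minimal sketch of how an invocation could be assembled. The flags --dangerously-skip-permissions, --output-format json, and the claude command come from this README; the --print (non-interactive) flag is the CLI's standard one-shot mode; the function itself is illustrative, not the package's actual code.

```python
# Illustrative sketch: translate provider model args into a claude CLI
# command line. Flag names match those documented above; the function
# name and argument order are hypothetical.

def build_claude_cmd(model: str, prompt: str,
                     skip_permissions: bool = True) -> list[str]:
    cmd = ["claude", "--print", "--model", model, "--output-format", "json"]
    if skip_permissions:
        # Maps to the skip_permissions model arg (default True)
        cmd.append("--dangerously-skip-permissions")
    cmd.append(prompt)
    return cmd

cmd = build_claude_cmd("sonnet", "What is 2+2?")
# The timeout arg (default 300 s) would then bound the subprocess, e.g.:
# subprocess.run(cmd, capture_output=True, text=True, timeout=300)
```

max_connections is handled by inspect_ai itself, which caps how many such processes run concurrently.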

Extended thinking

The CLI uses magic words to trigger thinking budgets (Simon Willison's blog):

| thinking_level | Approx. tokens |
|---|---|
| none (default) | 0 |
| think | ~4,000 |
| megathink | ~10,000 |
| ultrathink | ~32,000 |

Less granular than the API's budget_tokens parameter (up to 200k).
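Since the budgets are keyed off trigger words in the prompt, a provider can implement thinking_level by prefixing the user prompt. The sketch below assumes that mechanism; the function and dict are hypothetical, with the level-to-budget mapping taken from the table above.

```python
# Hypothetical sketch: map thinking_level to the CLI's magic trigger
# words by prepending them to the prompt. Budgets per the table above.
THINKING_TRIGGERS = {
    "none": None,          # no extended thinking
    "think": "think",      # ~4,000 tokens
    "megathink": "megathink",    # ~10,000 tokens
    "ultrathink": "ultrathink",  # ~32,000 tokens
}

def apply_thinking_level(prompt: str, level: str = "none") -> str:
    trigger = THINKING_TRIGGERS[level]
    return prompt if trigger is None else f"{trigger}\n\n{prompt}"
```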

Model names

Passed directly to the CLI. Accepts aliases (sonnet, opus, haiku) and full model IDs (claude-sonnet-4-5-20250929). Use --model claude-code/default to let Claude Code choose its default model.

Environment

  • CLAUDE_CODE_COMMAND — override the CLI path (default: claude from PATH)

Implementation notes

  • CLI discovery: Supports CLAUDE_CODE_COMMAND env var for custom paths
  • Model names: Passed directly to the CLI — it handles aliases and full model IDs natively
  • Token usage: Extracted from the CLI's JSON output (--output-format json)
  • Cost & timing: Extracted from CLI JSON (total_cost_usd, duration_ms, duration_api_ms, session_id)
  • Tools disabled: Uses --tools "" to disable Claude Code's built-in tools for clean eval responses
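The cost and timing extraction can be sketched as plain JSON parsing. The field names total_cost_usd, duration_ms, duration_api_ms, and session_id are the ones listed above; the "result" key holding the completion text is an assumption about the payload shape, and the sample values are made up.

```python
import json

# Sketch of extracting usage data from the CLI's JSON output
# (--output-format json). Sample payload is illustrative.
raw = '''{"result": "Paris", "total_cost_usd": 0.0,
          "duration_ms": 2150, "duration_api_ms": 1980,
          "session_id": "abc123"}'''

data = json.loads(raw)
completion = data.get("result", "")
cost_usd = data.get("total_cost_usd", 0.0)    # typically 0.0 on subscription billing
api_seconds = data.get("duration_api_ms", 0) / 1000
session_id = data.get("session_id")
```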

Development

uv sync --extra dev
uv run pytest tests/ -v

License

MIT
