Run inspect_ai evals through the Claude Code CLI — use your Claude Pro/Max/Team subscription instead of per-token API billing.
Based on UKGovernmentBEIS/inspect_ai#2986, extracted as a standalone pip-installable package.
Live dashboard: lucharo.github.io/eval-claude — GPQA Diamond results updated twice daily across 5 Claude models.
Model providers sometimes ship regressions (e.g. Anthropic's summer incident, GPT-4's 2023 decline). This package lets you run standard benchmarks at almost no extra cost using your existing Claude subscription, with no API keys or per-token billing.
uv add eval-claude
Or from source:
uv sync
This installs inspect_ai, inspect_evals (standard benchmark suite), and the claude-code/ provider.
- Claude Code CLI: `npm install -g @anthropic-ai/claude-code`
- Authenticated: `claude auth`
- Active Claude Pro/Max/Team subscription
No need to clone; run directly with uvx:
uvx --from eval-claude inspect eval inspect_evals/arc_easy --model claude-code/haiku --limit 5 -M max_connections=5
# Basic
uv run inspect eval inspect_evals/arc_easy --model claude-code/sonnet --limit 10
# Parallel (10 concurrent CLI processes)
uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/opus -M max_connections=10
# Extended thinking
uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/sonnet -M thinking_level=ultrathink
# Let Claude Code pick the default model
uv run inspect eval task.py --model claude-code/default
All benchmarks run on a ThinkPad with Claude Code CLI v2.0.76, February 2026.
╭──────────────────────────────────────────────────────────────────────────────╮
│arc_easy (5 samples): claude-code/haiku │
╰──────────────────────────────────────────────────────────────────────────────╯
total time: 0:00:25
choice
accuracy 1.000
stderr 0.000
for model in haiku sonnet opus; do
uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/$model --limit 50 -M max_connections=10
done
Haiku 4.5 (61.5% +/- 5.9%, 12:42)
╭──────────────────────────────────────────────────────────────────────────────╮
│gpqa_diamond (50 x 4 samples): claude-code/haiku │
╰──────────────────────────────────────────────────────────────────────────────╯
total time: 0:12:42
choice
accuracy 0.615
stderr 0.059
Sonnet 4.5 (78.5% +/- 4.8%, 13:20)
╭──────────────────────────────────────────────────────────────────────────────╮
│gpqa_diamond (50 x 4 samples): claude-code/sonnet │
╰──────────────────────────────────────────────────────────────────────────────╯
total time: 0:13:20
choice
accuracy 0.785
stderr 0.048
Opus 4.5 (86.0% +/- 4.5%, 14:13)
╭──────────────────────────────────────────────────────────────────────────────╮
│gpqa_diamond (50 x 4 samples): claude-code/opus │
╰──────────────────────────────────────────────────────────────────────────────╯
total time: 0:14:13
choice
accuracy 0.860
stderr 0.045
Results show the expected model ranking: Haiku < Sonnet < Opus.
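As a rough check on how firmly the runs above support that ranking, the gaps can be compared against the reported standard errors. This is a minimal sketch: the `z_score` helper is hypothetical, and it treats the runs as independent, which is only an approximation because every model answered the same questions.

```python
import math

# (accuracy, stderr) pairs copied from the runs above.
results = {
    "haiku": (0.615, 0.059),
    "sonnet": (0.785, 0.048),
    "opus": (0.860, 0.045),
}


def z_score(lower: str, higher: str) -> float:
    """Accuracy gap in units of the combined standard error.

    Assumes independent runs; a paired analysis on per-question
    results would be more precise.
    """
    acc_lo, se_lo = results[lower]
    acc_hi, se_hi = results[higher]
    return (acc_hi - acc_lo) / math.sqrt(se_lo**2 + se_hi**2)


for lower, higher in [("haiku", "sonnet"), ("sonnet", "opus")]:
    print(f"{lower} -> {higher}: z = {z_score(lower, higher):.2f}")
```

Under that approximation the Haiku to Sonnet gap is a little over two combined standard errors, while Sonnet to Opus is closer to one, so at `--limit 50` the second gap is suggestive rather than conclusive.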
| Feature | `anthropic/` | `claude-code/` |
|---|---|---|
| Billing | Per-token API | Subscription (Pro/Max/Team) |
| Tool/function calling | Full support | Not supported |
| Vision/images | Yes | No |
| Streaming | Yes | No |
| Concurrent requests | Configurable | Via max_connections |
| Extended thinking | Fine-grained (up to 200k tokens) | Coarse-grained via thinking_level |
| Token usage | Real counts | Real counts |
| Cost tracking | Via API | From CLI JSON |
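The "Concurrent requests" row says concurrency is set via `max_connections`, which the usage examples describe as a number of concurrent CLI processes. One common way to cap subprocess concurrency is a semaphore; the sketch below illustrates that general pattern and is not taken from this package's source. The name `run_bounded` is hypothetical.

```python
import asyncio


async def run_bounded(commands, max_connections=1):
    """Run argv lists as subprocesses, at most max_connections at a time.

    Returns (returncode, stdout_text) tuples in the same order as commands.
    """
    gate = asyncio.Semaphore(max_connections)

    async def run_one(argv):
        # The semaphore blocks here once max_connections processes are live.
        async with gate:
            proc = await asyncio.create_subprocess_exec(
                *argv,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, _ = await proc.communicate()
            return proc.returncode, stdout.decode()

    # gather preserves input order even though completion order may differ.
    return await asyncio.gather(*(run_one(argv) for argv in commands))
```

A call such as `asyncio.run(run_bounded(list_of_argv, max_connections=10))` would then keep at most ten processes alive at once.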
| Arg | Default | Description |
|---|---|---|
| `skip_permissions` | `True` | Skip permission prompts (`--dangerously-skip-permissions`) |
| `timeout` | `300` | CLI timeout in seconds |
| `max_connections` | `1` | Concurrent CLI processes |
| `thinking_level` | `"none"` | `"none"`, `"think"` (~4k tokens), `"megathink"` (~10k), `"ultrathink"` (~32k) |
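When scripting many runs, it can help to assemble the command line programmatically. The sketch below follows the `-M key=value` form used in the usage examples; `build_eval_command` is a hypothetical helper, not part of this package or of inspect_ai.

```python
def build_eval_command(task, model, limit=None, **model_args):
    """Assemble an `inspect eval` argv list for the claude-code/ provider.

    Each keyword in model_args becomes one `-M key=value` pair.
    """
    argv = ["inspect", "eval", task, "--model", f"claude-code/{model}"]
    if limit is not None:
        argv += ["--limit", str(limit)]
    for key, value in model_args.items():
        argv += ["-M", f"{key}={value}"]
    return argv


cmd = build_eval_command(
    "inspect_evals/gpqa_diamond",
    "sonnet",
    limit=50,
    max_connections=10,
    thinking_level="ultrathink",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.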
The CLI uses magic words to trigger thinking budgets (Simon Willison's blog):
| `thinking_level` | Approx. tokens |
|---|---|
| `none` (default) | 0 |
| `think` | ~4,000 |
| `megathink` | ~10,000 |
| `ultrathink` | ~32,000 |
This is less granular than the API's `budget_tokens` parameter (up to 200k).
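If you are porting a config that used a numeric thinking budget, one option is to round down to the nearest level. This is a sketch using the approximate token counts from the table above; `level_for_budget` is a hypothetical helper and the mapping is only as accurate as those estimates.

```python
# Approximate budgets from the table above.
THINKING_TOKENS = {
    "none": 0,
    "think": 4_000,
    "megathink": 10_000,
    "ultrathink": 32_000,
}


def level_for_budget(budget_tokens: int) -> str:
    """Return the largest thinking_level whose approximate budget fits."""
    fitting = [
        level for level, tokens in THINKING_TOKENS.items()
        if tokens <= budget_tokens
    ]
    if not fitting:
        return "none"
    return max(fitting, key=THINKING_TOKENS.__getitem__)


print(level_for_budget(16_000))
```

Budgets above ~32,000 tokens all map to `ultrathink`, since the CLI levels stop there.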
Model names are passed directly to the CLI, which accepts aliases (`sonnet`, `opus`, `haiku`) and full model IDs (`claude-sonnet-4-5-20250929`). Use `--model claude-code/default` to let Claude Code choose its default model.
- `CLAUDE_CODE_COMMAND`: override the CLI path (default: `claude` from PATH)
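The lookup order described above (environment override first, then `claude` on PATH) could be sketched as follows. `find_claude_cli` is a hypothetical name, and the provider's real discovery logic may differ in details such as error handling.

```python
import os
import shutil


def find_claude_cli() -> str:
    """Resolve the CLI path: CLAUDE_CODE_COMMAND first, then `claude` on PATH."""
    override = os.environ.get("CLAUDE_CODE_COMMAND")
    if override:
        return override
    found = shutil.which("claude")
    if found is None:
        raise FileNotFoundError(
            "claude not found on PATH; set CLAUDE_CODE_COMMAND to its location"
        )
    return found
```

Setting `CLAUDE_CODE_COMMAND` is mainly useful when the CLI lives outside PATH, for example under a version manager's directory.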
- CLI discovery: Supports `CLAUDE_CODE_COMMAND` env var for custom paths
- Model names: Passed directly to the CLI, which handles aliases and full model IDs natively
- Token usage: Extracted from the CLI's JSON output (`--output-format json`)
- Cost & timing: Extracted from CLI JSON (`total_cost_usd`, `duration_ms`, `duration_api_ms`, `session_id`)
- Tools disabled: Uses `--tools ""` to disable Claude Code's built-in tools for clean eval responses
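To illustrate the cost and timing extraction listed above, here is a minimal sketch that reads the four named fields from a JSON string. Only the field names come from this README; the sample payload is made up for illustration, the real CLI output contains more than this, and `summarize_cli_result` is a hypothetical helper.

```python
import json


def summarize_cli_result(raw: str) -> dict:
    """Pull cost and timing fields out of a CLI JSON result string."""
    payload = json.loads(raw)
    return {
        "cost_usd": payload.get("total_cost_usd"),
        # Durations are reported in milliseconds; convert to seconds.
        "wall_seconds": payload.get("duration_ms", 0) / 1000,
        "api_seconds": payload.get("duration_api_ms", 0) / 1000,
        "session_id": payload.get("session_id"),
    }


# Made-up payload for illustration only; not real CLI output.
sample = (
    '{"total_cost_usd": 0.0123, "duration_ms": 2500, '
    '"duration_api_ms": 2100, "session_id": "example-session"}'
)
print(summarize_cli_result(sample))
```

Using `.get` keeps the sketch tolerant of missing fields, which seems prudent given that the output format may change between CLI versions.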
uv sync --extra dev
uv run pytest tests/ -v
MIT