
feat: AI-powered tool improvement suggestions via Claude Code Action#183

Open
LOCKhart07 wants to merge 16 commits into `main` from `feat/claude-suggest`

Conversation


**LOCKhart07** (Member) commented Apr 3, 2026

Summary

  • Adds a suggest-improvements job to the daily benchmark workflow that uses Claude Code Action to identify the worst-performing prediction tool, diagnose prompt issues, apply fixes, and open a PR
  • Human-in-the-loop: PRs opened by Claude are suggestions, not final — a reviewer must run cached replay to validate improvements before merging
  • New benchmark/prepare_suggestion.py pre-extracts target tool context (stats, calibration, overconfidence, platform breakdown, comparison tools) into a single file, saving Claude ~5 turns of discovery
  • benchmark/scorer.py enriched with parse_breakdown_by_tool and overconfidence_by_tool for richer diagnostics
  • benchmark/CLAUDE.md provides tool registry, scoring reference, data-backed improvement patterns (base-rate anchoring, tail discipline, overconfidence fixes), and guidelines for the AI agent
  • Graceful turn limit handling: if Claude completes all phases but hits the turn limit, the job still succeeds
  • Dedup check prevents duplicate PRs/issues for the same tool
  • Falls back to opening a GitHub issue when no concrete fix is found
  • suggest-improvements uses same-run scores.json (via actions/download-artifact@v4) so it always targets the correct worst tool
  • Prompt enforces branch creation (auto-improve/<tool-name>) — never commits directly to main
  • Prompt specifies the exact `Phases Completed: All` log format so the outcome check works reliably
  • Slack notification step has continue-on-error: true so failures don't block artifact uploads
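The graceful turn-limit handling can be sketched as a workflow fragment (step names, the action ref, and the log path here are illustrative, not the exact workflow contents):

```yaml
- name: Run Claude Code Action
  id: claude
  continue-on-error: true   # turn-limit failures are re-checked below
  uses: anthropics/claude-code-action@v1   # illustrative version ref

- name: Verify outcome
  run: |
    # Succeed when Claude reported finishing every phase, even if the
    # step above failed by exhausting its turn budget.
    grep -q "Phases Completed: All" suggestion_log.txt
```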

Design Decisions

  • Claude Code Action over a lighter script: Prompt improvement requires reading multiple tool source files, comparing patterns across tools, and making judgment calls about what to change — tasks that benefit from an LLM agent with code access. A rule-based script can't diagnose "weak reasoning scaffolding" or "missing bias correction."
  • Human-in-the-loop over auto-merge: MVP approach to save CC turns. Claude opens a PR with its diagnosis and fix; a human reviewer validates via cached replay before merging. This avoids the cost of running replay in CI on every suggestion, while keeping a safety gate on prompt changes.
  • Pre-extracted context (prepare_suggestion.py): Without this, Claude spends ~5 turns discovering files, reading scores, and figuring out which tool to target. Pre-extracting saves turns (and API cost) by handing Claude everything it needs upfront.
  • One tool per run: Each daily run targets only the single worst-performing tool. This keeps runs focused, avoids conflicting changes, and makes PRs easy to review.
  • Defaults: suggestions are on by default for scheduled runs and off by default for manual dispatch. The repo variable ENABLE_TOOL_SUGGESTIONS acts as a kill switch if needed.
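As an illustration, the tool-selection core of the pre-extraction step amounts to something like the following. The `scores.json` schema and the `mean_brier` field name are assumptions for this sketch; the real shape produced by `benchmark/scorer.py` may differ.

```python
import json

def pick_worst_tool(scores_path="scores.json"):
    """Return the name of the worst-performing tool.

    Assumes scores.json maps tool names to per-tool stats containing a
    "mean_brier" field (an assumption; the real schema may differ).
    """
    with open(scores_path) as f:
        scores = json.load(f)
    # Lower Brier is better, so the worst tool has the maximum score.
    return max(scores, key=lambda tool: scores[tool]["mean_brier"])
```

Handing Claude the result of this selection (plus the pre-rendered stats in `suggestion_context.md`) is what saves the ~5 discovery turns mentioned above.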

Controls

  • run_suggestions workflow input toggle (defaults to off for manual dispatch)
  • ENABLE_TOOL_SUGGESTIONS repo variable kill switch for scheduled runs
  • BENCHMARK_ANTHROPIC_API_KEY secret required
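Combined, these controls could gate the job with a single condition along these lines (a sketch; the actual expression in the workflow may differ):

```yaml
suggest-improvements:
  if: >-
    (github.event_name == 'schedule' && vars.ENABLE_TOOL_SUGGESTIONS != 'false') ||
    (github.event_name == 'workflow_dispatch' && inputs.run_suggestions == 'true')
```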

Tested on a fork.

Test plan

  • Trigger manually with run_suggestions: true
  • Verify suggestion_log.txt artifact is uploaded
  • Check that Claude opens PR or issue with auto-improvement label
  • Reviewer runs cached replay on the PR to validate before merging
  • Verify dedup: re-run should skip if PR/issue already open
  • Test kill switch: set ENABLE_TOOL_SUGGESTIONS=false

🤖 Generated with Claude Code

LOCKhart07 and others added 10 commits April 3, 2026 19:08
…skip ci]

Adds a suggest-improvements job to the benchmark workflow that uses
Claude Code Action to identify the worst-performing prediction tool,
diagnose prompt issues, apply fixes, validate with cached replay, and
open a PR with before/after Brier scores.

Includes benchmark/CLAUDE.md with tool registry, scoring reference,
and improvement guidelines for the AI agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n [skip ci]

- Install via uv sync + autonomy packages sync (matches tournament job)
- Add tomte check-code as Phase 5 before committing
- Fair cached replay: baseline from existing tournament predictions,
  only modified prompt gets re-run on same markets
- Add [skip ci] to Claude's commit messages
- Add run_suggestions manual dispatch toggle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…kip ci]

- Phase 0 checks for existing open PRs/issues before creating duplicates
- Opens a GitHub issue instead of PR when no fix is found or replay
  shows no improvement
- Both use "auto-improvement" label for tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s [skip ci]

Remove cached replay execution from Claude's workflow (was hitting
turn limits). PR body now includes validation instructions for the
reviewer to run cached replay manually. Reduced max-turns to 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude now writes a summary of its actions to suggestion_log.txt,
uploaded as an artifact for debugging. Explicitly instructed not to
leak secrets in logs, PRs, or issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New benchmark/prepare_suggestion.py: identifies worst tool, checks for
  duplicate PRs/issues, writes suggestion_context.md with all stats,
  calibration, platform breakdown, and comparison tools in one file
- scorer.py: add parse_breakdown_by_tool and overconfidence_by_tool to
  scores.json for richer diagnostics
- Workflow: add prepare step before Claude, simplify prompt (Claude reads
  one context file instead of discovering everything), bump to 25 turns
  with graceful fallback at turn 20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scores.json uses " | " separator for tool×platform keys, not "/".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude step now uses continue-on-error with a follow-up check: if all
phases completed despite hitting the turn limit, the job succeeds.
Bumped max-turns from 25 to 30 with graceful fallback at turn 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@LOCKhart07 LOCKhart07 self-assigned this Apr 3, 2026
LOCKhart07 and others added 6 commits April 3, 2026 19:23
… ci]

Adds data-backed improvement patterns from PROMPT_IMPROVEMENT_PLAN.md:
base-rate anchoring, tail discipline, information barrier, absence-of-
evidence reasoning, and multi-stage skepticism. Notes that multi-stage
tools can't be partially tested via cached replay yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rompt [skip ci]

- Tell Claude to never commit directly to main, always create a new branch
- Require exact "Phases Completed: All" format in suggestion_log.txt so the
  outcome check grep works correctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from dawidd6/action-download-artifact (cross-run) to
actions/download-artifact (same-run) so suggest-improvements reads
the current run's scores.json instead of stale data from a prior run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents Slack/OpenAI failures from blocking artifact uploads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
langgraph-prebuilt 1.0.9 imports ExecutionInfo from langgraph.runtime
which doesn't exist in langgraph 1.0.10. Pinned in both pyproject.toml
and tox.ini to match the existing component.yaml pin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>