feat: AI-powered tool improvement suggestions via Claude Code Action #183
Open
LOCKhart07 wants to merge 16 commits into main from
Conversation
…skip ci] Adds a suggest-improvements job to the benchmark workflow that uses Claude Code Action to identify the worst-performing prediction tool, diagnose prompt issues, apply fixes, validate with cached replay, and open a PR with before/after Brier scores. Includes benchmark/CLAUDE.md with tool registry, scoring reference, and improvement guidelines for the AI agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n [skip ci]

- Install via uv sync + autonomy packages sync (matches tournament job)
- Add tomte check-code as Phase 5 before committing
- Fair cached replay: baseline from existing tournament predictions, only modified prompt gets re-run on same markets
- Add [skip ci] to Claude's commit messages
- Add run_suggestions manual dispatch toggle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…kip ci]

- Phase 0 checks for existing open PRs/issues before creating duplicates
- Opens a GitHub issue instead of a PR when no fix is found or replay shows no improvement
- Both use the "auto-improvement" label for tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s [skip ci] Remove cached replay execution from Claude's workflow (was hitting turn limits). PR body now includes validation instructions for the reviewer to run cached replay manually. Reduced max-turns to 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude now writes a summary of its actions to suggestion_log.txt, uploaded as an artifact for debugging. Explicitly instructed not to leak secrets in logs, PRs, or issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New benchmark/prepare_suggestion.py: identifies the worst tool, checks for duplicate PRs/issues, writes suggestion_context.md with all stats, calibration, platform breakdown, and comparison tools in one file
- scorer.py: add parse_breakdown_by_tool and overconfidence_by_tool to scores.json for richer diagnostics
- Workflow: add prepare step before Claude, simplify prompt (Claude reads one context file instead of discovering everything), bump to 25 turns with graceful fallback at turn 20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
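The worst-tool selection that `prepare_suggestion.py` performs could be sketched roughly as below. This is a minimal sketch: the `scores.json` layout and the `breakdown_by_tool` key are assumptions for illustration, not the file's actual schema.

```python
import json


def worst_tool(scores_path: str = "scores.json") -> str:
    """Return the tool with the highest (worst) mean Brier score.

    Assumes an illustrative scores.json layout:
        {"breakdown_by_tool": {"<tool>": {"brier": 0.21, "n": 40}, ...}}
    The real file produced by benchmark/scorer.py may differ.
    """
    with open(scores_path) as f:
        scores = json.load(f)
    by_tool = scores["breakdown_by_tool"]
    # Higher Brier means worse calibration, so max() picks the worst performer.
    return max(by_tool, key=lambda tool: by_tool[tool]["brier"])
```

Pre-computing this in a deterministic Python step, rather than having the agent discover it, also makes the target choice reproducible across runs.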
scores.json uses " | " as the separator for tool×platform keys, not "/".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
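Any code reading those breakdown entries must split on that exact separator. A hedged sketch follows; the key format is from the commit above, but the helper function itself is hypothetical:

```python
def group_by_tool(breakdown: dict) -> dict:
    """Group per-"tool | platform" entries by tool.

    scores.json combines tool and platform with " | "
    (e.g. "prediction-request-rag-claude | manifold"), not "/".
    Values are treated as opaque scores here.
    """
    per_tool: dict = {}
    for key, score in breakdown.items():
        tool, sep, platform = key.partition(" | ")
        if not sep:
            continue  # skip keys that don't follow the "tool | platform" format
        per_tool.setdefault(tool, {})[platform] = score
    return per_tool
```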
Claude step now uses continue-on-error with a follow-up check: if all phases completed despite hitting the turn limit, the job succeeds. Bumped max-turns from 25 to 30 with graceful fallback at turn 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… ci] Adds data-backed improvement patterns from PROMPT_IMPROVEMENT_PLAN.md: base-rate anchoring, tail discipline, information barrier, absence-of-evidence reasoning, and multi-stage skepticism. Notes that multi-stage tools can't be partially tested via cached replay yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
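To make the tail-discipline pattern concrete: under the Brier score, a wrong near-certain prediction costs far more than a wrong merely-confident one, so clamping extreme probabilities toward the interior caps the damage. The sketch below is illustrative only; the clamp bounds and function names are hypothetical, not taken from PROMPT_IMPROVEMENT_PLAN.md.

```python
def clamp_tails(p: float, floor: float = 0.05, ceiling: float = 0.95) -> float:
    """Pull extreme probabilities toward the interior (illustrative bounds)."""
    return min(max(p, floor), ceiling)


def brier(p: float, outcome: int) -> float:
    """Brier score for one binary question: 0 is perfect, 1 is worst."""
    return (p - outcome) ** 2


# A 0.99 prediction on a question that resolves NO costs ~0.98;
# clamped to 0.95 the same miss costs ~0.90, while a correct confident
# prediction barely suffers (0.0001 -> 0.0025).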
…rompt [skip ci]

- Tell Claude to never commit directly to main; always create a new branch
- Require the exact "Phases Completed: All" format in suggestion_log.txt so the outcome check grep works correctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
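The outcome check described above can be mirrored in a few lines. This is a sketch: the real workflow step greps suggestion_log.txt in shell, and the helper name here is hypothetical.

```python
def all_phases_completed(log_path: str = "suggestion_log.txt") -> bool:
    """True if the log contains the marker the workflow greps for.

    Requiring the fixed "Phases Completed: All" string is what lets the
    continue-on-error follow-up step decide whether the Claude step
    actually finished despite hitting the turn limit.
    """
    try:
        with open(log_path) as f:
            return any("Phases Completed: All" in line for line in f)
    except FileNotFoundError:
        return False
```

Pinning the exact wording matters: a free-form summary like "all phases done" would silently fail the grep and mark a successful run as failed.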
Switch from dawidd6/action-download-artifact (cross-run) to actions/download-artifact (same-run) so suggest-improvements reads the current run's scores.json instead of stale data from a prior run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents Slack/OpenAI failures from blocking artifact uploads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
langgraph-prebuilt 1.0.9 imports ExecutionInfo from langgraph.runtime, which doesn't exist in langgraph 1.0.10. Pinned in both pyproject.toml and tox.ini to match the existing component.yaml pin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Adds a `suggest-improvements` job to the daily benchmark workflow that uses Claude Code Action to identify the worst-performing prediction tool, diagnose prompt issues, apply fixes, and open a PR
- `benchmark/prepare_suggestion.py` pre-extracts target tool context (stats, calibration, overconfidence, platform breakdown, comparison tools) into a single file, saving Claude ~5 turns of discovery
- `benchmark/scorer.py` enriched with `parse_breakdown_by_tool` and `overconfidence_by_tool` for richer diagnostics
- `benchmark/CLAUDE.md` provides tool registry, scoring reference, data-backed improvement patterns (base-rate anchoring, tail discipline, overconfidence fixes), and guidelines for the AI agent
- Reads the current run's `scores.json` (via `actions/download-artifact@v4`) so it always targets the correct worst tool
- Works on a new branch (`auto-improve/<tool-name>`); never commits directly to main
- Requires the exact `Phases Completed: All` log format so the outcome check works reliably
- Uses `continue-on-error: true` so failures don't block artifact uploads

Design Decisions
- Context pre-extraction (`prepare_suggestion.py`): without this, Claude spends ~5 turns discovering files, reading scores, and figuring out which tool to target. Pre-extracting saves turns (and API cost) by handing Claude everything it needs upfront.
- The `ENABLE_TOOL_SUGGESTIONS` repo variable acts as a kill switch if needed.

Controls
- `run_suggestions` workflow input toggle (defaults to off for manual dispatch)
- `ENABLE_TOOL_SUGGESTIONS` repo variable kill switch for scheduled runs
- `BENCHMARK_ANTHROPIC_API_KEY` secret required

Tested on fork
- Identified `prediction-request-rag-claude` as the worst tool and opened PR #3
- Identified `superforcaster` and opened PR #2

Test plan
- Trigger a manual dispatch with `run_suggestions: true`
- Confirm resulting PRs/issues carry the `auto-improvement` label
- Confirm setting `ENABLE_TOOL_SUGGESTIONS=false` disables the job on scheduled runs

🤖 Generated with Claude Code