
feat: AI-powered tool improvement suggestions via Claude Code Action#183

Open
LOCKhart07 wants to merge 16 commits into `main` from `feat/claude-suggest`

Conversation


**LOCKhart07** (Member) commented Apr 3, 2026

Summary

  • Adds a suggest-improvements job to the daily benchmark workflow that uses Claude Code Action to identify the worst-performing prediction tool, diagnose prompt issues, apply fixes, and open a PR
  • Human-in-the-loop: PRs opened by Claude are suggestions, not final — a reviewer must run cached replay to validate improvements before merging
  • New benchmark/prepare_suggestion.py pre-extracts target tool context (stats, calibration, overconfidence, platform breakdown, comparison tools) into a single file, saving Claude ~5 turns of discovery
  • benchmark/scorer.py enriched with parse_breakdown_by_tool and overconfidence_by_tool for richer diagnostics
  • benchmark/CLAUDE.md provides tool registry, scoring reference, data-backed improvement patterns (base-rate anchoring, tail discipline, overconfidence fixes), and guidelines for the AI agent
  • Graceful turn limit handling: if Claude completes all phases but hits the turn limit, the job still succeeds
  • Dedup check prevents duplicate PRs/issues for the same tool
  • Falls back to opening a GitHub issue when no concrete fix is found
  • suggest-improvements uses same-run scores.json (via actions/download-artifact@v4) so it always targets the correct worst tool
  • Prompt enforces branch creation (auto-improve/<tool-name>) — never commits directly to main
  • Prompt specifies the exact `Phases Completed: All` log format so the outcome check works reliably
  • Slack notification step has continue-on-error: true so failures don't block artifact uploads
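The graceful turn-limit handling can be sketched as a workflow fragment (step names, the action ref, and the log path here are illustrative, not the exact workflow contents):

```yaml
- name: Run Claude Code Action
  id: claude
  continue-on-error: true   # turn-limit failures are re-checked below
  uses: anthropics/claude-code-action@v1   # illustrative version ref

- name: Verify outcome
  run: |
    # Succeed when Claude reported finishing every phase, even if the
    # step above failed by exhausting its turn budget.
    grep -q "Phases Completed: All" suggestion_log.txt
```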

Design Decisions

  • Claude Code Action over a lighter script: Prompt improvement requires reading multiple tool source files, comparing patterns across tools, and making judgment calls about what to change — tasks that benefit from an LLM agent with code access. A rule-based script can't diagnose "weak reasoning scaffolding" or "missing bias correction."
  • Human-in-the-loop over auto-merge: MVP approach to save CC turns. Claude opens a PR with its diagnosis and fix; a human reviewer validates via cached replay before merging. This avoids the cost of running replay in CI on every suggestion, while keeping a safety gate on prompt changes.
  • Pre-extracted context (prepare_suggestion.py): Without this, Claude spends ~5 turns discovering files, reading scores, and figuring out which tool to target. Pre-extracting saves turns (and API cost) by handing Claude everything it needs upfront.
  • One tool per run: Each daily run targets only the single worst-performing tool. This keeps runs focused, avoids conflicting changes, and makes PRs easy to review.
  • Defaults: suggestions are on by default for scheduled runs and off by default for manual dispatch. The repo variable ENABLE_TOOL_SUGGESTIONS acts as a kill switch if needed.
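As an illustration, the tool-selection core of the pre-extraction step amounts to something like the following. The `scores.json` schema and the `mean_brier` field name are assumptions for this sketch; the real shape produced by `benchmark/scorer.py` may differ.

```python
import json

def pick_worst_tool(scores_path="scores.json"):
    """Return the name of the worst-performing tool.

    Assumes scores.json maps tool names to per-tool stats containing a
    "mean_brier" field (an assumption; the real schema may differ).
    """
    with open(scores_path) as f:
        scores = json.load(f)
    # Lower Brier is better, so the worst tool has the maximum score.
    return max(scores, key=lambda tool: scores[tool]["mean_brier"])
```

Handing Claude the result of this selection (plus the pre-rendered stats in `suggestion_context.md`) is what saves the ~5 discovery turns mentioned above.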

Controls

  • run_suggestions workflow input toggle (defaults to off for manual dispatch)
  • ENABLE_TOOL_SUGGESTIONS repo variable kill switch for scheduled runs
  • BENCHMARK_ANTHROPIC_API_KEY secret required
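Combined, these controls could gate the job with a single condition along these lines (a sketch; the actual expression in the workflow may differ):

```yaml
suggest-improvements:
  if: >-
    (github.event_name == 'schedule' && vars.ENABLE_TOOL_SUGGESTIONS != 'false') ||
    (github.event_name == 'workflow_dispatch' && inputs.run_suggestions == 'true')
```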

Tested on a fork.

Test plan

  • Trigger manually with run_suggestions: true
  • Verify suggestion_log.txt artifact is uploaded
  • Check that Claude opens PR or issue with auto-improvement label
  • Reviewer runs cached replay on the PR to validate before merging
  • Verify dedup: re-run should skip if PR/issue already open
  • Test kill switch: set ENABLE_TOOL_SUGGESTIONS=false

🤖 Generated with Claude Code

LOCKhart07 and others added 10 commits April 3, 2026 19:08
…skip ci]

Adds a suggest-improvements job to the benchmark workflow that uses
Claude Code Action to identify the worst-performing prediction tool,
diagnose prompt issues, apply fixes, validate with cached replay, and
open a PR with before/after Brier scores.

Includes benchmark/CLAUDE.md with tool registry, scoring reference,
and improvement guidelines for the AI agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n [skip ci]

- Install via uv sync + autonomy packages sync (matches tournament job)
- Add tomte check-code as Phase 5 before committing
- Fair cached replay: baseline from existing tournament predictions,
  only modified prompt gets re-run on same markets
- Add [skip ci] to Claude's commit messages
- Add run_suggestions manual dispatch toggle

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…kip ci]

- Phase 0 checks for existing open PRs/issues before creating duplicates
- Opens a GitHub issue instead of PR when no fix is found or replay
  shows no improvement
- Both use "auto-improvement" label for tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s [skip ci]

Remove cached replay execution from Claude's workflow (was hitting
turn limits). PR body now includes validation instructions for the
reviewer to run cached replay manually. Reduced max-turns to 15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude now writes a summary of its actions to suggestion_log.txt,
uploaded as an artifact for debugging. Explicitly instructed not to
leak secrets in logs, PRs, or issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New benchmark/prepare_suggestion.py: identifies worst tool, checks for
  duplicate PRs/issues, writes suggestion_context.md with all stats,
  calibration, platform breakdown, and comparison tools in one file
- scorer.py: add parse_breakdown_by_tool and overconfidence_by_tool to
  scores.json for richer diagnostics
- Workflow: add prepare step before Claude, simplify prompt (Claude reads
  one context file instead of discovering everything), bump to 25 turns
  with graceful fallback at turn 20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scores.json uses " | " separator for tool×platform keys, not "/".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude step now uses continue-on-error with a follow-up check: if all
phases completed despite hitting the turn limit, the job succeeds.
Bumped max-turns from 25 to 30 with graceful fallback at turn 25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@LOCKhart07 LOCKhart07 self-assigned this Apr 3, 2026
LOCKhart07 and others added 6 commits April 3, 2026 19:23
… ci]

Adds data-backed improvement patterns from PROMPT_IMPROVEMENT_PLAN.md:
base-rate anchoring, tail discipline, information barrier, absence-of-
evidence reasoning, and multi-stage skepticism. Notes that multi-stage
tools can't be partially tested via cached replay yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rompt [skip ci]

- Tell Claude to never commit directly to main, always create a new branch
- Require exact "Phases Completed: All" format in suggestion_log.txt so the
  outcome check grep works correctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from dawidd6/action-download-artifact (cross-run) to
actions/download-artifact (same-run) so suggest-improvements reads
the current run's scores.json instead of stale data from a prior run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents Slack/OpenAI failures from blocking artifact uploads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
langgraph-prebuilt 1.0.9 imports ExecutionInfo from langgraph.runtime
which doesn't exist in langgraph 1.0.10. Pinned in both pyproject.toml
and tox.ini to match the existing component.yaml pin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>