
improve: superforcaster prompt (Brier 0.3897) #2

Closed
claude[bot] wants to merge 1 commit into main from
auto-improve/superforcaster

Conversation

@claude claude bot commented Apr 3, 2026

Current Brier Score

0.3897 (accuracy 46%, sharpness 0.28, n=59 valid predictions)
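For reference, the headline metrics can be computed from the scored predictions like this (a minimal sketch; the `(p_yes, outcome)` tuple shape and the distance-from-0.5 definition of sharpness are assumptions, not the scorer's actual internals):

```python
def brier_score(preds):
    # preds: list of (p_yes, outcome) pairs, outcome in {0, 1}.
    # Mean squared error of the probabilistic forecast; lower is better.
    return sum((p - o) ** 2 for p, o in preds) / len(preds)

def sharpness(preds):
    # One common definition: mean squared distance from 0.5.
    # Higher means bolder (less hedged) forecasts; 0.28 indicates
    # the model was making confident calls, which makes the poor
    # calibration below especially costly.
    return sum((p - 0.5) ** 2 for p, _ in preds) / len(preds)
```

A maximally hedged forecaster (all 0.5) scores sharpness 0.0 and Brier 0.25; the 0.3897 here is worse than always guessing 0.5.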

What Was Wrong

1. Conflicting output format (critical)

The prompt asked for XML-tagged reasoning steps (steps 1–7) AND then said "Output only the JSON object. Do not include any other contents in your response." This direct contradiction caused the model to skip reasoning and jump straight to JSON, defeating the purpose of chain-of-thought scaffolding.

2. max_tokens=500 (critical)

500 tokens is far too small for 6 reasoning steps with XML tags plus JSON output. This caused truncated responses and parsing failures: 6 of 65 responses failed to parse (a 9% failure rate), leaving 59 valid.

3. Severe overconfidence at high probabilities

Calibration data shows catastrophic overconfidence in the 0.9–1.0 range:

  • Predicted avg: 0.955 → Realized: 0.1875 (gap: 0.77!)
  • Predicted avg: 0.82 → Realized: 0.20 (gap: 0.62)
    The prompt had no base-rate anchoring and no tail discipline.
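The gaps above come from bucketing predictions by confidence and comparing the average predicted probability to the realized Yes rate. A sketch (the bucket edges are illustrative, not the scorer's actual bins):

```python
from collections import defaultdict

def calibration_table(preds, edges=(0.0, 0.5, 0.8, 0.9, 1.01)):
    # preds: list of (p_yes, outcome) pairs, outcome in {0, 1}.
    # Returns {bucket: (avg predicted, realized Yes rate)}.
    buckets = defaultdict(list)
    for p, o in preds:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= p < hi:
                buckets[(lo, hi)].append((p, o))
                break
    return {
        b: (sum(p for p, _ in rows) / len(rows),
            sum(o for _, o in rows) / len(rows))
        for b, rows in buckets.items()
    }
```

A well-calibrated forecaster shows realized rates close to predicted averages in every bucket; here the top bucket predicted ~0.96 but realized ~0.19.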

4. Stale knowledge cutoff

Hardcoded "October 2023" cutoff — incorrect for current GPT-4.1 model.

5. No base-rate anchoring

~15% of prediction market questions resolve Yes. The prompt gave no guidance on this prior, causing systematic overestimation of p_yes.

6. No absence-of-evidence reasoning

When sources contained no signal that an event had occurred, the model treated this as neutral. The correct interpretation is: absence of evidence IS evidence against.
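In Bayesian terms: if an event had occurred, sources would likely have reported it, so silence should lower p_yes rather than leave it unchanged. A sketch with illustrative likelihoods (the 0.7/0.05 reporting probabilities are assumptions for demonstration):

```python
def update_on_no_signal(prior, p_report_if_yes=0.7, p_report_if_no=0.05):
    # P(yes | no report) via Bayes' rule.
    # If an event that happened would be reported 70% of the time,
    # seeing no report is strong evidence against it having happened.
    p_silence_given_yes = 1 - p_report_if_yes   # 0.3
    p_silence_given_no = 1 - p_report_if_no     # 0.95
    num = p_silence_given_yes * prior
    return num / (num + p_silence_given_no * (1 - prior))
```

Starting from the 15% base rate, no signal pulls the posterior down to roughly 5%, not a neutral 15%.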

What Was Changed

  • max_tokens: 500 → 2000
  • Removed hardcoded stale knowledge cutoff
  • Added CALIBRATION RULES: base-rate anchoring (15% Yes prior), evidence requirements for high predictions
  • Added TAIL DISCIPLINE: p_yes capped at 0.03–0.97, explicit thresholds for >0.80 and >0.90
  • Added absence-of-evidence reasoning instruction
  • Fixed output format conflict: reasoning steps come first in XML tags, then JSON output follows naturally after (no "output ONLY JSON" contradiction)
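The tail-discipline rules could also be enforced post hoc as a safety net, independent of whether the model follows the prompt. A sketch using the thresholds above (the `has_direct_evidence` flag and the 0.80 fallback are illustrative assumptions):

```python
def apply_tail_discipline(p_yes, has_direct_evidence=False):
    # Hard cap from the TAIL DISCIPLINE rules: never below 0.03
    # or above 0.97.
    p = min(max(p_yes, 0.03), 0.97)
    # Without direct supporting evidence, refuse to go above 0.80,
    # since that bucket realized only ~19% Yes in the baseline run.
    if p > 0.80 and not has_direct_evidence:
        p = 0.80
    return p
```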

Validation

  1. Extract resolved rows for superforcaster from tournament_predictions.jsonl:
    python benchmark/scorer.py --input benchmark/results/tournament_predictions.jsonl --output benchmark/results/baseline_scores.json
  2. Run modified prompt on cached replay dataset:
    python benchmark/runner.py --dataset <replay_dataset.jsonl> --tools superforcaster
    python benchmark/scorer.py --input benchmark/results/replay_results.jsonl --output benchmark/results/new_scores.json
  3. Compare Brier scores on the same set of markets.

Note: The search/retrieval logic is unchanged — only the prompt and max_tokens were modified, so cached replay is a valid test.

… fix [skip ci]

- Increase max_tokens from 500 to 2000 (500 was too low for reasoning + JSON)
- Remove stale knowledge cutoff ("October 2023")
- Add CALIBRATION RULES: base-rate anchoring (~15% of questions resolve Yes)
- Add TAIL DISCIPLINE: constrain p_yes to 0.03-0.97, require evidence for >0.80
- Add absence-of-evidence reasoning (treat no signal as No signal)
- Fix conflicting output format: prompt previously asked for XML reasoning AND
  "output only JSON", causing model to skip reasoning; now JSON follows reasoning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude claude bot added the auto-improvement Automated tool improvement label Apr 3, 2026
@LOCKhart07 LOCKhart07 closed this Apr 3, 2026
@LOCKhart07 LOCKhart07 deleted the auto-improve/superforcaster branch April 3, 2026 16:15
