
improve: superforcaster prompt (Brier 0.3897) #2

Closed
claude[bot] wants to merge 1 commit into main from
auto-improve/superforcaster

Conversation

@claude claude bot commented Apr 3, 2026

Current Brier Score

0.3897 (accuracy 46%, sharpness 0.28, n=59 valid predictions)
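For reference, the headline metrics can be computed from the scored predictions like this (a minimal sketch; the `(p_yes, outcome)` tuple shape and the distance-from-0.5 definition of sharpness are assumptions, not the scorer's actual internals):

```python
def brier_score(preds):
    # preds: list of (p_yes, outcome) pairs, outcome in {0, 1}.
    # Mean squared error of the probabilistic forecast; lower is better.
    return sum((p - o) ** 2 for p, o in preds) / len(preds)

def sharpness(preds):
    # One common definition: mean squared distance from 0.5.
    # Higher means bolder (less hedged) forecasts; 0.28 indicates
    # the model was making confident calls, which makes the poor
    # calibration below especially costly.
    return sum((p - 0.5) ** 2 for p, _ in preds) / len(preds)
```

A maximally hedged forecaster (all 0.5) scores sharpness 0.0 and Brier 0.25; the 0.3897 here is worse than always guessing 0.5.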

What Was Wrong

1. Conflicting output format (critical)

The prompt asked for XML-tagged reasoning steps (steps 1–7) AND then said "Output only the JSON object. Do not include any other contents in your response." This direct contradiction caused the model to skip reasoning and jump straight to JSON, defeating the purpose of chain-of-thought scaffolding.

2. max_tokens=500 (critical)

500 tokens is far too small for 6 reasoning steps with XML tags plus JSON output. This caused truncated responses and parsing failures: 6 of 65 responses failed to parse (a 9% failure rate), leaving 59 valid.

3. Severe overconfidence at high probabilities

Calibration data shows catastrophic overconfidence in the 0.9–1.0 range:

  • Predicted avg: 0.955 → Realized: 0.1875 (gap: 0.77!)
  • Predicted avg: 0.82 → Realized: 0.20 (gap: 0.62)
    The prompt had no base-rate anchoring and no tail discipline.
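The gaps above come from bucketing predictions by confidence and comparing the average predicted probability to the realized Yes rate. A sketch (the bucket edges are illustrative, not the scorer's actual bins):

```python
from collections import defaultdict

def calibration_table(preds, edges=(0.0, 0.5, 0.8, 0.9, 1.01)):
    # preds: list of (p_yes, outcome) pairs, outcome in {0, 1}.
    # Returns {bucket: (avg predicted, realized Yes rate)}.
    buckets = defaultdict(list)
    for p, o in preds:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= p < hi:
                buckets[(lo, hi)].append((p, o))
                break
    return {
        b: (sum(p for p, _ in rows) / len(rows),
            sum(o for _, o in rows) / len(rows))
        for b, rows in buckets.items()
    }
```

A well-calibrated forecaster shows realized rates close to predicted averages in every bucket; here the top bucket predicted ~0.96 but realized ~0.19.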

4. Stale knowledge cutoff

Hardcoded "October 2023" cutoff — incorrect for current GPT-4.1 model.

5. No base-rate anchoring

~15% of prediction market questions resolve Yes. The prompt gave no guidance on this prior, causing systematic overestimation of p_yes.

6. No absence-of-evidence reasoning

When sources contained no signal that an event had occurred, the model treated this as neutral. The correct interpretation is: absence of evidence IS evidence against.
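In Bayesian terms: if an event had occurred, sources would likely have reported it, so silence should lower p_yes rather than leave it unchanged. A sketch with illustrative likelihoods (the 0.7/0.05 reporting probabilities are assumptions for demonstration):

```python
def update_on_no_signal(prior, p_report_if_yes=0.7, p_report_if_no=0.05):
    # P(yes | no report) via Bayes' rule.
    # If an event that happened would be reported 70% of the time,
    # seeing no report is strong evidence against it having happened.
    p_silence_given_yes = 1 - p_report_if_yes   # 0.3
    p_silence_given_no = 1 - p_report_if_no     # 0.95
    num = p_silence_given_yes * prior
    return num / (num + p_silence_given_no * (1 - prior))
```

Starting from the 15% base rate, no signal pulls the posterior down to roughly 5%, not a neutral 15%.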

What Was Changed

  • max_tokens: 500 → 2000
  • Removed hardcoded stale knowledge cutoff
  • Added CALIBRATION RULES: base-rate anchoring (15% Yes prior), evidence requirements for high predictions
  • Added TAIL DISCIPLINE: p_yes capped at 0.03–0.97, explicit thresholds for >0.80 and >0.90
  • Added absence-of-evidence reasoning instruction
  • Fixed output format conflict: reasoning steps come first in XML tags, then JSON output follows naturally after (no "output ONLY JSON" contradiction)
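The tail-discipline rules could also be enforced post hoc as a safety net, independent of whether the model follows the prompt. A sketch using the thresholds above (the `has_direct_evidence` flag and the 0.80 fallback are illustrative assumptions):

```python
def apply_tail_discipline(p_yes, has_direct_evidence=False):
    # Hard cap from the TAIL DISCIPLINE rules: never below 0.03
    # or above 0.97.
    p = min(max(p_yes, 0.03), 0.97)
    # Without direct supporting evidence, refuse to go above 0.80,
    # since that bucket realized only ~19% Yes in the baseline run.
    if p > 0.80 and not has_direct_evidence:
        p = 0.80
    return p
```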

Validation

  1. Extract resolved rows for superforcaster from tournament_predictions.jsonl:
    python benchmark/scorer.py --input benchmark/results/tournament_predictions.jsonl --output benchmark/results/baseline_scores.json
  2. Run modified prompt on cached replay dataset:
    python benchmark/runner.py --dataset <replay_dataset.jsonl> --tools superforcaster
    python benchmark/scorer.py --input benchmark/results/replay_results.jsonl --output benchmark/results/new_scores.json
  3. Compare Brier scores on the same set of markets.

Note: The search/retrieval logic is unchanged — only the prompt and max_tokens were modified, so cached replay is a valid test.

… fix [skip ci]

- Increase max_tokens from 500 to 2000 (500 was too low for reasoning + JSON)
- Remove stale knowledge cutoff ("October 2023")
- Add CALIBRATION RULES: base-rate anchoring (~15% of questions resolve Yes)
- Add TAIL DISCIPLINE: constrain p_yes to 0.03-0.97, require evidence for >0.80
- Add absence-of-evidence reasoning (treat no signal as No signal)
- Fix conflicting output format: prompt previously asked for XML reasoning AND
  "output only JSON", causing model to skip reasoning; now JSON follows reasoning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude claude bot added the auto-improvement Automated tool improvement label Apr 3, 2026
@LOCKhart07 LOCKhart07 closed this Apr 3, 2026
@LOCKhart07 LOCKhart07 deleted the auto-improve/superforcaster branch April 3, 2026 16:15
