improve: superforecaster prompt (Brier 0.3897) #2
Closed
claude[bot] wants to merge 1 commit into main from
… fix [skip ci]
- Increase max_tokens from 500 to 2000 (500 was too low for reasoning + JSON)
- Remove stale knowledge cutoff ("October 2023")
- Add CALIBRATION RULES: base-rate anchoring (~15% of questions resolve Yes)
- Add TAIL DISCIPLINE: constrain p_yes to 0.03-0.97, require evidence for >0.80
- Add absence-of-evidence reasoning (treat no signal as No signal)
- Fix conflicting output format: prompt previously asked for XML reasoning AND
"output only JSON", causing model to skip reasoning; now JSON follows reasoning
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Current Brier Score
0.3897 (accuracy 46%, sharpness 0.28, n=59 valid predictions)
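For reference, the Brier score is the mean squared error between the predicted probabilities and the 0/1 resolutions. A minimal sketch; the function name and sample data are illustrative, not from the repo's code:

```python
def brier_score(preds, outcomes):
    """Mean squared error between p_yes predictions and 0/1 resolutions.

    Lower is better: 0.0 is perfect, 0.25 is what a constant 0.5 scores.
    """
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

# Toy example: three predictions against their resolutions
preds = [0.9, 0.2, 0.7]
outcomes = [0, 0, 1]
print(round(brier_score(preds, outcomes), 4))  # 0.3133
```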
What Was Wrong
1. Conflicting output format (critical)
The prompt asked for XML-tagged reasoning steps (steps 1–7) AND then said "Output only the JSON object. Do not include any other contents in your response." This direct contradiction caused the model to skip reasoning and jump straight to JSON, defeating the purpose of chain-of-thought scaffolding.
2. max_tokens=500 (critical)
500 tokens is far too small for seven XML-tagged reasoning steps plus the JSON output. The resulting truncated responses caused parsing failures: only 59 of 65 predictions were valid, a ~9% failure rate.
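A back-of-the-envelope token budget shows why 500 was insufficient and 2000 is comfortable. The per-step and JSON token counts below are assumptions for illustration, not measurements:

```python
# Rough output budget for one response; counts are assumed, not measured.
steps = 7               # the prompt's XML-tagged reasoning steps 1-7
tokens_per_step = 200   # assumed average length of one reasoning step
json_tokens = 150       # assumed size of the final JSON object
needed = steps * tokens_per_step + json_tokens
print(needed)  # 1550: well over the old max_tokens=500, under the new 2000
```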
3. Severe overconfidence at high probabilities
Calibration data showed catastrophic overconfidence in the 0.9–1.0 range. The prompt had no base-rate anchoring and no tail discipline.
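The new TAIL DISCIPLINE rule can also be enforced mechanically as a post-processing clamp. This is a sketch of that idea, not the repo's actual code:

```python
def apply_tail_discipline(p_yes: float, lo: float = 0.03, hi: float = 0.97) -> float:
    """Clamp a probability into [lo, hi], per the new 0.03-0.97 rule."""
    return max(lo, min(hi, p_yes))

print(apply_tail_discipline(0.999))  # 0.97
print(apply_tail_discipline(0.50))   # 0.5
```

Clamping alone does not fix overconfidence, but it caps the worst-case Brier penalty a single confidently wrong prediction can incur.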
4. Stale knowledge cutoff
The prompt hardcoded an "October 2023" knowledge cutoff, which is wrong for the current GPT-4.1 model.
5. No base-rate anchoring
~15% of prediction market questions resolve Yes. The prompt gave no guidance on this prior, causing systematic overestimation of p_yes.
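One mechanical way to apply such a prior is to shrink raw probabilities toward the ~15% base rate. The shrinkage weight below is a made-up illustration; the PR applies the anchor via prompt instructions, not post-processing:

```python
BASE_RATE = 0.15  # ~15% of questions resolve Yes, per the PR description

def anchor_to_base_rate(p_yes: float, weight: float = 0.3) -> float:
    """Linear shrinkage toward the base rate; `weight` is a hypothetical knob."""
    return (1 - weight) * p_yes + weight * BASE_RATE

print(round(anchor_to_base_rate(0.80), 3))  # 0.605
```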
6. No absence-of-evidence reasoning
When sources contained no signal that an event had occurred, the model treated this as neutral. The correct interpretation is: absence of evidence IS evidence against.
What Was Changed
max_tokens: 500 → 2000
Validation
Note: The search/retrieval logic is unchanged; only the prompt and max_tokens were modified, so replaying against cached search results is a valid test.