fix(superforcaster): switch to OpenAI Structured Outputs (#221) #231

LOCKhart07 merged 13 commits into main
Conversation
The self-contradicting prompt (7-step XML reasoning chain then "output only JSON") let the model leak reasoning into toolResponse at 25-30%/day since the 2026-04-06 max_tokens bump. Use client.beta.chat.completions.parse with a PredictionResult pydantic schema so the model physically cannot return free-form text. The reasoning chain survives as separate schema fields (facts, reasons_no, reasons_yes, aggregation, reflection); on-chain result is unchanged - only the four standard mech fields (p_yes, p_no, confidence, info_utility) are serialised. Prompt methodology (calibration, evidence bar, confidence coupling, numeric-question check) is preserved verbatim; XML tag delimiters dropped. Retries now cover pydantic ValidationError (e.g. p_yes + p_no sum check), network blips and OpenAI-side transient errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
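As a sketch of the mechanism this commit describes (the field names come from the message above; `to_onchain`, the validator body, and the module layout are my assumptions, not the PR's actual code):

```python
# Sketch only: schema fields per the commit message; helper names are assumptions.
from pydantic import BaseModel, Field, model_validator


class PredictionResult(BaseModel):
    # Reasoning survives as schema fields instead of leaking into toolResponse.
    facts: list[str]
    reasons_yes: list[str]
    reasons_no: list[str]
    aggregation: str
    reflection: str
    # The four on-chain mech fields.
    p_yes: float = Field(ge=0.0, le=1.0)
    p_no: float = Field(ge=0.0, le=1.0)
    confidence: float = Field(ge=0.0, le=1.0)
    info_utility: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def _probs_sum_to_one(self) -> "PredictionResult":
        # A violation raises ValidationError, which the retry loop catches.
        if abs(self.p_yes + self.p_no - 1.0) > 1e-6:
            raise ValueError("p_yes + p_no must sum to 1")
        return self


def to_onchain(result: PredictionResult) -> dict:
    # Only the four standard mech fields are serialised on-chain.
    return result.model_dump(include={"p_yes", "p_no", "confidence", "info_utility"})


# The actual call would then look something like:
# completion = client.beta.chat.completions.parse(
#     model=..., messages=..., response_format=PredictionResult)
# result = completion.choices[0].message.parsed
```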
…e path Tools using OpenAI Structured Outputs (superforcaster, factual_research) can't be replayed through the plain chat.completions.create helper - their prompts no longer contain format directives, so the model returns free-form text and candidate parse fails. Add _call_openai_structured() that takes the caller's Pydantic schema and returns a JSON string of only the four on-chain fields, so the downstream parse_response stays tool-agnostic. A small name->schema registry (_STRUCTURED_OUTPUT_SCHEMAS) lets replay() dispatch by tool name - adding another structured-output tool is one registry entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
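The registry dispatch described above is essentially a name-to-schema lookup. A minimal sketch, with stand-in bodies (only the names `_STRUCTURED_OUTPUT_SCHEMAS`, `_call_openai_structured`, and `replay` come from the commit; the rest is illustrative):

```python
import json


class _FakeSchema:
    """Stand-in for a tool's Pydantic schema class."""


def _call_openai_structured(prompt: str, schema: type) -> str:
    # The real helper would call client.beta.chat.completions.parse(...,
    # response_format=schema) and serialise only the four on-chain fields.
    return json.dumps({"schema": schema.__name__, "prompt_len": len(prompt)})


def _call_chat_completion(prompt: str) -> str:
    # Plain replay path for tools whose prompts still carry format directives.
    return "free-form text"


_STRUCTURED_OUTPUT_SCHEMAS: dict[str, type] = {"superforcaster": _FakeSchema}


def replay(tool_name: str, prompt: str) -> str:
    # Adding another structured-output tool is one registry entry.
    schema = _STRUCTURED_OUTPUT_SCHEMAS.get(tool_name)
    if schema is not None:
        return _call_openai_structured(prompt, schema)
    return _call_chat_completion(prompt)
```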
Issue #221 was invisible to every existing dashboard because an on-chain deliver with a malformed toolResponse counted as a success - Brier and accuracy both skipped it silently. Same blind spot existed in the prompt_replay summary, which only reported "N candidate scored" buried in a parenthetical. Track prediction_parse_status per candidate during the replay loop, emit an explicit "Parse reliability" block (valid/total, 4-bucket breakdown, delta vs baseline) above the existing Brier block, and persist any non-valid responses to candidate_failures.jsonl alongside baseline.jsonl / candidate.jsonl for forensic inspection. No exit-code change - regression stays visible-but-not-blocking for now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
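A rough sketch of per-candidate parse-status tracking. The PR does not name its four buckets, so the bucket names below are my invention purely for illustration:

```python
import json
from collections import Counter

# Bucket names are hypothetical; the PR only states there is a 4-bucket breakdown.
def classify_parse(raw: str) -> str:
    if not raw.strip():
        return "empty"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"  # e.g. the #221 <facts>-leak free-form text
    required = {"p_yes", "p_no", "confidence", "info_utility"}
    if not isinstance(data, dict) or not required.issubset(data):
        return "missing_fields"
    return "valid"


def summarise(responses: list[str]) -> dict:
    buckets = Counter(classify_parse(r) for r in responses)
    return {
        "valid": buckets.get("valid", 0),
        "total": len(responses),
        "buckets": dict(buckets),
    }
```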
Extends compute_metrics with a parse_reliability block (total, valid, parse_rate, 4-bucket breakdown) and renders it above the Brier table in the PR comment. Reads candidate_failures.jsonl (written by prompt_replay when any candidate fails to parse) and inlines up to 5 failure bodies in a collapsed <details> so leaks like #221's <facts>-leak are diagnosable from the PR thread without hunting through CI logs. A candidate parse-rate drop vs baseline is flagged with ⚠️; no change is ✅. Body content is backtick-escaped to prevent code-fence breakout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
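The backtick-escaping defence mentioned above might look roughly like this (a sketch under my own assumptions; the real helper's name and exact substitution character may differ — here literal backticks are swapped for the lookalike U+02CB):

```python
def escape_inline(text: str) -> str:
    # Replace literal backticks with a lookalike (U+02CB) so a crafted
    # response body cannot close the surrounding inline-code span.
    return text.replace("`", "\u02cb")


def render_failure(question: str, raw: str) -> str:
    # Wrap untrusted strings in inline code inside the collapsed block;
    # inline code also neutralises HTML like a stray </details>.
    return (
        "<details><summary>`" + escape_inline(question) + "`</summary>\n\n"
        "`" + escape_inline(raw) + "`\n\n</details>"
    )
```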
- black/isort re-wraps across the three benchmark files touched
- class docstrings on the three TestCi* classes
- drop unused json import, capitalise D403 docstring
- add full darglint param/return/raises to _parse_completion

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/benchmark superforcaster --sample 50

Benchmark: superforcaster
Parse reliability
Per-platform breakdown
Omen (n=50)
Polymarket (n=50)
100 markets | triggered by @LOCKhart07
bennyjo
left a comment
Three suggestions on the benchmark-side plumbing: schema/prompt step-numbering mismatch, unescaped question_text in the PR comment, and a hard-coded baseline_parse_rate=1.0 that can silently lie. Core fix (Structured Outputs) looks good and should close #221.
Schema descriptions no longer carry "Step N —" prefixes; the prompt body remains the single source of ordering. Addresses drift where reflection was labelled Step 6 in the schema but Step 5 in the prompt body after steps 5 and 7 were collapsed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap candidate-failure question text in inline code and replace literal backticks, matching the existing defence on raw_response. Prevents a crafted on-chain question containing </details>, backticks, or HTML from breaking the collapsed <details> block in the public PR comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visible
Replace hardcoded ``baseline_parse_rate = 1.0`` with an accounted
invariant. ``_load_and_filter_rows`` now returns per-reason rejection
counts; ``enrich`` persists them as a ``{output}.filter_stats.json``
sidecar; ``_log_replay_summary`` and the ci_replay PR comment render
a Pre-filter block with a warning marker when ``not_valid_parse`` is
nonzero. The other four rejection buckets are expected sample scoping
and stay informational only.
The replay summary still declares baseline as 100% by construction —
but now there is an independent observation that would surface a
regression in the upstream "drop non-valid parses" behaviour, instead
of silently repeating the same #221 failure mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
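The per-reason rejection accounting above could be sketched as follows (function names from the commit; row shape, reason keys other than `not_valid_parse`, and the derived pre-filter rate are my assumptions):

```python
import json
from collections import Counter
from pathlib import Path


def load_and_filter_rows(rows: list[dict]) -> tuple[list[dict], dict]:
    # Keep accepted rows; count every rejection by reason.
    kept, reasons = [], Counter()
    for row in rows:
        reason = row.get("reject_reason")
        if reason:
            reasons[reason] += 1
        else:
            kept.append(row)
    return kept, dict(reasons)


def pre_filter_parse_rate(accepted: int, reasons: dict) -> float:
    # Only not_valid_parse counts against parse reliability; the other
    # rejection buckets are expected sample scoping.
    bad = reasons.get("not_valid_parse", 0)
    return accepted / (accepted + bad) if (accepted + bad) else 1.0


def write_filter_stats(output: Path, accepted: int, reasons: dict) -> None:
    # Persist the counts as an {output}.filter_stats.json sidecar.
    sidecar = Path(str(output) + ".filter_stats.json")
    sidecar.write_text(json.dumps({"accepted": accepted, "rejected": reasons}))
```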
… Reliability section

Baseline parse rate is always 100% by construction (the enrich filter drops non-valid rows), so a baseline-vs-candidate Parse reliability block compares a tautology against a measurement. Drop that framing: move the primary metrics table to the top of the PR comment, then render a single Reliability section below with two one-sided bullets —

- Candidate parse rate: N/M (X.X%) ✅|⚠️
- Pre-filter (enrich): A accepted, R rejected, not_valid_parse=N ✅|⚠️

Breakdown and scoping lines only surface on drift / rejections, so the happy path stays tight. The comparison table now leads the comment, which is what reviewers actually scan first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2
Stale parse-failure artifacts can leak into new reports
candidate_failures.jsonl is only written when failures exist and is never removed when a run has zero failures. Because ci_replay.py reads this file whenever it exists, reusing an output directory can make a clean run report old failures as if they were current.
P3
Pre-filter stats can become stale across runs
filter_stats.json is only written when sidecar stats are present. If a previous run left this file behind and a later run has no sidecar, ci_replay.py can still load old stats and show misleading pre-filter numbers for the current benchmark.
Note re ci_replay
I know the regression set is implemented, but in ci_replay both baseline and candidate are 100/100 parse-valid, so this is a ceiling result and doesn’t prove improved reliability. Can we increase the replay sample (e.g., 200 per platform) and run 2–3 seeds for better comparison power? Also, please report parse-valid / total attempted calls before filtering.
Practical target:
- 200+ per platform for PR runs.
- 3 seeds (42, 1337, 2026).
- Keep regression-set run separate from random-sample run.
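To make the "comparison power" point concrete, here is a back-of-the-envelope illustration with my own numbers (not from the PR): the standard error of an observed parse rate shrinks with the square root of the sample size, so going from n=50 to n=200 halves the uncertainty.

```python
import math


def parse_rate_se(p: float, n: int) -> float:
    # Standard error of an observed proportion p-hat at sample size n.
    return math.sqrt(p * (1 - p) / n)


# At a hypothetical true 72.5% parse rate (midpoint of the reported
# 25-30% leak), the uncertainty halves going from n=50 to n=200:
se_50 = parse_rate_se(0.725, 50)    # ~0.063
se_200 = parse_rate_se(0.725, 200)  # ~0.032
```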
candidate_failures.jsonl and filter_stats.json are written conditionally (only when failures / stats exist), so a reused output_dir would leak a prior run's sidecars into ci_replay and surface them as current. Prep now unlinks both before each run, keeping "files in output_dir correspond to this run only" as an invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
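The prep invariant described above amounts to unlinking the two conditionally-written sidecars before each run. A minimal sketch, assuming the file names from the PR (the function name is mine):

```python
import tempfile  # used in the usage example below
from pathlib import Path

SIDECARS = ("candidate_failures.jsonl", "filter_stats.json")


def prep_output_dir(output_dir: Path) -> None:
    """Remove stale per-run sidecars so output_dir reflects this run only."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for name in SIDECARS:
        # unlink(missing_ok=True) needs Python 3.8+
        (output_dir / name).unlink(missing_ok=True)


# Usage: a stale failure file from a prior run is gone after prep.
out = Path(tempfile.mkdtemp())
(out / "candidate_failures.jsonl").write_text("stale")
prep_output_dir(out)
```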
@mariapiamo both P2 and P3 are the same root-cause bug — addressed in dec17aa.
Root cause: …
Fix: extracted the output-dir prep into …
Coverage (TDD): two new tests in …
Rejected alternatives: …
/benchmark superforcaster --sample 200

^ Did not finish: the run hit the timeout due to the number of markets.

/benchmark superforcaster --sample 100 --seed 1337

/benchmark superforcaster --sample 100 --seed 2026
bennyjo
left a comment
Re-reviewed on d6fd4f0. All three prior concerns resolved: schema step-numbering drift stripped at source, question_text backtick-escaped + inline-coded, and hard-coded baseline_parse_rate=1.0 now paired with an independent filter_stats.json sidecar that flags nonzero not_valid_parse counts.
@mariapiamo thanks — all three asks (larger sample, multi-seed, parse-valid / total-attempted before filtering) are fair as benchmark-infrastructure upgrades. Scoped them out of this PR into #233 so the #221 fix can land without pulling in the broader overhaul:
In the meantime I've triggered two extra runs on this PR.

On the "ceiling result" framing: agreed for the baseline side — 100% is a tautology because the enrich step pre-filters to valid rows. For the candidate side on this specific PR the guarantee is architectural rather than statistical: OpenAI Structured Outputs against the PredictionResult schema make free-form output unrepresentable.

The general upgrades in #233 still apply to everything the benchmark suite does after this PR, where the claims will be probabilistic rather than architectural.
Benchmark: superforcaster
Reliability
Per-platform breakdown
Omen (n=100)
Polymarket (n=101)
201 markets | triggered by @LOCKhart07
Benchmark: superforcaster
Reliability
Per-platform breakdown
Omen (n=100)
Polymarket (n=101)
201 markets | triggered by @LOCKhart07
…comment

Two reliability-labeling improvements to the /benchmark PR comment, per #231 review feedback and the follow-ups in #233:

1. Render baseline pre-filter parse rate alongside the existing Pre-filter line. Post-filter baseline=100% is a tautology because enrich drops non-valid rows; the pre-filter ratio (accepted / (accepted + not_valid_parse)) is what tells reviewers how noisy production actually was.
2. Thread --seed and --trigger-comment-url through the workflow into ci_replay so multi-seed runs posted by different triggering comments are distinguishable in-place. Seed lands in the footer (already rendered if present in meta, just never populated); trigger-comment URL wraps the @user mention in a markdown link back to the originating comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #221.
Summary
- `superforcaster` delivers: swap the contradictory JSON-in-prompt path (7-step XML reasoning chain + "output only JSON") for OpenAI Structured Outputs against a `PredictionResult` pydantic schema. The model physically cannot return free-form text anymore — the `<facts>`-leak mode that hit 25–30%/day on Gnosis (since 2026-04-08) is no longer representable. Superforecaster methodology (CALIBRATION base rate, EVIDENCE BAR, CONFIDENCE COUPLING, NUMERIC QUESTIONS, bias adjustments) is preserved verbatim — only the XML scaffolding and the per-step structural distinctness of steps 5 and 7 were compressed. See the commit body on 0bdce2cf for the methodology diff.
- `benchmark/prompt_replay`: add `_call_openai_structured` + a `tool_name → schema class` registry so tools using structured outputs (`superforcaster` now, `factual_research` already) can be replayed honestly instead of falling through to `chat.completions.create`.
- `benchmark/prompt_replay` + `benchmark/ci_replay`: surface parse reliability as a first-class metric. #221 ("superforcaster: `<facts>` reasoning leaks into toolResponse, ~25–30% of deliveries, stepped up 2026-04-08") was invisible to every existing dashboard because an on-chain deliver with a malformed `toolResponse` counted as a success — Brier and accuracy both skipped it. The replay summary and PR comment now report `valid/total` with a 4-bucket breakdown and flag any candidate drop vs baseline (non-valid responses are persisted to `candidate_failures.jsonl` and inlined in a collapsed `<details>` in the PR comment for forensic inspection).

On-chain output is unchanged — only the four standard mech fields (`p_yes`, `p_no`, `confidence`, `info_utility`) are serialised.

Test plan

- `pytest packages/valory/customs/superforcaster/tests` — 10/10 green (schema contract, on-chain shape, validator, source_content replay)
- `pytest benchmark/tests` — 307/307 green, including 10 new tests for `ci_replay` reliability rendering
- `prompt_replay` on an N=11 stratified sample: 11/11 candidate parse success (vs the 25–30% leak rate in prod)
- `/benchmark superforcaster` — see next comment