fix(superforcaster): switch to OpenAI Structured Outputs (#221) #231

LOCKhart07 merged 13 commits into main
Conversation
The self-contradicting prompt (7-step XML reasoning chain then "output only JSON") let the model leak reasoning into toolResponse at 25-30%/day since the 2026-04-06 max_tokens bump. Use client.beta.chat.completions.parse with a PredictionResult pydantic schema so the model physically cannot return free-form text. The reasoning chain survives as separate schema fields (facts, reasons_no, reasons_yes, aggregation, reflection); on-chain result is unchanged - only the four standard mech fields (p_yes, p_no, confidence, info_utility) are serialised. Prompt methodology (calibration, evidence bar, confidence coupling, numeric-question check) is preserved verbatim; XML tag delimiters dropped. Retries now cover pydantic ValidationError (e.g. p_yes + p_no sum check), network blips and OpenAI-side transient errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
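As a sketch of the mechanism this commit describes (the field names come from the message above; `to_onchain`, the validator body, and the module layout are my assumptions, not the PR's actual code):

```python
# Sketch only: schema fields per the commit message; helper names are assumptions.
from pydantic import BaseModel, Field, model_validator


class PredictionResult(BaseModel):
    # Reasoning survives as schema fields instead of leaking into toolResponse.
    facts: list[str]
    reasons_yes: list[str]
    reasons_no: list[str]
    aggregation: str
    reflection: str
    # The four on-chain mech fields.
    p_yes: float = Field(ge=0.0, le=1.0)
    p_no: float = Field(ge=0.0, le=1.0)
    confidence: float = Field(ge=0.0, le=1.0)
    info_utility: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def _probs_sum_to_one(self) -> "PredictionResult":
        # A violation raises ValidationError, which the retry loop catches.
        if abs(self.p_yes + self.p_no - 1.0) > 1e-6:
            raise ValueError("p_yes + p_no must sum to 1")
        return self


def to_onchain(result: PredictionResult) -> dict:
    # Only the four standard mech fields are serialised on-chain.
    return result.model_dump(include={"p_yes", "p_no", "confidence", "info_utility"})


# The actual call would then look something like:
# completion = client.beta.chat.completions.parse(
#     model=..., messages=..., response_format=PredictionResult)
# result = completion.choices[0].message.parsed
```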
…e path Tools using OpenAI Structured Outputs (superforcaster, factual_research) can't be replayed through the plain chat.completions.create helper - their prompts no longer contain format directives, so the model returns free-form text and candidate parse fails. Add _call_openai_structured() that takes the caller's Pydantic schema and returns a JSON string of only the four on-chain fields, so the downstream parse_response stays tool-agnostic. A small name->schema registry (_STRUCTURED_OUTPUT_SCHEMAS) lets replay() dispatch by tool name - adding another structured-output tool is one registry entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
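The registry dispatch described above is essentially a name-to-schema lookup. A minimal sketch, with stand-in bodies (only the names `_STRUCTURED_OUTPUT_SCHEMAS`, `_call_openai_structured`, and `replay` come from the commit; the rest is illustrative):

```python
import json


class _FakeSchema:
    """Stand-in for a tool's Pydantic schema class."""


def _call_openai_structured(prompt: str, schema: type) -> str:
    # The real helper would call client.beta.chat.completions.parse(...,
    # response_format=schema) and serialise only the four on-chain fields.
    return json.dumps({"schema": schema.__name__, "prompt_len": len(prompt)})


def _call_chat_completion(prompt: str) -> str:
    # Plain replay path for tools whose prompts still carry format directives.
    return "free-form text"


_STRUCTURED_OUTPUT_SCHEMAS: dict[str, type] = {"superforcaster": _FakeSchema}


def replay(tool_name: str, prompt: str) -> str:
    # Adding another structured-output tool is one registry entry.
    schema = _STRUCTURED_OUTPUT_SCHEMAS.get(tool_name)
    if schema is not None:
        return _call_openai_structured(prompt, schema)
    return _call_chat_completion(prompt)
```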
Issue #221 was invisible to every existing dashboard because an on-chain deliver with a malformed toolResponse counted as a success - Brier and accuracy both skipped it silently. Same blind spot existed in the prompt_replay summary, which only reported "N candidate scored" buried in a parenthetical. Track prediction_parse_status per candidate during the replay loop, emit an explicit "Parse reliability" block (valid/total, 4-bucket breakdown, delta vs baseline) above the existing Brier block, and persist any non-valid responses to candidate_failures.jsonl alongside baseline.jsonl / candidate.jsonl for forensic inspection. No exit-code change - regression stays visible-but-not-blocking for now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
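A rough sketch of per-candidate parse-status tracking. The PR does not name its four buckets, so the bucket names below are my invention purely for illustration:

```python
import json
from collections import Counter

# Bucket names are hypothetical; the PR only states there is a 4-bucket breakdown.
def classify_parse(raw: str) -> str:
    if not raw.strip():
        return "empty"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"  # e.g. the #221 <facts>-leak free-form text
    required = {"p_yes", "p_no", "confidence", "info_utility"}
    if not isinstance(data, dict) or not required.issubset(data):
        return "missing_fields"
    return "valid"


def summarise(responses: list[str]) -> dict:
    buckets = Counter(classify_parse(r) for r in responses)
    return {
        "valid": buckets.get("valid", 0),
        "total": len(responses),
        "buckets": dict(buckets),
    }
```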
Extends compute_metrics with a parse_reliability block (total, valid, parse_rate, 4-bucket breakdown) and renders it above the Brier table in the PR comment. Reads candidate_failures.jsonl (written by prompt_replay when any candidate fails to parse) and inlines up to 5 failure bodies in a collapsed <details> so leaks like #221's <facts>-leak are diagnosable from the PR thread without hunting through CI logs. A candidate parse-rate drop vs baseline is flagged with ⚠️; no change is ✅. Body content is backtick-escaped to prevent code-fence breakout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
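The backtick-escaping defence mentioned above might look roughly like this (a sketch under my own assumptions; the real helper's name and exact substitution character may differ — here literal backticks are swapped for the lookalike U+02CB):

```python
def escape_inline(text: str) -> str:
    # Replace literal backticks with a lookalike (U+02CB) so a crafted
    # response body cannot close the surrounding inline-code span.
    return text.replace("`", "\u02cb")


def render_failure(question: str, raw: str) -> str:
    # Wrap untrusted strings in inline code inside the collapsed block;
    # inline code also neutralises HTML like a stray </details>.
    return (
        "<details><summary>`" + escape_inline(question) + "`</summary>\n\n"
        "`" + escape_inline(raw) + "`\n\n</details>"
    )
```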
- black/isort re-wraps across the three benchmark files touched
- class docstrings on the three TestCi* classes
- drop unused json import, capitalise D403 docstring
- add full darglint param/return/raises to _parse_completion

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/benchmark superforcaster --sample 50

Benchmark: superforcaster
Parse reliability
Per-platform breakdown
Omen (n=50)
Polymarket (n=50)
100 markets | triggered by @LOCKhart07
bennyjo
left a comment
Three suggestions on the benchmark-side plumbing: schema/prompt step-numbering mismatch, unescaped question_text in the PR comment, and a hard-coded baseline_parse_rate=1.0 that can silently lie. Core fix (Structured Outputs) looks good and should close #221.
Schema descriptions no longer carry "Step N —" prefixes; the prompt body remains the single source of ordering. Addresses drift where reflection was labelled Step 6 in the schema but Step 5 in the prompt body after steps 5 and 7 were collapsed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap candidate-failure question text in inline code and replace literal backticks, matching the existing defence on raw_response. Prevents a crafted on-chain question containing </details>, backticks, or HTML from breaking the collapsed <details> block in the public PR comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visible
Replace hardcoded ``baseline_parse_rate = 1.0`` with an accounted
invariant. ``_load_and_filter_rows`` now returns per-reason rejection
counts; ``enrich`` persists them as a ``{output}.filter_stats.json``
sidecar; ``_log_replay_summary`` and the ci_replay PR comment render
a Pre-filter block with a warning marker when ``not_valid_parse`` is
nonzero. The other four rejection buckets are expected sample scoping
and stay informational only.
The replay summary still declares baseline as 100% by construction —
but now there is an independent observation that would surface a
regression in the upstream "drop non-valid parses" behaviour, instead
of silently repeating the same #221 failure mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
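The per-reason rejection accounting above could be sketched as follows (function names from the commit; row shape, reason keys other than `not_valid_parse`, and the derived pre-filter rate are my assumptions):

```python
import json
from collections import Counter
from pathlib import Path


def load_and_filter_rows(rows: list[dict]) -> tuple[list[dict], dict]:
    # Keep accepted rows; count every rejection by reason.
    kept, reasons = [], Counter()
    for row in rows:
        reason = row.get("reject_reason")
        if reason:
            reasons[reason] += 1
        else:
            kept.append(row)
    return kept, dict(reasons)


def pre_filter_parse_rate(accepted: int, reasons: dict) -> float:
    # Only not_valid_parse counts against parse reliability; the other
    # rejection buckets are expected sample scoping.
    bad = reasons.get("not_valid_parse", 0)
    return accepted / (accepted + bad) if (accepted + bad) else 1.0


def write_filter_stats(output: Path, accepted: int, reasons: dict) -> None:
    # Persist the counts as an {output}.filter_stats.json sidecar.
    sidecar = Path(str(output) + ".filter_stats.json")
    sidecar.write_text(json.dumps({"accepted": accepted, "rejected": reasons}))
```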
… Reliability section

Baseline parse rate is always 100% by construction (the enrich filter drops non-valid rows), so a baseline-vs-candidate Parse reliability block compares a tautology against a measurement. Drop that framing: move the primary metrics table to the top of the PR comment, then render a single Reliability section below with two one-sided bullets —

- Candidate parse rate: N/M (X.X%) ✅|⚠️
- Pre-filter (enrich): A accepted, R rejected, not_valid_parse=N ✅|⚠️

Breakdown and scoping lines only surface on drift / rejections, so the happy path stays tight. The comparison table now leads the comment, which is what reviewers actually scan first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2
Stale parse-failure artifacts can leak into new reports
candidate_failures.jsonl is only written when failures exist and is never removed when a run has zero failures. Because ci_replay.py reads this file whenever it exists, reusing an output directory can make a clean run report old failures as if they were current.
P3
Pre-filter stats can become stale across runs
filter_stats.json is only written when sidecar stats are present. If a previous run left this file behind and a later run has no sidecar, ci_replay.py can still load old stats and show misleading pre-filter numbers for the current benchmark.
Note re ci_replay
I know the regression set is implemented, but in ci_replay both baseline and candidate are 100/100 parse-valid, so this is a ceiling result and doesn’t prove improved reliability. Can we increase the replay sample (e.g., 200 per platform) and run 2–3 seeds for better comparison power? Also, please report parse-valid / total attempted calls before filtering.
Practical target:
- 200+ per platform for PR runs.
- 3 seeds (42, 1337, 2026).
- Keep regression-set run separate from random-sample run.
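To make the "comparison power" point concrete, here is a back-of-the-envelope illustration with my own numbers (not from the PR): the standard error of an observed parse rate shrinks with the square root of the sample size, so going from n=50 to n=200 halves the uncertainty.

```python
import math


def parse_rate_se(p: float, n: int) -> float:
    # Standard error of an observed proportion p-hat at sample size n.
    return math.sqrt(p * (1 - p) / n)


# At a hypothetical true 72.5% parse rate (midpoint of the reported
# 25-30% leak), the uncertainty halves going from n=50 to n=200:
se_50 = parse_rate_se(0.725, 50)    # ~0.063
se_200 = parse_rate_se(0.725, 200)  # ~0.032
```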
candidate_failures.jsonl and filter_stats.json are written conditionally (only when failures / stats exist), so a reused output_dir would leak a prior run's sidecars into ci_replay and surface them as current. Prep now unlinks both before each run, keeping "files in output_dir correspond to this run only" as an invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
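The prep invariant described above amounts to unlinking the two conditionally-written sidecars before each run. A minimal sketch, assuming the file names from the PR (the function name is mine):

```python
import tempfile  # used in the usage example below
from pathlib import Path

SIDECARS = ("candidate_failures.jsonl", "filter_stats.json")


def prep_output_dir(output_dir: Path) -> None:
    """Remove stale per-run sidecars so output_dir reflects this run only."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for name in SIDECARS:
        # unlink(missing_ok=True) needs Python 3.8+
        (output_dir / name).unlink(missing_ok=True)


# Usage: a stale failure file from a prior run is gone after prep.
out = Path(tempfile.mkdtemp())
(out / "candidate_failures.jsonl").write_text("stale")
prep_output_dir(out)
```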
@mariapiamo both P2 and P3 are the same root-cause bug — addressed in dec17aa.
Root cause: …
Fix: extracted the output-dir prep into …
Coverage (TDD): two new tests in …
Rejected alternatives: …
/benchmark superforcaster --sample 200

^ Did not finish: the run hit the timeout due to the number of markets.

/benchmark superforcaster --sample 100 --seed 1337

/benchmark superforcaster --sample 100 --seed 2026
bennyjo
left a comment
Re-reviewed on d6fd4f0. All three prior concerns resolved: schema step-numbering drift stripped at source, question_text backtick-escaped + inline-coded, and hard-coded baseline_parse_rate=1.0 now paired with an independent filter_stats.json sidecar that flags nonzero not_valid_parse counts.
@mariapiamo thanks — all three asks (larger sample, multi-seed, parse-valid / total-attempted before filtering) are fair as benchmark-infrastructure upgrades. Scoped them out of this PR into #233 so the #221 fix can land without pulling in the broader overhaul:
In the meantime I've triggered two extra runs on this PR.

On the "ceiling result" framing: agreed for the baseline side — 100% is a tautology because the enrich step pre-filters to valid rows. For the candidate side on this specific PR the guarantee is architectural rather than statistical: OpenAI Structured Outputs against the PredictionResult schema make free-form output unrepresentable.

The general upgrades in #233 still apply to everything the benchmark suite does after this PR, where the claims will be probabilistic rather than architectural.
Benchmark: superforcaster
Reliability
Per-platform breakdown
Omen (n=100)
Polymarket (n=101)
201 markets | triggered by @LOCKhart07
Benchmark: superforcaster
Reliability
Per-platform breakdown
Omen (n=100)
Polymarket (n=101)
201 markets | triggered by @LOCKhart07
…comment

Two reliability-labeling improvements to the /benchmark PR comment, per #231 review feedback and the follow-ups in #233:

1. Render baseline pre-filter parse rate alongside the existing Pre-filter line. Post-filter baseline=100% is a tautology because enrich drops non-valid rows; the pre-filter ratio (accepted / (accepted + not_valid_parse)) is what tells reviewers how noisy production actually was.
2. Thread --seed and --trigger-comment-url through the workflow into ci_replay so multi-seed runs posted by different triggering comments are distinguishable in-place. Seed lands in the footer (already rendered if present in meta, just never populated); trigger-comment URL wraps the @user mention in a markdown link back to the originating comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #221.
Summary
- `superforcaster` delivers: swap the contradictory JSON-in-prompt path (7-step XML reasoning chain + "output only JSON") for OpenAI Structured Outputs against a `PredictionResult` pydantic schema. The model physically cannot return free-form text anymore — the `<facts>`-leak mode that hit 25–30%/day on Gnosis (since 2026-04-08) is no longer representable. Superforecaster methodology (CALIBRATION base rate, EVIDENCE BAR, CONFIDENCE COUPLING, NUMERIC QUESTIONS, bias adjustments) is preserved verbatim — only the XML scaffolding and the per-step structural distinctness of steps 5 and 7 were compressed. See the commit body on 0bdce2cf for the methodology diff.
- `benchmark/prompt_replay`: add `_call_openai_structured` + a `tool_name → schema class` registry so tools using structured outputs (`superforcaster` now, `factual_research` already) can be replayed honestly instead of falling through to `chat.completions.create`.
- `benchmark/prompt_replay` + `benchmark/ci_replay`: surface parse reliability as a first-class metric. #221 ("superforcaster: `<facts>` reasoning leaks into toolResponse, ~25–30% of deliveries, stepped up 2026-04-08") was invisible to every existing dashboard because an on-chain deliver with a malformed `toolResponse` counted as a success — Brier and accuracy both skipped it. The replay summary and PR comment now report `valid/total` with a 4-bucket breakdown and flag any candidate drop vs baseline (non-valid responses are persisted to `candidate_failures.jsonl` and inlined in a collapsed `<details>` in the PR comment for forensic inspection).

On-chain output is unchanged — only the four standard mech fields (`p_yes`, `p_no`, `confidence`, `info_utility`) are serialised.

Test plan

- `pytest packages/valory/customs/superforcaster/tests` — 10/10 green (schema contract, on-chain shape, validator, source_content replay)
- `pytest benchmark/tests` — 307/307 green, including 10 new tests for `ci_replay` reliability rendering
- `prompt_replay` on an N=11 stratified sample: 11/11 candidate parse success (vs the 25–30% leak rate in prod)
- `/benchmark superforcaster` — see next comment