
fix(superforcaster): switch to OpenAI Structured Outputs (#221)#231

Merged
LOCKhart07 merged 13 commits into main from fix/221-superforcaster-structured-outputs
Apr 17, 2026

Conversation

@LOCKhart07
Member

Closes #221.

Summary

  • superforcaster delivers: swap the contradictory JSON-in-prompt path (7-step XML reasoning chain + "output only JSON") for OpenAI Structured Outputs against a PredictionResult pydantic schema. The model physically cannot return free-form text anymore — the <facts>-leak mode that hit 25–30%/day on Gnosis (since 2026-04-08) is no longer representable. Superforecaster methodology (CALIBRATION base rate, EVIDENCE BAR, CONFIDENCE COUPLING, NUMERIC QUESTIONS, bias adjustments) is preserved verbatim — only the XML scaffolding and the per-step structural distinctness of steps 5 and 7 were compressed. See the commit body on 0bdce2cf for methodology diff.
  • benchmark/prompt_replay: add _call_openai_structured + a tool_name → schema class registry so tools using structured outputs (superforcaster now, factual_research already) can be replayed honestly instead of falling through to chat.completions.create.
  • benchmark/prompt_replay + benchmark/ci_replay: surface parse reliability as a first-class metric. superforcaster: <facts> reasoning leaks into toolResponse (~25–30% of deliveries, stepped up 2026-04-08) #221 was invisible to every existing dashboard because an on-chain deliver with a malformed toolResponse counted as a success — Brier and accuracy both skipped it. The replay summary and PR comment now report valid/total with a 4-bucket breakdown and flag any candidate drop vs baseline (⚠️); failure bodies are persisted to candidate_failures.jsonl and inlined in a collapsed <details> in the PR comment for forensic inspection.

On-chain output is unchanged — only the four standard mech fields (p_yes, p_no, confidence, info_utility) are serialised.
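The schema split described above (reasoning fields for the model, four mech fields on-chain) can be sketched as follows. This is a minimal sketch assuming pydantic v2; the field constraints, the sum check, and the `to_on_chain` helper are inferred from the commit messages, not copied from the PR:

```python
from pydantic import BaseModel, Field, model_validator

class PredictionResult(BaseModel):
    # Reasoning-chain fields survive as schema fields (per the commit body);
    # they are consumed by the model but never serialised on-chain.
    facts: str
    reasons_no: str
    reasons_yes: str
    aggregation: str
    reflection: str
    # The four standard mech fields.
    p_yes: float = Field(ge=0.0, le=1.0)
    p_no: float = Field(ge=0.0, le=1.0)
    confidence: float = Field(ge=0.0, le=1.0)
    info_utility: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def _probabilities_sum_to_one(self):
        # Illustrative version of the "p_yes + p_no sum check" the retry
        # logic catches as a ValidationError.
        if abs(self.p_yes + self.p_no - 1.0) > 1e-6:
            raise ValueError("p_yes + p_no must sum to 1")
        return self

ON_CHAIN_FIELDS = ("p_yes", "p_no", "confidence", "info_utility")

def to_on_chain(result: PredictionResult) -> dict:
    """Serialise only the four standard mech fields."""
    return {k: getattr(result, k) for k in ON_CHAIN_FIELDS}
```

Passing a schema like this to `client.beta.chat.completions.parse(..., response_format=PredictionResult)` is what makes free-form text unrepresentable: the API constrains decoding to the schema.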

Test plan

  • pytest packages/valory/customs/superforcaster/tests — 10/10 green (schema contract, on-chain shape, validator, source_content replay)
  • pytest benchmark/tests — 307/307 green, including 10 new tests for ci_replay reliability rendering
  • Local prompt_replay on N=11 stratified sample: 11/11 candidate parse success (vs the 25-30% leak rate in prod)
  • Larger CI sweep via /benchmark superforcaster — see next comment

LOCKhart07 and others added 4 commits April 17, 2026 01:21
The self-contradicting prompt (7-step XML reasoning chain then "output
only JSON") let the model leak reasoning into toolResponse at 25-30%/day
since the 2026-04-06 max_tokens bump.

Use client.beta.chat.completions.parse with a PredictionResult pydantic
schema so the model physically cannot return free-form text. The
reasoning chain survives as separate schema fields (facts, reasons_no,
reasons_yes, aggregation, reflection); on-chain result is unchanged -
only the four standard mech fields (p_yes, p_no, confidence,
info_utility) are serialised. Prompt methodology (calibration, evidence
bar, confidence coupling, numeric-question check) is preserved verbatim;
XML tag delimiters dropped.

Retries now cover pydantic ValidationError (e.g. p_yes + p_no sum
check), network blips and OpenAI-side transient errors.
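The retry behaviour described here can be sketched as a single bounded loop. This is a hypothetical wrapper, not the PR's exact code; in the real tool the retryable tuple would also include `pydantic.ValidationError` and the OpenAI SDK's transient error types:

```python
import time

def call_with_retries(fn, attempts=3, backoff=1.0,
                      retryable=(ConnectionError, TimeoutError)):
    # Schema validation failures, network blips, and API-side transient
    # errors all funnel into the same bounded retry loop with
    # exponential backoff; the last exception is re-raised on exhaustion.
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except retryable as exc:
            last_exc = exc
            time.sleep(backoff * (2 ** attempt))
    raise last_exc
```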

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e path

Tools using OpenAI Structured Outputs (superforcaster, factual_research)
can't be replayed through the plain chat.completions.create helper -
their prompts no longer contain format directives, so the model returns
free-form text and candidate parse fails.

Add _call_openai_structured() that takes the caller's Pydantic schema
and returns a JSON string of only the four on-chain fields, so the
downstream parse_response stays tool-agnostic. A small name->schema
registry (_STRUCTURED_OUTPUT_SCHEMAS) lets replay() dispatch by tool
name - adding another structured-output tool is one registry entry.
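The registry dispatch described above might look like the following. The two classes are stand-ins (the real schemas live with the tools themselves), and `resolve_schema` is an illustrative name for the lookup `replay()` performs:

```python
class PredictionResult:        # stand-in for the superforcaster schema
    pass

class FactualResearchResult:   # stand-in for the factual_research schema
    pass

# One registry entry per structured-output tool; adding a new tool is
# one line here.
_STRUCTURED_OUTPUT_SCHEMAS = {
    "superforcaster": PredictionResult,
    "factual_research": FactualResearchResult,
}

def resolve_schema(tool_name):
    """None means the tool falls back to plain chat.completions.create."""
    return _STRUCTURED_OUTPUT_SCHEMAS.get(tool_name)
```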

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #221 was invisible to every existing dashboard because an
on-chain deliver with a malformed toolResponse counted as a success -
Brier and accuracy both skipped it silently. Same blind spot existed in
the prompt_replay summary, which only reported "N candidate scored"
buried in a parenthetical.

Track prediction_parse_status per candidate during the replay loop,
emit an explicit "Parse reliability" block (valid/total, 4-bucket
breakdown, delta vs baseline) above the existing Brier block, and
persist any non-valid responses to candidate_failures.jsonl alongside
baseline.jsonl / candidate.jsonl for forensic inspection. No exit-code
change - regression stays visible-but-not-blocking for now.
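The per-candidate tracking can be sketched as a simple tally. Bucket names follow the breakdown rendered later in the thread (valid, missing_fields, malformed, error); the helper name and return shape are assumptions:

```python
from collections import Counter

def parse_reliability(statuses):
    # Seed all four buckets so the breakdown always renders every bucket,
    # even when its count is zero.
    buckets = Counter({"valid": 0, "missing_fields": 0, "malformed": 0, "error": 0})
    buckets.update(statuses)
    total = sum(buckets.values())
    return {
        "total": total,
        "valid": buckets["valid"],
        "parse_rate": buckets["valid"] / total if total else 0.0,
        "breakdown": dict(buckets),
    }
```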

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends compute_metrics with a parse_reliability block (total, valid,
parse_rate, 4-bucket breakdown) and renders it above the Brier table in
the PR comment. Reads candidate_failures.jsonl (written by prompt_replay
when any candidate fails to parse) and inlines up to 5 failure bodies in
a collapsed <details> so leaks like #221's <facts>-leak are diagnosable
from the PR thread without hunting through CI logs.

Candidate parse-rate drop vs baseline is flagged with ⚠️; no change is
✅. Body content is backtick-escaped to prevent code-fence breakout.
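One way to implement that defence (a sketch of the idea, not the PR's exact code): replace literal backticks so an embedded ``` can never terminate the surrounding fence, then truncate to keep the PR comment within size limits. The function name and length cap are illustrative:

```python
def escape_fence_content(body, max_len=1500):
    # A body containing ``` (or any backtick run) could close the
    # collapsed <details> code fence and inject markdown/HTML into the
    # public PR comment; replacing backticks makes that unrepresentable.
    return body.replace("`", "'")[:max_len]
```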

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LOCKhart07 LOCKhart07 self-assigned this Apr 16, 2026
LOCKhart07 and others added 2 commits April 17, 2026 01:50
- black/isort re-wraps across the three benchmark files touched
- class docstrings on the three TestCi* classes
- drop unused json import, capitalise D403 docstring
- add full darglint param/return/raises to _parse_completion

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LOCKhart07
Member Author

/benchmark superforcaster --sample 50

@github-actions

Benchmark: superforcaster

Parse reliability

  • Baseline: 100/100 (100.0%) [valid-only by filter]
  • Candidate: 100/100 (100.0%) — ✅ same as baseline
  • Candidate breakdown: valid=100, missing_fields=0, malformed=0, error=0

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.2537 | 0.2362 | -6.9% |
| Directional Accuracy | 68.0% | 69.0% | +1.5% |
| Overconf-wrong | 24 | 21 | -12.5% |
| Overconf-wrong rate | 0.2400 | 0.2100 | -12.5% |

Per-platform breakdown

Omen (n=50)

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.1924 | 0.1434 | -25.5% |
| Directional Accuracy | 76.0% | 84.0% | +10.5% |
| Overconf-wrong | 10 | 7 | -30.0% |
| Overconf-wrong rate | 0.2000 | 0.1400 | -30.0% |

Polymarket (n=50)

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.3150 | 0.3290 | +4.5% |
| Directional Accuracy | 60.0% | 54.0% | -10.0% |
| Overconf-wrong | 14 | 14 | +0.0% |
| Overconf-wrong rate | 0.2800 | 0.2800 | +0.0% |

100 markets | triggered by @LOCKhart07

Collaborator

@bennyjo bennyjo left a comment


Three suggestions on the benchmark-side plumbing: schema/prompt step-numbering mismatch, unescaped question_text in the PR comment, and a hard-coded baseline_parse_rate=1.0 that can silently lie. Core fix (Structured Outputs) looks good and should close #221.

Comment thread packages/valory/customs/superforcaster/superforcaster.py Outdated
Comment thread benchmark/ci_replay.py Outdated
Comment thread benchmark/prompt_replay.py
LOCKhart07 and others added 5 commits April 17, 2026 03:25
Schema descriptions no longer carry "Step N —" prefixes; the prompt
body remains the single source of ordering. Addresses drift where
reflection was labelled Step 6 in the schema but Step 5 in the prompt
body after steps 5 and 7 were collapsed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap candidate-failure question text in inline code and replace literal
backticks, matching the existing defence on raw_response. Prevents a
crafted on-chain question containing </details>, backticks, or HTML
from breaking the collapsed <details> block in the public PR comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visible

Replace hardcoded ``baseline_parse_rate = 1.0`` with an accounted
invariant. ``_load_and_filter_rows`` now returns per-reason rejection
counts; ``enrich`` persists them as a ``{output}.filter_stats.json``
sidecar; ``_log_replay_summary`` and the ci_replay PR comment render
a Pre-filter block with a warning marker when ``not_valid_parse`` is
nonzero. The other four rejection buckets are expected sample scoping
and stay informational only.

The replay summary still declares baseline as 100% by construction —
but now there is an independent observation that would surface a
regression in the upstream "drop non-valid parses" behaviour, instead
of silently repeating the same #221 failure mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… Reliability section

Baseline parse rate is always 100% by construction (the enrich filter
drops non-valid rows) so a baseline-vs-candidate Parse reliability
block compares a tautology against a measurement. Drop that framing:
move the primary metrics table to the top of the PR comment, then
render a single Reliability section below with two one-sided bullets —

- Candidate parse rate: N/M (X.X%) ✅|⚠️
- Pre-filter (enrich): A accepted, R rejected, not_valid_parse=N ✅|⚠️

Breakdown and scoping lines only surface on drift / rejections, so the
happy path stays tight. The comparison table now leads the comment,
which is what reviewers actually scan first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@mariapiamo mariapiamo left a comment


P2
Stale parse-failure artifacts can leak into new reports
candidate_failures.jsonl is only written when failures exist and is never removed when a run has zero failures. Because ci_replay.py reads this file whenever it exists, reusing an output directory can make a clean run report old failures as if they were current.

P3
Pre-filter stats can become stale across runs
filter_stats.json is only written when sidecar stats are present. If a previous run left this file behind and a later run has no sidecar, ci_replay.py can still load old stats and show misleading pre-filter numbers for the current benchmark.

Note re ci_replay

I know the regression set is implemented, but in ci_replay both baseline and candidate are 100/100 parse-valid, so this is a ceiling result and doesn’t prove improved reliability. Can we increase the replay sample (e.g., 200 per platform) and run 2–3 seeds for better comparison power? Also, please report parse-valid / total attempted calls before filtering.
Practical target:

  • 200+ per platform for PR runs.
  • 3 seeds (42, 1337, 2026).
  • Keep regression-set run separate from random-sample run.

candidate_failures.jsonl and filter_stats.json are written conditionally
(only when failures / stats exist), so a reused output_dir would leak a
prior run's sidecars into ci_replay and surface them as current. Prep now
unlinks both before each run, keeping "files in output_dir correspond to
this run only" as an invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LOCKhart07
Member Author

@mariapiamo both P2 and P3 are the same root-cause bug — addressed in dec17aa.

P2 candidate_failures.jsonl is only written when failures exist and is never removed when a run has zero failures. Because ci_replay.py reads this file whenever it exists, reusing an output directory can make a clean run report old failures as if they were current.

P3 filter_stats.json is only written when sidecar stats are present. If a previous run left this file behind and a later run has no sidecar, ci_replay.py can still load old stats and show misleading pre-filter numbers for the current benchmark.

Root cause: output_dir.mkdir(parents=True, exist_ok=True) preserved prior-run contents, and both sidecars are written only on non-empty paths, so a clean run silently inherited stale files.

Fix. Extracted the output-dir prep into _prepare_output_dir(output_dir) in benchmark/prompt_replay.py, which now mkdirs and then unlink(missing_ok=True)s both sidecars at the top of every replay. Keeps the invariant "files in output_dir correspond to this run only" at the exact seam where it was being broken.

Coverage (TDD): two new tests in TestPrepareOutputDir (seed stale sidecar → call prep → assert absent) failed against the old code and pass now; plus guard tests that unrelated artifacts like enriched_with_new_reasoning.jsonl are preserved and that prep is a no-op when sidecars are absent. Full benchmark/tests/ at 465/465.

Rejected alternatives.

  • Always-write with empty body — lies about filter_stats.json: missing legitimately means "older pipeline without the enrich sidecar" (see the comment at ci_replay.py:476), so writing {} would claim we have stats we don't.
  • Nuke the whole output_dir — would also wipe enriched_with_new_reasoning.jsonl that users may deliberately chain between iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LOCKhart07
Member Author

LOCKhart07 commented Apr 17, 2026

/benchmark superforcaster --sample 200

^ Did not finish; the run hit the timeout because of the number of markets.

@LOCKhart07
Member Author

/benchmark superforcaster --sample 100 --seed 1337

@LOCKhart07
Member Author

/benchmark superforcaster --sample 100 --seed 2026

Collaborator

@bennyjo bennyjo left a comment


Re-reviewed on d6fd4f0. All three prior concerns resolved: schema step-numbering drift stripped at source, question_text backtick-escaped + inline-coded, and hard-coded baseline_parse_rate=1.0 now paired with an independent filter_stats.json sidecar that flags ⚠️ if the upstream invariant regresses. Stale-sidecar purge (dec17aa) closes mariapiamo's P2/P3. Critical fix — ship it.

@LOCKhart07
Member Author

@mariapiamo thanks — all three asks (larger sample, multi-seed, parse-valid / total-attempted before filtering) are fair as benchmark-infrastructure upgrades. Scoped them out of this PR into #233 so the #221 fix can land without pulling in the broader overhaul:

  • render baseline pre-filter parse rate in the Reliability block
  • thread --seed into the comment footer for legibility across multi-seed runs
  • first-class multi-seed benchmark runs with aggregated output
  • curated regression-set run separate from random-sample run

In the meantime I've triggered two extra runs on this PR at --sample 100 with seeds 1337 and 2026. Between those and the original --sample 50 --seed 42 run we'll have three independent data points on candidate parse reliability — will post the numbers once the two new comments land.

On the "ceiling result" framing: agreed for the baseline side — 100% is a tautology because the enrich step pre-filters to valid rows. For the candidate side on this specific PR the guarantee is architectural rather than statistical: OpenAI Structured Outputs against the PredictionResult pydantic schema physically can't emit the XML-scaffolding → free-form JSON mode that #221 was seeing in prod. So 100/100 (and the upcoming 200/200 × 2 seeds) is less "our point estimate happens to be 100%" and more "does this failure mode still exist at all?" — any single non-parseable candidate across 3 × 200 = 600 calls would show up here.

The general upgrades in #233 still apply to everything the benchmark suite does after this PR, where the claims will be probabilistic rather than architectural.

@github-actions

Benchmark: superforcaster

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.3203 | 0.2802 | -12.5% |
| Directional Accuracy | 59.5% | 64.7% | +8.7% |
| Overconf-wrong | 54 | 43 | -20.4% |
| Overconf-wrong rate | 0.2687 | 0.2139 | -20.4% |

Reliability

  • Candidate parse rate: 201/201 (100.0%) ✅
  • Pre-filter (enrich): 1070 accepted, 1223 rejected, not_valid_parse=0 ✅
    • Scoping: wrong_tool=1223, no_deliver_id=0, no_outcome=0, older_than_cutoff=0
Per-platform breakdown

Omen (n=100)

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.2908 | 0.2809 | -3.4% |
| Directional Accuracy | 69.0% | 70.0% | +1.4% |
| Overconf-wrong | 31 | 30 | -3.2% |
| Overconf-wrong rate | 0.3100 | 0.3000 | -3.2% |

Polymarket (n=101)

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.3496 | 0.2795 | -20.1% |
| Directional Accuracy | 50.0% | 59.4% | +18.8% |
| Overconf-wrong | 23 | 13 | -43.5% |
| Overconf-wrong rate | 0.2277 | 0.1287 | -43.5% |

201 markets | triggered by @LOCKhart07

@github-actions

Benchmark: superforcaster

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.3082 | 0.2616 | -15.1% |
| Directional Accuracy | 60.7% | 64.7% | +6.6% |
| Overconf-wrong | 50 | 36 | -28.0% |
| Overconf-wrong rate | 0.2488 | 0.1791 | -28.0% |

Reliability

  • Candidate parse rate: 201/201 (100.0%) ✅
  • Pre-filter (enrich): 1070 accepted, 1223 rejected, not_valid_parse=0 ✅
    • Scoping: wrong_tool=1223, no_deliver_id=0, no_outcome=0, older_than_cutoff=0
Per-platform breakdown

Omen (n=100)

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.2796 | 0.2385 | -14.7% |
| Directional Accuracy | 69.0% | 74.0% | +7.2% |
| Overconf-wrong | 31 | 26 | -16.1% |
| Overconf-wrong rate | 0.3100 | 0.2600 | -16.1% |

Polymarket (n=101)

| Metric | Baseline (prod) | Candidate (PR) | Delta |
| --- | --- | --- | --- |
| Brier score | 0.3365 | 0.2845 | -15.4% |
| Directional Accuracy | 52.5% | 55.4% | +5.7% |
| Overconf-wrong | 19 | 10 | -47.4% |
| Overconf-wrong rate | 0.1881 | 0.0990 | -47.4% |

201 markets | triggered by @LOCKhart07

@LOCKhart07 LOCKhart07 merged commit eb69009 into main Apr 17, 2026
16 checks passed
@LOCKhart07 LOCKhart07 deleted the fix/221-superforcaster-structured-outputs branch April 17, 2026 12:13
LOCKhart07 added a commit that referenced this pull request Apr 17, 2026
…comment

Two reliability-labeling improvements to the /benchmark PR comment, per
#231 review feedback and the follow-ups in #233:

1. Render baseline pre-filter parse rate alongside the existing Pre-filter
   line. Post-filter baseline=100% is a tautology because enrich drops
   non-valid rows; the pre-filter ratio (accepted / (accepted + not_valid_parse))
   is what tells reviewers how noisy production actually was.

2. Thread --seed and --trigger-comment-url through the workflow into
   ci_replay so multi-seed runs posted by different triggering comments
   are distinguishable in-place. Seed lands in the footer (already rendered
   if present in meta, just never populated); trigger-comment URL wraps
   the @user mention in a markdown link back to the originating comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

superforcaster: <facts> reasoning leaks into toolResponse (~25–30% of deliveries, stepped up 2026-04-08)

3 participants