t1011: Model contest mode — run top-3 models in parallel, cross-rank results by marcusquinn · Pull Request #1304 · marcusquinn/aidevops

marcusquinn · 2026-02-12T21:53:17Z

Summary

Model contest mode for the supervisor (t1011). When model selection is uncertain, dispatches the same task to top-3 models in parallel, then cross-ranks all outputs to pick the winner. Builds permanent routing data for future model selection.

Ref #1301

Changes

New: contest-helper.sh (standalone orchestrator)

create — creates a contest with entries for top-3 models
dispatch — launches parallel workers (one per model) via supervisor
evaluate — cross-ranks outputs (each model scores all, anonymised as A/B/C)
apply — promotes winners PR, cancels losers
should-contest — detects when contest mode should trigger
pulse-check — for supervisor pulse integration
Records results in both pattern-tracker and response-scoring DBs

Modified: supervisor-helper.sh

DB migration: Contest tables (contests, contest_entries) created in ensure_db()
Model routing: resolve_task_model() detects model:contest (priority 0) and auto-contest via SUPERVISOR_CONTEST_AUTO=true (step 4.5)
Dispatch: cmd_dispatch() intercepts CONTEST model resolution, delegates to contest-helper
Pulse Phase 2.5: Checks running contests for completion, triggers evaluation + apply
Main router: Added contest command that delegates to contest-helper.sh

New: tests/test-contest-helper.sh

20 tests covering: syntax, ShellCheck, help, table creation, contest CRUD, duplicate prevention, should-contest logic, error handling
All 20 pass

Updated: subagent-index.toon

Added contest-helper.sh entry

Flow

Supervisor detects uncertainty (no pattern data, new task type, or explicit model:contest)
Dispatches 3 workers with same prompt to different models (e.g. opus, sonnet, gemini)
Collects outputs when all workers complete
Sends all 3 outputs to each model for blind cross-ranking (outputs anonymised as A/B/C)
Aggregates scores, picks winner
Records results in pattern-tracker and response-scoring DB
Applies winning output (promotes winners PR, cancels losers)

Trigger conditions

Explicit: model:contest in TODO.md task entry
Auto (opt-in via SUPERVISOR_CONTEST_AUTO=true):
- No pattern data for the task type
- Insufficient samples (<3)
- Low success rate (<75%)

Cost

~3x a single run, but builds permanent routing data. Only triggers for genuinely uncertain cases.

Testing

All 20 tests pass: bash tests/test-contest-helper.sh --verbose

Summary by CodeRabbit

Release Notes

New Features
- Added contest mode: submit uncertain tasks to top 3 models simultaneously and automatically select the best result based on cross-ranking evaluation.
- Introduced contest command with controls for creating contests, dispatching to models, evaluating results, and managing outcomes.
Tests
- Added comprehensive test coverage for contest mode functionality.

gemini-code-assist · 2026-02-12T21:53:21Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-02-12T21:53:27Z

Walkthrough

Introduces a model contest mode system that dispatches uncertain tasks to top-3 models in parallel, collects anonymized outputs, cross-ranks results through weighted scoring criteria, and promotes the winner's output to the original task. Includes SQLite-backed contest lifecycle management, new supervisor command delegation, and comprehensive test coverage.

Changes

Cohort / File(s)	Summary
Contest Helper & Tests `.agents/scripts/contest-helper.sh`, `tests/test-contest-helper.sh`	New contest orchestration system with SQLite persistence for contests/contest_entries tables, model selection via registry, parallel task dispatch, anonymized output collection, multi-judge cross-ranking with weighted scoring, winner promotion, and full test harness covering lifecycle, DB schema, duplicate prevention, and CLI commands.
Supervisor Integration `.agents/scripts/supervisor-helper.sh`	Adds contest mode detection in model resolution (task_model=="contest" → "CONTEST"), delegates dispatch to contest-helper.sh, integrates new public `contest` CLI command via `cmd_contest()`, extends pulse-check with Phase 2.5 contest evaluation tracking, and manages DB schema migrations for contests and contest_entries tables.
Registry Entry `.agents/subagent-index.toon`	Registers contest-helper.sh in model registry with description of contest mode capabilities (create, dispatch, status, evaluate, apply, list, should-contest, pulse-check).

Sequence Diagram

sequenceDiagram
    participant Client
    participant Supervisor
    participant ContestHelper
    participant ModelRegistry
    participant Database as SQLite<br/>DB
    participant Models
    participant Judges
    
    Client->>Supervisor: dispatch task with model="contest"
    Supervisor->>Supervisor: resolve_task_model() → "CONTEST"
    Supervisor->>ContestHelper: create contest
    ContestHelper->>Database: INSERT contests, contest_entries
    ContestHelper->>ModelRegistry: select_top_models(3)
    ModelRegistry-->>ContestHelper: model1, model2, model3
    Supervisor->>ContestHelper: dispatch contest
    ContestHelper->>Models: dispatch subtask to each model (parallel)
    Models->>Models: process task
    Models->>Database: store results
    Supervisor->>ContestHelper: evaluate contest (when running)
    ContestHelper->>Database: fetch anonymized outputs (A/B/C)
    ContestHelper->>Judges: request cross-ranking scores
    Judges->>Judges: evaluate alternatives
    Judges-->>ContestHelper: scores per entry
    ContestHelper->>Database: aggregate weighted scores, determine winner
    ContestHelper->>Supervisor: apply winner (promote output)
    Supervisor->>Database: update original task with winner output
    Supervisor->>Models: cancel loser subtasks
    Database-->>Supervisor: contest marked complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

t1011: Model contest mode #1301 — Directly implements the t1011 "model contest mode" feature with full lifecycle management, model dispatch, cross-ranking, and scoring integration as specified in the issue.

Possibly related PRs

feat: response comparison and scoring framework for model evaluation (t168.3) #773 — Contest-helper integrates with response-scoring system to record aggregate judge scores and winner selection metrics for long-term scoring analysis.
feat: add supervisor-helper.sh with SQLite schema and state machine (t128.1) #376 — Supervisor-helper modifications (contest command, DB schema migrations, dispatch delegation) extend the same script and command handling infrastructure.

Poem

🏆 Three models enter, one shall rise,
Anonymized to judge with clear eyes,
Contests crowned by weighted might,
Winners flow where tasks took flight,
Contest mode brings clarity to sight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 76.47% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection	⚠️ Warning	❌ Merge conflicts detected (6 files): ⚔️ `.agents/scripts/issue-sync-helper.sh` (content) ⚔️ `.agents/scripts/supervisor-helper.sh` (content) ⚔️ `.agents/scripts/supervisor/issue-sync.sh` (content) ⚔️ `.agents/subagent-index.toon` (content) ⚔️ `TODO.md` (content) ⚔️ `todo/VERIFY.md` (content) These conflicts must be resolved before merging into `main`.	Resolve conflicts locally and push changes to this branch.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the primary change: introducing model contest mode that dispatches tasks to top-3 models in parallel and cross-ranks results, which aligns with the core feature added across contest-helper.sh and supervisor-helper.sh.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/t1011

⚔️ Resolve merge conflicts (beta)

Auto-commit resolved conflicts to branch feature/t1011
Create stacked PR with resolved conflicts
Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-02-12T21:53:46Z

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 0 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Thu Feb 12 21:53:42 UTC 2026: Code review monitoring started
Thu Feb 12 21:53:42 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 0

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 0
VULNERABILITIES: 0

Generated on: Thu Feb 12 21:53:45 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

… routing (t1011)

…SC2155 (t1011)

github-actions · 2026-02-12T22:18:18Z

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 0 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Thu Feb 12 22:18:15 UTC 2026: Code review monitoring started
Thu Feb 12 22:18:15 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 0

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 0
VULNERABILITIES: 0

Generated on: Thu Feb 12 22:18:17 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-12T22:19:12Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist · 2026-02-12T22:19:30Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.agents/scripts/supervisor-helper.sh (1)
13328-13347: ⚠️ Potential issue | 🟠 Major

Avoid embedding GH_TOKEN in crontab (plaintext secret exposure).
Line 13333–13345 injects GH_TOKEN directly into the cron entry, which is readable via crontab -l and often visible to system tooling. The script already resolves GH_TOKEN at runtime (cache/keyring/credentials), so you can keep the cron entry free of secrets and rely on the existing token resolution logic.
🔒 Proposed fix (keep PATH, drop GH_TOKEN from cron entry)
-	# Detect GH_TOKEN from gh CLI if available (t1006)
-	local gh_token=""
-	if command -v gh &>/dev/null; then
-		gh_token=$(gh auth token 2>/dev/null || true)
-	fi
-
 	# Build cron command with environment variables
 	local env_vars=""
 	if [[ -n "$user_path" ]]; then
 		env_vars="PATH=${user_path}"
 	fi
-	if [[ -n "$gh_token" ]]; then
-		env_vars="${env_vars:+${env_vars} }GH_TOKEN=${gh_token}"
-	fi

🤖 Fix all issues with AI agents

In @.agents/scripts/contest-helper.sh:
- Around line 57-61: The db() helper currently hides all sqlite3 stderr by
appending "2>/dev/null" which hides real errors; change db() to stop
unconditionally discarding stderr — remove the global "2>/dev/null" and instead
add an optional suppression parameter or respect an environment flag (e.g.,
DB_SILENT) so callers can opt into silencing; ensure db() invokes sqlite3 with
the same arguments but leaves stderr unredirected by default, and update call
sites that intentionally probe (e.g., table-existence checks) to pass the
suppression flag or explicitly redirect stderr to /dev/null there or to a log
file.
- Around line 964-968: The loop clamping scores currently uses eval for indirect
assignment (loop over int_correct int_complete int_quality int_clarity), replace
eval with bash's printf -v to set the variable by name without eval: after
computing local val="${!var}", use printf -v "$var" '%s' "$val" (or the clamped
numeric value) so the four variables are safely updated without eval; update the
block that iterates over int_correct/int_complete/int_quality/int_clarity to use
printf -v for assignment.
- Around line 186-202: The integer-comparison fails when sed yields empty
strings for total_samples or success_rate; after extracting from pattern_json
(variables total_samples and success_rate), ensure they default to 0 before the
[[ comparisons — e.g. immediately after the sed assignments set
total_samples=${total_samples:-0} and success_rate=${success_rate:-0} (or use
parameter expansion when comparing) so [[ "$total_samples" -lt 3 ]] and [[
"$success_rate" -lt 75 ]] always receive numeric values.
- Around line 341-361: The loop that splits $models uses "local IFS=','" and
iterates an unquoted $models, then calls unset IFS — replace that fragile IFS
manipulation by reading $models into an array with read -ra (e.g. read -ra
model_arr <<< "$models") and iterate over "${model_arr[@]}"; keep the existing
logic that increments model_index and constructs entry_id/entry_task_id and the
db/sql_escape/log_info calls (references: models, model_index, entry_id,
entry_task_id, db, sql_escape, log_info), and remove the local IFS/unset IFS
handling.
- Around line 1139-1181: The cmd_pulse_check logic currently only selects
contests that already have zero non-terminal entries, so _sync_entry_statuses
never runs for contests with stale dispatched/running subtasks; change the flow
to first enumerate running contests, call _sync_entry_statuses for each
contest_id, then re-query that contest's entries to see if pending count is zero
and proceed to cmd_evaluate/cmd_apply; specifically update cmd_pulse_check to
fetch running contest ids (no subquery filtering), call _sync_entry_statuses
"$contest_id" immediately for each, then run the existing pending-count query
and evaluation steps for that same contest_id.
- Around line 724-731: The script currently passes the large variable
ranking_prompt directly to opencode via --prompt which can hit ARG_MAX and the
trailing "|| true" hides E2BIG failures; modify the block that invokes opencode
(the timeout/opencode run call) to write ranking_prompt to a temp file (e.g.,
use the existing score_tmpfile pattern or a new prompt_tmpfile) and pass that
file to opencode using the CLI's file-based input option (or feed via stdin)
instead of --prompt "$ranking_prompt"; also remove the unconditional "|| true"
and handle non-zero exit by logging the error and preserving any opencode stderr
for debugging so you don't silently drop E2BIG errors.
- Around line 643-651: The script currently hardcodes "main" when building the
diffs (variables summary and full_diff using git -C "$ewt" diff "main..HEAD")
and swallows git errors, which hides repos whose default is "master" or another
branch; change it to detect the repo's default branch first (e.g., run git -C
"$ewt" to get origin/HEAD via symbolic-ref or rev-parse and strip the "origin/"
prefix) into a variable like base_branch, fall back to "main" only if detection
fails, then use "$base_branch..HEAD" for both git diff --stat and git diff, and
stop redirecting stderr to /dev/null so failures surface (or at least preserve
error output for logging) instead of silently returning "No diff available";
update references to ewt, summary, and full_diff accordingly.

In @.agents/scripts/supervisor-helper.sh:
- Around line 5859-5887: Phase 1 is prematurely evaluating contest tasks because
they remain status 'running' and lack PID files; fix by either (A) changing the
Phase 1 selection query to exclude tasks where error LIKE 'contest:%' (i.e. add
AND error NOT LIKE 'contest:%' to the tasks query used by Phase 1), or (B) when
delegating to contest-helper.sh in the contest branch (the block that calls
contest-helper.sh, sets contest_id and calls db "...UPDATE tasks SET error =
'contest:${contest_id}'..." and then returns), update the task row to a distinct
status such as 'contest_running' instead of leaving it 'running' (modify the db
UPDATE in that block to set status='contest_running' and
error='contest:${contest_id}'), and ensure Phase 1 excludes 'contest_running'
from its status IN (...) checks; implement one of these approaches so Phase 1
skips in-flight contest tasks.

🧹 Nitpick comments (2)

.agents/scripts/contest-helper.sh (2)
1036-1042: Storing contest-winner metadata in the error column is semantically misleading.

Downstream tooling or dashboards that treat non-null error as a failure indicator will misreport winning tasks as errored. Consider using the metadata or description field on the task instead, or adding a dedicated column.

733-743: Fragile regex-based JSON parsing — consider jq with a fallback.

The grep -oE '\{[^}]*"label"[^}]*\}' breaks on multi-line or nested JSON output from the judge models. Since this is the core scoring path, a mis-parse silently produces zero scores and a meaningless winner.

jq is widely available in DevOps environments and would handle edge cases robustly:
Proposed approach
+		if command -v jq &>/dev/null; then
+			json_scores=$(echo "$score_output" | jq -c 'select(.label)' 2>/dev/null || true)
+		else
 			json_scores=$(echo "$score_output" | grep -oE '\{[^}]*"label"[^}]*\}' || true)
+		fi

coderabbitai · 2026-02-12T22:25:52Z