t1329: Cross-review judge pipeline and /cross-review slash command #2269
marcusquinn wants to merge 3 commits into main from
Conversation
- Add _resolve_cross_review_auth() for ANTHROPIC_API_KEY / OAuth auth
- Add _judge_cross_review() that dispatches all model outputs to a judge model (default: opus) via Anthropic API, returns structured JSON scores (correctness, completeness, quality, clarity, overall, winner)
- Extend cmd_cross_review() with --score, --judge, --task-type flags
- Wire judge output into cmd_score() -> model-comparisons SQLite DB
- Sync winner/loser data to pattern tracker via ab-compare (source: cross-review-judge)
- Create /cross-review slash command at .agents/scripts/commands/cross-review.md
- Update cmd_help() with new flags and examples
- ShellCheck clean, all 45 response-scoring tests pass

Inspired by Ouroboros multi-model adversarial review (o3/Gemini/Claude consensus). Chose Anthropic API direct call (matches routine-scheduler.sh pattern) over runner-helper dispatch for the judge — simpler, no runner lifecycle overhead for a single synchronous call.
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the AI model cross-review functionality by integrating an automated judge scoring pipeline. It allows for objective evaluation of multiple AI model responses against a given prompt, recording detailed scores and identifying a winner. This system not only provides structured feedback but also feeds into a pattern tracker to improve future model routing decisions, making the model comparison process more robust and data-driven.

Highlights
Changelog
Activity
Walkthrough

Introduces a multi-model cross-review workflow with judge-based scoring capabilities. The feature allows parallel model comparison, automatic diff computation, and optional scoring through a configurable judge model that normalizes results and integrates with pattern tracking and persistent storage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CrossReview as Cross-Review<br/>(cmd_cross_review)
    participant Judge as Judge Model<br/>API
    participant PatternTracker as Pattern<br/>Tracker
    participant Storage as Judge<br/>Storage
    User->>CrossReview: Invoke with --score flag
    CrossReview->>CrossReview: Dispatch prompt to multiple models
    CrossReview->>CrossReview: Collect outputs & compute diffs
    CrossReview->>Judge: Invoke judge with prompt + outputs
    Judge->>Judge: Analyze & score models
    Judge-->>CrossReview: Return JSON scores (1-10)
    CrossReview->>CrossReview: Normalize scores (1-10 → 1-5)
    CrossReview->>CrossReview: Update model-scoring display
    CrossReview->>PatternTracker: Feed A/B comparison data
    PatternTracker-->>CrossReview: Pattern tracking complete
    CrossReview->>Storage: Persist judge outputs
    Storage-->>CrossReview: Stored to judge-scores.json
    CrossReview-->>User: Return results & scoring
```
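The 1-10 → 1-5 normalization step in the diagram can be sketched as a small shell helper. This is a rough illustration only; the function name and the halve-round-clamp behavior are assumptions, not taken from the PR:

```bash
#!/usr/bin/env bash
# Map a judge score on the 1-10 scale onto the 1-5 scale used by the
# model-scoring display: halve, round to nearest integer, clamp to [1, 5].
normalize_score() {
  local ten_scale=$1
  awk -v s="$ten_scale" 'BEGIN {
    v = int((s / 2) + 0.5)
    if (v < 1) v = 1
    if (v > 5) v = 5
    print v
  }'
}

normalize_score 10   # -> 5
normalize_score 7    # -> 4
normalize_score 1    # -> 1
```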
Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs
🚥 Pre-merge checks: ✅ 3 passed
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics

Generated on: Wed Feb 25 02:51:13 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces a significant new feature: an automated judge pipeline for cross-model reviews, complete with a new /cross-review slash command. The implementation is comprehensive, covering API authentication, parallel model execution, dispatching to a judge model, and recording scores in a database.
My review focuses on improving the script's robustness and error handling. I've identified a few places where error output is suppressed (2>/dev/null), which could hide critical issues with API calls or database writes. I've also suggested a safer way to pass variables to jq and consolidated multiple jq calls for better performance and reliability. These changes will make the new pipeline more reliable and easier to debug.
```bash
response=$(curl -s --max-time 120 \
    -H "Content-Type: application/json" \
    -H "anthropic-version: 2023-06-01" \
    -H "${auth_header}" \
    "${curl_extra_args[@]}" \
    -d "$request_body" \
    "https://api.anthropic.com/v1/messages" 2>/dev/null) || {
```
Suppressing stderr for the curl command with 2>/dev/null is risky as it hides important errors related to network connectivity, authentication, or API issues. The -s flag is already used to silence progress output. Removing the redirection will allow critical error messages to be visible for debugging, which is important even with the || check that follows.
```diff
-response=$(curl -s --max-time 120 \
-    -H "Content-Type: application/json" \
-    -H "anthropic-version: 2023-06-01" \
-    -H "${auth_header}" \
-    "${curl_extra_args[@]}" \
-    -d "$request_body" \
-    "https://api.anthropic.com/v1/messages" 2>/dev/null) || {
+response=$(curl -s --max-time 120 \
+    -H "Content-Type: application/json" \
+    -H "anthropic-version: 2023-06-01" \
+    -H "${auth_header}" \
+    "${curl_extra_args[@]}" \
+    -d "$request_body" \
+    "https://api.anthropic.com/v1/messages") || {
```
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
```bash
# Record scores in model-comparisons DB via cmd_score
echo "Recording scores in model-comparisons DB..."
if cmd_score "${score_args[@]}" 2>/dev/null; then
```
Suppressing stderr from cmd_score with 2>/dev/null can hide important errors, such as database connection or write failures. These errors should be visible to diagnose problems with score recording.
```diff
-if cmd_score "${score_args[@]}" 2>/dev/null; then
+if cmd_score "${score_args[@]}"; then
```
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
```bash
m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)
m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)
m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)
m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)
m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)
m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)
m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)
```
These repeated jq calls have multiple issues:
- Error suppression: Suppressing stderr with `2>/dev/null` violates a general project rule. If the judge model returns malformed JSON, all parsing errors will be silenced, making debugging difficult. Errors should be allowed to surface.
- Variable injection: Injecting `$model_name` directly into the `jq` filter string is not robust. It's safer to use `jq --arg` to pass shell variables, which prevents parsing errors if the variable contains special characters.
- Efficiency: Multiple `jq` calls can be inefficient. Consolidating these into a single `jq` invocation improves performance and maintainability.
The suggested code addresses all these points by using jq --arg, removing 2>/dev/null, and consolidating the value extraction into a single pass.
```diff
-m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)
-m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)
-m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)
-m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)
-m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)
-m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)
-m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)
+IFS=$'\t' read -r m_corr m_comp m_qual m_clar m_overall m_str m_wea <<< "$(echo "$judge_json" | jq -r --arg model_name "$model_name" '
+    .scores[$model_name] |
+    [
+        (.correctness // 0),
+        (.completeness // 0),
+        (.quality // 0),
+        (.clarity // 0),
+        (.overall // 0),
+        (.strengths // ""),
+        (.weaknesses // "")
+    ] | @tsv
+')"
```

(Note the `IFS=$'\t'` prefix on `read`: without it, spaces inside the strengths/weaknesses fields would split across variables.)
References
- In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
- Consolidate multiple 'jq' calls into a single pass where possible to improve performance and script efficiency.
🤖 Augment PR Summary

Summary: Adds an automated "judge" scoring step to the existing cross-review pipeline and exposes it via a new /cross-review slash command.

Changes:

Technical Notes: Judge calls use the Anthropic Messages API (v1/messages).
```bash
response=$(curl -s --max-time 120 \
    -H "Content-Type: application/json" \
    -H "anthropic-version: 2023-06-01" \
    -H "${auth_header}" \
```
auth_header (API key / OAuth token) is passed to curl via -H, which typically exposes the secret in the process list (ps) on multi-user systems. Consider an approach that avoids putting credentials on the command line.
Severity: high
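One way to keep the credential out of argv, and therefore out of `ps` output, is to pass the sensitive header through a curl config file supplied by bash process substitution. A minimal sketch under that assumption; the wrapper name here is hypothetical, though a follow-up commit in this thread describes taking this route:

```bash
#!/usr/bin/env bash
# The auth header is written into a curl config fragment that curl reads
# via --config, so the secret never appears on the command line.
call_judge_api() {
  local auth_header=$1 request_body=$2
  curl -s --max-time 120 \
    --config <(printf 'header = "%s"\n' "$auth_header") \
    -H "Content-Type: application/json" \
    -H "anthropic-version: 2023-06-01" \
    -d "$request_body" \
    "https://api.anthropic.com/v1/messages"
}
```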
```bash
# Strip markdown code fences if present
local clean_json
clean_json=$(echo "$ai_text" | sed -n '/^{/,/^}/p' | head -100)
```
Using head -100 here can truncate a valid (pretty-printed/long) JSON response, causing jq validation to fail even when the judge returned correct JSON. That could make --score flaky as the number/size of model outputs grows.
Severity: medium
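A sketch of an untruncated alternative, assuming the markdown code fences are the only wrapping to remove (the helper name is hypothetical, not from the PR):

```bash
#!/usr/bin/env bash
# Delete fence lines but keep the full JSON body regardless of length,
# so a long pretty-printed response still reaches jq intact.
strip_code_fences() {
  sed '/^```/d'
}

fence='```'
ai_text="${fence}json
{\"winner\": \"opus\", \"overall\": 9}
${fence}"
clean_json=$(printf '%s\n' "$ai_text" | strip_code_fences)
printf '%s\n' "$clean_json"   # -> {"winner": "opus", "overall": 9}
```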
```bash
--completeness "$m_comp"
--quality "$m_qual"
--clarity "$m_clar"
--adherence "$m_overall"
```
cmd_score documents --adherence as instruction-following and also recomputes its own overall as an average including adherence; mapping judge overall into --adherence will skew/duplicate the overall score stored in the DB. This may make results/analytics misleading when comparing runs.
Severity: medium
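The skew is easy to see with illustrative numbers (these are made up, not from the PR): when cmd_score averages five criteria and the fifth is the judge's own aggregate, the stored overall drifts whenever the judge's aggregate and the simple mean of the four criteria disagree.

```bash
#!/usr/bin/env bash
# Made-up scores: four criteria, plus the judge's aggregate "overall"
# passed in as --adherence and then re-averaged by cmd_score.
corr=9 comp=6 qual=7 clar=7
judge_overall=5                      # judge weighted correctness down
stored=$(( (corr + comp + qual + clar + judge_overall) / 5 ))
simple_mean=$(( (corr + comp + qual + clar) / 4 ))
echo "judge=$judge_overall mean=$simple_mean stored=$stored"
# -> judge=5 mean=7 stored=6: matches neither the judge nor the mean
```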
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/commands/cross-review.md:
- Around line 13-15: The docs claim the review prompt can be positional but the
implementation of cmd_cross_review only accepts --prompt; either update the docs
to state --prompt is required or modify cmd_cross_review to accept a positional
prompt as well: if choosing docs, change the description to remove "positional"
and clarify "--prompt is required"; if choosing code, update cmd_cross_review's
argument parsing to treat a trailing positional argument as the prompt (fall
back to --prompt if provided) and ensure the help/usage and tests reflect the
new behavior.
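If the code route is chosen, the positional fallback could look roughly like this simplified argument loop (the real parser handles many more flags; names here are illustrative):

```bash
#!/usr/bin/env bash
# Simplified sketch: accept --prompt, but fall back to treating a bare
# trailing argument as the prompt when --prompt was not given.
parse_cross_review_args() {
  prompt=""
  while [ $# -gt 0 ]; do
    case $1 in
      --prompt) prompt=$2; shift 2 ;;
      --*)      shift ;;                        # other flags ignored here
      *)        [ -n "$prompt" ] || prompt=$1; shift ;;
    esac
  done
}

parse_cross_review_args --prompt "explicit prompt"
echo "$prompt"   # -> explicit prompt
parse_cross_review_args "positional prompt"
echo "$prompt"   # -> positional prompt
```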
In @.agents/scripts/compare-models-helper.sh:
- Around line 1090-1093: The current logic calls _judge_cross_review and on
failure logs with print_warning but then returns 0, causing a success exit;
change this to return a non-zero code so CI detects failure—specifically update
the failure branch that handles the _judge_cross_review call (the judge_json
assignment block) to return 1 (or propagate the failing exit status) instead of
return 0, preserving the existing print_warning and saved outputs behavior;
ensure this change is made in the function that contains the _judge_cross_review
invocation so --score requests fail fast when judge scoring fails.
- Around line 779-782: resolve_model_tier() can return provider-prefixed model
IDs which the Anthropic Messages API rejects; before sending judge_model_id to
Anthropic, strip any provider prefix (e.g., "anthropic/") so only the raw model
token (like "claude-opus-4-6") is used. Update the logic around judge_model_id
assignment (the resolve_model_tier call and its result) to detect and remove a
leading "<provider>/" segment (or use parameter expansion) so downstream code
that calls the Anthropic Messages API receives the raw model name; keep
resolve_model_tier and judge_model_id as the referenced symbols to locate the
change.
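The parameter-expansion route the comment mentions is a one-liner; a sketch with an illustrative value:

```bash
#!/usr/bin/env bash
# Drop a leading "<provider>/" segment so the Anthropic Messages API
# receives the raw model token; IDs without a slash pass through unchanged.
judge_model_id="anthropic/claude-opus-4-6"
judge_model_id=${judge_model_id#*/}
echo "$judge_model_id"   # -> claude-opus-4-6
```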
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .agents/scripts/commands/cross-review.md
- .agents/scripts/compare-models-helper.sh
Flagged for Human Review

Reason: PR #2269 (t1329: cross-review judge pipeline) has 1 rejection review. Needs human decision on whether to address reviewer feedback or override. Two competing PRs exist (#2269 and #2267) covering overlapping scope (t1328+t1329).

This issue has been flagged by the AI supervisor for human review. Please assess and take appropriate action.

Flagged by AI Supervisor (automated reasoning cycle)
- Strip provider prefix from judge_model_id before Anthropic API call
- Return non-zero exit code (return 1) when judge scoring fails
- Remove 2>/dev/null from curl, jq, and cmd_score calls for debuggability
- Consolidate multiple jq calls into single passes using --arg for safety
- Remove head -100 truncation that could break valid JSON responses
- Add adherence criterion to judge prompt and fix score mapping
- Update cross-review.md: remove incorrect 'positional' prompt docs
- Pass curl auth headers via --config process substitution instead of -H flags to prevent credential exposure in process list (Augment review)
- Remove head -100 truncation on diff output that could cause flaky results with large model outputs (Augment review)
- Remove unnecessary 2>/dev/null on diff command
All review feedback has been addressed.

Already fixed in prior commit (2e79284):

Fixed in this commit (2415944):

@coderabbitai please re-review
Security & reliability improvements:
Output handling improvements:
API & documentation improvements:
I'll trigger a fresh review of the changes now.

✅ Actions performed

Review triggered.
All review feedback addressed in commit 2e79284
Closing: superseded by PR #2267 (merged as 7beb161), which included the t1329 cross-review judge pipeline alongside t1328. The review feedback from this PR (CodeRabbit, Gemini, Augment) was addressed in the final commits of #2267. Merge conflicts are due to the overlapping scope already landed on main.



Wire existing cross-review output into automated scoring via a judge model.
Changes
- New --score flag to cmd_cross_review() in compare-models-helper.sh
- _judge_cross_review() function dispatches all model outputs to a configurable judge model (default: opus) via Anthropic API, returning structured JSON scores (correctness, completeness, quality, clarity, overall, winner + reasoning)
- _resolve_cross_review_auth() for ANTHROPIC_API_KEY / OpenCode OAuth auth (matches routine-scheduler.sh pattern)
- Judge output wired into cmd_score()
- Winner/loser data synced to pattern tracker via ab-compare (source: cross-review-judge)
- /cross-review slash command at .agents/scripts/commands/cross-review.md
- cmd_help() updated with new flags and examples

Usage
Inspired by Ouroboros multi-model adversarial review (o3/Gemini/Claude consensus before commit).
Ref #2262
Summary by CodeRabbit
New Features