
t1329: Cross-review judge pipeline and /cross-review slash command #2269

Closed

marcusquinn wants to merge 3 commits into main from feature/t1329

Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 25, 2026

Wire existing cross-review output into automated scoring via a judge model.

Changes

  • Add --score flag to cmd_cross_review() in compare-models-helper.sh
  • New _judge_cross_review() function dispatches all model outputs to a configurable judge model (default: opus) via Anthropic API, returning structured JSON scores (correctness, completeness, quality, clarity, overall, winner + reasoning)
  • New _resolve_cross_review_auth() for ANTHROPIC_API_KEY / OpenCode OAuth auth (matches routine-scheduler.sh pattern)
  • Judge scores recorded in model-comparisons SQLite DB via cmd_score()
  • Winner/loser data synced to pattern tracker via ab-compare (source: cross-review-judge)
  • New /cross-review slash command at .agents/scripts/commands/cross-review.md
  • cmd_help() updated with new flags and examples
  • ShellCheck clean, all 45 response-scoring tests pass
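For illustration, here is a plausible shape for the judge's structured JSON output and a jq query over it. The field names come from the list above; the exact schema emitted by _judge_cross_review() is an assumption.

```shell
# Hypothetical judge output; the real schema may differ in detail.
judge_json='{
  "scores": {
    "sonnet": {"correctness": 8, "completeness": 7, "quality": 8, "clarity": 9, "overall": 8}
  },
  "winner": "sonnet",
  "reasoning": "Most complete and accurate review."
}'

# Extract one model's overall score, passing the model name safely via --arg.
jq -r --arg m "sonnet" '.scores[$m].overall' <<<"$judge_json"
```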

Usage

# Basic diff (existing behaviour, unchanged)
compare-models-helper.sh cross-review --prompt "Review this code" --models "sonnet,opus"

# Full judge pipeline (new)
compare-models-helper.sh cross-review --prompt "Review this PR diff" --models "sonnet,opus,pro" --score

# Custom judge and task type
compare-models-helper.sh cross-review --prompt "Audit architecture" --models "sonnet,opus" --score --judge opus --task-type analysis

Inspired by Ouroboros multi-model adversarial review (o3/Gemini/Claude consensus before commit).

Ref #2262

Summary by CodeRabbit

New Features

  • Multi-model cross-review now supports optional judge-based scoring to evaluate and compare outputs across models
  • Configurable judge model and task type selection for custom scoring criteria
  • Judge scores stored and can integrate with pattern tracker for AB testing workflows

- Add _resolve_cross_review_auth() for ANTHROPIC_API_KEY / OAuth auth
- Add _judge_cross_review() that dispatches all model outputs to a judge
  model (default: opus) via Anthropic API, returns structured JSON scores
  (correctness, completeness, quality, clarity, overall, winner)
- Extend cmd_cross_review() with --score, --judge, --task-type flags
- Wire judge output into cmd_score() -> model-comparisons SQLite DB
- Sync winner/loser data to pattern tracker via ab-compare (source: cross-review-judge)
- Create /cross-review slash command at .agents/scripts/commands/cross-review.md
- Update cmd_help() with new flags and examples
- ShellCheck clean, all 45 response-scoring tests pass

Inspired by Ouroboros multi-model adversarial review (o3/Gemini/Claude consensus).
Chose Anthropic API direct call (matches routine-scheduler.sh pattern) over runner-helper
dispatch for the judge — simpler, no runner lifecycle overhead for a single synchronous call.
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AI model cross-review functionality by integrating an automated judge scoring pipeline. It allows for objective evaluation of multiple AI model responses against a given prompt, recording detailed scores and identifying a winner. This system not only provides structured feedback but also feeds into a pattern tracker to improve future model routing decisions, making the model comparison process more robust and data-driven.

Highlights

  • Automated Judge Scoring Pipeline: Introduced a new --score flag to cmd_cross_review() that enables an automated judge scoring pipeline. This dispatches model outputs to a configurable judge model (default: Opus) via the Anthropic API, returning structured JSON scores for correctness, completeness, quality, clarity, and overall assessment.
  • Anthropic API Authentication: Added a new _resolve_cross_review_auth() function to handle Anthropic API authentication, supporting both ANTHROPIC_API_KEY environment variable and OpenCode OAuth tokens, ensuring secure access to the judge model.
  • Data Persistence and Pattern Tracking: Integrated the judge scores into the model-comparisons SQLite database via cmd_score() and synced winner/loser data to the pattern tracker using ab-compare, facilitating data-driven model routing and performance analysis.
  • New Slash Command: Created a new /cross-review slash command, providing a user-friendly interface to trigger multi-model adversarial reviews with optional judge scoring directly from the command line.
  • Enhanced cmd_help(): Updated the cmd_help() function with new flags and examples for the cross-review command, improving user guidance for the newly added scoring capabilities.
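A minimal sketch of the auth fallback described above: prefer ANTHROPIC_API_KEY, otherwise fall back to an OpenCode OAuth token. The auth.json path and jq key here are illustrative guesses, not the actual implementation in _resolve_cross_review_auth().

```shell
resolve_cross_review_auth() {
  # Prefer the environment variable; fall back to a hypothetical OAuth file.
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    printf 'x-api-key: %s\n' "${ANTHROPIC_API_KEY}"
  elif [ -f "${HOME}/.opencode/auth.json" ]; then
    printf 'Authorization: Bearer %s\n' \
      "$(jq -r '.anthropic.access_token // empty' "${HOME}/.opencode/auth.json")"
  else
    echo "No Anthropic credentials found" >&2
    return 1
  fi
}
```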
Changelog
  • .agents/scripts/commands/cross-review.md
    • Added a new markdown file defining the /cross-review slash command.
    • Documented the command's purpose, arguments (--prompt, --models, --score, --judge, --task-type, --timeout, --output), and usage examples.
    • Detailed the judge scoring criteria and the expected output format.
  • .agents/scripts/compare-models-helper.sh
    • Updated the cmd_cross_review description to reflect the new judge scoring capability.
    • Implemented _resolve_cross_review_auth() to manage Anthropic API key and OAuth authentication.
    • Added _judge_cross_review() to orchestrate the judge model API call, prompt construction, and JSON response parsing for scoring.
    • Modified cmd_cross_review() to parse new command-line arguments: --score, --judge, and --task-type.
    • Integrated the judge scoring logic within cmd_cross_review(), including calling the judge model, saving results, displaying scores, and recording them to the model-comparisons database.
    • Added logic to sync winner/loser data and normalized scores to the pattern tracker via pattern-tracker-helper.sh.
    • Expanded cmd_help() with new examples and descriptions for the --score, --judge, and --task-type options.
Activity
  • No human activity has been recorded on this pull request yet.

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.


📥 Commits

Reviewing files that changed from the base of the PR and between f9d555f and 2415944.

📒 Files selected for processing (3)
  • .agents/scripts/commands/cross-review.md
  • .agents/scripts/compare-models-helper.sh
  • todo/tasks/t1329-brief.md

Walkthrough

Introduces a multi-model cross-review workflow with judge-based scoring capabilities. The feature allows parallel model comparison, automatic diff computation, and optional scoring through a configurable judge model that normalizes results and integrates with pattern tracking and persistent storage.

Changes

Cohort / File(s) Summary
Documentation
.agents/scripts/commands/cross-review.md
New comprehensive guide documenting the cross-review tool, including argument parsing, usage modes (basic and judge-scored), sequence of operations, results viewing, scoring criteria, examples, and related commands.
Core Cross-Review Implementation
.agents/scripts/compare-models-helper.sh
Added _resolve_cross_review_auth() for authentication handling and _judge_cross_review() for judge subsystem logic. Extended cmd_cross_review() with --score, --judge, and --task-type flags. Integrated judge-based scoring flow including API invocation, JSON result extraction, score normalization (1-10 to 1-5), pattern-tracker integration for A/B comparisons, and persistent storage to judge-scores.json. Updated help documentation with new options and examples.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CrossReview as Cross-Review<br/>(cmd_cross_review)
    participant Judge as Judge Model<br/>API
    participant PatternTracker as Pattern<br/>Tracker
    participant Storage as Judge<br/>Storage

    User->>CrossReview: Invoke with --score flag
    CrossReview->>CrossReview: Dispatch prompt to multiple models
    CrossReview->>CrossReview: Collect outputs & compute diffs
    CrossReview->>Judge: Invoke judge with prompt + outputs
    Judge->>Judge: Analyze & score models
    Judge-->>CrossReview: Return JSON scores (1-10)
    CrossReview->>CrossReview: Normalize scores (1-10 → 1-5)
    CrossReview->>CrossReview: Update model-scoring display
    CrossReview->>PatternTracker: Feed A/B comparison data
    PatternTracker-->>CrossReview: Pattern tracking complete
    CrossReview->>Storage: Persist judge outputs
    Storage-->>CrossReview: Stored to judge-scores.json
    CrossReview-->>User: Return results & scoring
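The "Normalize scores (1-10 → 1-5)" step in the flow above could be, for example, a linear rescale; the actual formula and rounding used by the script are not shown in this thread, so this is only an assumed mapping.

```shell
# One plausible linear mapping from a 1-10 judge score onto the 1-5 scale
# expected by cmd_score; the script's actual rounding may differ.
normalize_score() {
  awk -v s="$1" 'BEGIN { printf "%.1f\n", 1 + (s - 1) * 4 / 9 }'
}

normalize_score 10   # 5.0
normalize_score 1    # 1.0
```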

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🤖 Multiple minds convene to compare,
A judge weighs outputs with expert care,
Scores flow like streams from 1 to 10,
Then normalized, stored safe again—
Cross-review wisdom, transparent and fair. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the main changes: implementing a cross-review judge pipeline and slash command, directly aligned with the PR objectives and file modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@github-actions
Contributor

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 58 code smells

[INFO] Recent monitoring activity:
Wed Feb 25 02:51:10 UTC 2026: Code review monitoring started
Wed Feb 25 02:51:11 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 58

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 58
  • VULNERABILITIES: 0

Generated on: Wed Feb 25 02:51:13 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: an automated judge pipeline for cross-model reviews, complete with a new /cross-review slash command. The implementation is comprehensive, covering API authentication, parallel model execution, dispatching to a judge model, and recording scores in a database.

My review focuses on improving the script's robustness and error handling. I've identified a few places where error output is suppressed (2>/dev/null), which could hide critical issues with API calls or database writes. I've also suggested a safer way to pass variables to jq and consolidated multiple jq calls for better performance and reliability. These changes will make the new pipeline more reliable and easier to debug.

Comment on lines +800 to +806
response=$(curl -s --max-time 120 \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-H "${auth_header}" \
"${curl_extra_args[@]}" \
-d "$request_body" \
"https://api.anthropic.com/v1/messages" 2>/dev/null) || {

Severity: high

Suppressing stderr for the curl command with 2>/dev/null is risky as it hides important errors related to network connectivity, authentication, or API issues. The -s flag is already used to silence progress output. Removing the redirection will allow critical error messages to be visible for debugging, which is important even with the || check that follows.

Suggested change
response=$(curl -s --max-time 120 \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-H "${auth_header}" \
"${curl_extra_args[@]}" \
-d "$request_body" \
"https://api.anthropic.com/v1/messages" 2>/dev/null) || {
response=$(curl -s --max-time 120 \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-H "${auth_header}" \
"${curl_extra_args[@]}" \
-d "$request_body" \
"https://api.anthropic.com/v1/messages") || {
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.


# Record scores in model-comparisons DB via cmd_score
echo "Recording scores in model-comparisons DB..."
if cmd_score "${score_args[@]}" 2>/dev/null; then

Severity: high

Suppressing stderr from cmd_score with 2>/dev/null can hide important errors, such as database connection or write failures. These errors should be visible to diagnose problems with score recording.

Suggested change
if cmd_score "${score_args[@]}" 2>/dev/null; then
if cmd_score "${score_args[@]}"; then
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines +1124 to +1130
m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)
m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)
m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)
m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)
m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)
m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)
m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)

Severity: medium

These repeated jq calls have multiple issues:

  1. Error Suppression: Suppressing stderr with 2>/dev/null violates a general project rule. If the judge model returns malformed JSON, all parsing errors will be silenced, making debugging difficult. Errors should be allowed to surface.
  2. Variable Injection: Injecting $model_name directly into the jq filter string is not robust. It's safer to use jq --arg to pass shell variables, which prevents parsing errors if the variable contains special characters.
  3. Efficiency: Multiple jq calls can be inefficient. Consolidating these into a single jq invocation improves performance and maintainability.

The suggested code addresses all these points by using jq --arg, removing 2>/dev/null, and consolidating the value extraction into a single pass.

Suggested change
m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)
m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)
m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)
m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)
m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)
m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)
m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)
read -r m_corr m_comp m_qual m_clar m_overall m_str m_wea <<< "$(echo "$judge_json" | jq -r --arg model_name "$model_name" '
.scores[$model_name] |
[
(.correctness // 0),
(.completeness // 0),
(.quality // 0),
(.clarity // 0),
(.overall // 0),
(.strengths // ""),
(.weaknesses // "")
] | @tsv
')"
References
  1. In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
  2. In shell scripts, use jq --arg for strings and --argjson for other JSON types (like numbers) to safely pass variables into a jq filter. This avoids syntax errors if the variables contain special characters.
  3. Consolidate multiple 'jq' calls into a single pass where possible to improve performance and script efficiency.

@augmentcode

augmentcode bot commented Feb 25, 2026

🤖 Augment PR Summary

Summary: Adds an automated “judge” scoring step to the existing cross-review pipeline and exposes it via a new /cross-review slash command.

Changes:

  • Extends compare-models-helper.sh cross-review with --score, --judge, and --task-type flags.
  • Adds auth resolution for Anthropic (env API key or OpenCode OAuth auth.json).
  • Implements a judge dispatch that sends the original prompt + all model outputs to a configurable judge model and expects structured JSON scores + winner.
  • Persists judge output to judge-scores.json in the run output directory.
  • Records scores into the model-comparisons SQLite DB via cmd_score.
  • Syncs winner/loser and normalized scores into the pattern tracker (source: cross-review-judge).
  • Adds documentation for the new /cross-review command and updates cmd_help() examples.

Technical Notes: Judge calls use the Anthropic Messages API (anthropic-version: 2023-06-01), default judge tier is opus, and scoring runs keep the existing diff/summary behavior while optionally adding the DB + pattern-tracker sync step.



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


response=$(curl -s --max-time 120 \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-H "${auth_header}" \

auth_header (API key / OAuth token) is passed to curl via -H, which typically exposes the secret in the process list (ps) on multi-user systems. Consider an approach that avoids putting credentials on the command line.

Severity: high

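One way to implement the suggested mitigation is to hand curl its auth header through a config file read via process substitution, so the secret never shows up in the argument list. This is a sketch, not the code eventually committed in this PR.

```shell
# Build a curl config line carrying the auth header; `curl --config <(...)`
# reads it without the secret appearing in `ps` output.
auth_header="x-api-key: ${ANTHROPIC_API_KEY:-sk-placeholder}"

make_curl_cfg() {
  printf 'header = "%s"\n' "$1"
}

# The real call would then look like:
#   curl -s --max-time 120 --config <(make_curl_cfg "$auth_header") \
#     -H "Content-Type: application/json" -d "$request_body" \
#     "https://api.anthropic.com/v1/messages"
make_curl_cfg "$auth_header"
```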


# Strip markdown code fences if present
local clean_json
clean_json=$(echo "$ai_text" | sed -n '/^{/,/^}/p' | head -100)

Using head -100 here can truncate a valid (pretty-printed/long) JSON response, causing jq validation to fail even when the judge returned correct JSON. That could make --score flaky as the number/size of model outputs grows.

Severity: medium

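A sketch of one fix for the truncation issue above: delete only the fence lines, with no line cap, so arbitrarily long pretty-printed JSON survives. The approach (and the simulated reply below) is an assumption, not the committed fix.

```shell
strip_fences() {
  # Drop any line that begins with three backticks (an opening ```json
  # fence or a bare closing fence), with no head -100 style line cap.
  sed -e '/^```/d' <<<"$1"
}

# Simulated judge reply wrapped in markdown fences.
fence='```'
ai_text="${fence}json
{
  \"winner\": \"sonnet\"
}
${fence}"

strip_fences "$ai_text" | jq -r '.winner'
```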

--completeness "$m_comp"
--quality "$m_qual"
--clarity "$m_clar"
--adherence "$m_overall"

cmd_score documents --adherence as instruction-following and also recomputes its own overall as an average including adherence; mapping judge overall into --adherence will skew/duplicate the overall score stored in the DB. This may make results/analytics misleading when comparing runs.

Severity: medium


coderabbitai[bot]
coderabbitai bot previously requested changes Feb 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/commands/cross-review.md:
- Around line 13-15: The docs claim the review prompt can be positional but the
implementation of cmd_cross_review only accepts --prompt; either update the docs
to state --prompt is required or modify cmd_cross_review to accept a positional
prompt as well: if choosing docs, change the description to remove "positional"
and clarify "--prompt is required"; if choosing code, update cmd_cross_review's
argument parsing to treat a trailing positional argument as the prompt (fall
back to --prompt if provided) and ensure the help/usage and tests reflect the
new behavior.

In @.agents/scripts/compare-models-helper.sh:
- Around line 1090-1093: The current logic calls _judge_cross_review and on
failure logs with print_warning but then returns 0, causing a success exit;
change this to return a non-zero code so CI detects failure—specifically update
the failure branch that handles the _judge_cross_review call (the judge_json
assignment block) to return 1 (or propagate the failing exit status) instead of
return 0, preserving the existing print_warning and saved outputs behavior;
ensure this change is made in the function that contains the _judge_cross_review
invocation so --score requests fail fast when judge scoring fails.
- Around line 779-782: resolve_model_tier() can return provider-prefixed model
IDs which the Anthropic Messages API rejects; before sending judge_model_id to
Anthropic, strip any provider prefix (e.g., "anthropic/") so only the raw model
token (like "claude-opus-4-6") is used. Update the logic around judge_model_id
assignment (the resolve_model_tier call and its result) to detect and remove a
leading "<provider>/" segment (or use parameter expansion) so downstream code
that calls the Anthropic Messages API receives the raw model name; keep
resolve_model_tier and judge_model_id as the referenced symbols to locate the
change.
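The provider-prefix stripping requested above can be done with plain parameter expansion; the model IDs below are illustrative, not the real tier mapping.

```shell
# Remove a leading "provider/" segment if present; IDs without a slash
# pass through unchanged.
judge_model_id="anthropic/claude-opus-example"
judge_model_id="${judge_model_id#*/}"
echo "$judge_model_id"   # claude-opus-example

bare_id="claude-opus-example"
echo "${bare_id#*/}"     # claude-opus-example (unchanged)
```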

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8dce8 and f9d555f.

📒 Files selected for processing (2)
  • .agents/scripts/commands/cross-review.md
  • .agents/scripts/compare-models-helper.sh

@marcusquinn marcusquinn added the needs-review Flagged for human review by AI supervisor label Feb 25, 2026
@marcusquinn
Owner Author

Flagged for Human Review

Reason: PR #2269 (t1329: cross-review judge pipeline) has 1 rejection review. Needs human decision on whether to address reviewer feedback or override. Two competing PRs exist (#2269 and #2267) covering overlapping scope (t1328+t1329).

This issue has been flagged by the AI supervisor for human review. Please assess and take appropriate action.


Flagged by AI Supervisor (automated reasoning cycle)

- Strip provider prefix from judge_model_id before Anthropic API call
- Return non-zero exit code (return 1) when judge scoring fails
- Remove 2>/dev/null from curl, jq, and cmd_score calls for debuggability
- Consolidate multiple jq calls into single passes using --arg for safety
- Remove head -100 truncation that could break valid JSON responses
- Add adherence criterion to judge prompt and fix score mapping
- Update cross-review.md: remove incorrect 'positional' prompt docs
- Pass curl auth headers via --config process substitution instead of -H
  flags to prevent credential exposure in process list (Augment review)
- Remove head -100 truncation on diff output that could cause flaky
  results with large model outputs (Augment review)
- Remove unnecessary 2>/dev/null on diff command
@marcusquinn
Owner Author

All review feedback has been addressed:

Already fixed in prior commit (2e79284):

  • Provider prefix stripping for Anthropic API model IDs (CodeRabbit)
  • Judge failure now returns non-zero exit code (CodeRabbit)
  • Consolidated jq calls with --arg for safe variable passing (Gemini)
  • Removed 2>/dev/null from curl and cmd_score calls (Gemini)
  • Docs updated: --prompt is required, no positional (CodeRabbit)
  • Dedicated adherence field from judge JSON, not mapped from overall (Augment)

Fixed in this commit (2415944):

  • Auth headers passed via curl --config <(...) process substitution to avoid credential exposure in ps (Augment)
  • Removed head -100 truncation on diff output that could cause flaky results with large model outputs (Augment)
  • Removed unnecessary 2>/dev/null on diff command

@coderabbitai please re-review

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

@marcusquinn: Thank you for the comprehensive fixes! I can see you've addressed all the feedback systematically:

Security & reliability improvements:

  • ✅ Auth headers now protected via process substitution (no credential exposure in ps)
  • ✅ Judge failures properly return non-zero exit codes
  • ✅ Safe variable passing with jq --arg

Output handling improvements:

  • ✅ Removed head -100 truncation that could cause flaky results
  • ✅ Proper error visibility (removed unnecessary 2>/dev/null)

API & documentation improvements:

  • ✅ Provider prefix stripping for Anthropic models
  • ✅ Dedicated adherence field handling
  • ✅ Docs clarify --prompt as required

I'll trigger a fresh review of the changes now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 25, 2026 14:01

All review feedback addressed in commit 2e79284

@marcusquinn
Owner Author

Closing: superseded by PR #2267 (merged as 7beb161), which included the t1329 cross-review judge pipeline alongside t1328. The review feedback from this PR (CodeRabbit, Gemini, Augment) was addressed in the final commits of #2267. Merge conflicts are due to the overlapping scope already landed on main.

