t1329: Cross-review judge pipeline and /cross-review slash command by marcusquinn · Pull Request #2269 · marcusquinn/aidevops

marcusquinn · 2026-02-25T02:50:42Z

Wire existing cross-review output into automated scoring via a judge model.

Changes

Add --score flag to cmd_cross_review() in compare-models-helper.sh
New _judge_cross_review() function dispatches all model outputs to a configurable judge model (default: opus) via Anthropic API, returning structured JSON scores (correctness, completeness, quality, clarity, overall, winner + reasoning)
New _resolve_cross_review_auth() for ANTHROPIC_API_KEY / OpenCode OAuth auth (matches routine-scheduler.sh pattern)
Judge scores recorded in model-comparisons SQLite DB via cmd_score()
Winner/loser data synced to pattern tracker via ab-compare (source: cross-review-judge)
New /cross-review slash command at .agents/scripts/commands/cross-review.md
cmd_help() updated with new flags and examples
ShellCheck clean, all 45 response-scoring tests pass

Usage

# Basic diff (existing behaviour, unchanged)
compare-models-helper.sh cross-review --prompt "Review this code" --models "sonnet,opus"

# Full judge pipeline (new)
compare-models-helper.sh cross-review --prompt "Review this PR diff" --models "sonnet,opus,pro" --score

# Custom judge and task type
compare-models-helper.sh cross-review --prompt "Audit architecture" --models "sonnet,opus" --score --judge opus --task-type analysis

Inspired by Ouroboros multi-model adversarial review (o3/Gemini/Claude consensus before commit).

Ref #2262

Summary by CodeRabbit

New Features

Multi-model cross-review now supports optional judge-based scoring to evaluate and compare outputs across models
Configurable judge model and task type selection for custom scoring criteria
Judge scores stored and can integrate with pattern tracker for AB testing workflows

- Add _resolve_cross_review_auth() for ANTHROPIC_API_KEY / OAuth auth - Add _judge_cross_review() that dispatches all model outputs to a judge model (default: opus) via Anthropic API, returns structured JSON scores (correctness, completeness, quality, clarity, overall, winner) - Extend cmd_cross_review() with --score, --judge, --task-type flags - Wire judge output into cmd_score() -> model-comparisons SQLite DB - Sync winner/loser data to pattern tracker via ab-compare (source: cross-review-judge) - Create /cross-review slash command at .agents/scripts/commands/cross-review.md - Update cmd_help() with new flags and examples - ShellCheck clean, all 45 response-scoring tests pass Inspired by Ouroboros multi-model adversarial review (o3/Gemini/Claude consensus). Chose Anthropic API direct call (matches routine-scheduler.sh pattern) over runner-helper dispatch for the judge — simpler, no runner lifecycle overhead for a single synchronous call.

gemini-code-assist · 2026-02-25T02:50:59Z

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AI model cross-review functionality by integrating an automated judge scoring pipeline. It allows for objective evaluation of multiple AI model responses against a given prompt, recording detailed scores and identifying a winner. This system not only provides structured feedback but also feeds into a pattern tracker to improve future model routing decisions, making the model comparison process more robust and data-driven.

Highlights

Automated Judge Scoring Pipeline: Introduced a new --score flag to cmd_cross_review() that enables an automated judge scoring pipeline. This dispatches model outputs to a configurable judge model (default: Opus) via the Anthropic API, returning structured JSON scores for correctness, completeness, quality, clarity, and overall assessment.
Anthropic API Authentication: Added a new _resolve_cross_review_auth() function to handle Anthropic API authentication, supporting both ANTHROPIC_API_KEY environment variable and OpenCode OAuth tokens, ensuring secure access to the judge model.
Data Persistence and Pattern Tracking: Integrated the judge scores into the model-comparisons SQLite database via cmd_score() and synced winner/loser data to the pattern tracker using ab-compare, facilitating data-driven model routing and performance analysis.
New Slash Command: Created a new /cross-review slash command, providing a user-friendly interface to trigger multi-model adversarial reviews with optional judge scoring directly from the command line.
Enhanced cmd_help(): Updated the cmd_help() function with new flags and examples for the cross-review command, improving user guidance for the newly added scoring capabilities.

Changelog

.agents/scripts/commands/cross-review.md
- Added a new markdown file defining the /cross-review slash command.
- Documented the command's purpose, arguments (--prompt, --models, --score, --judge, --task-type, --timeout, --output), and usage examples.
- Detailed the judge scoring criteria and the expected output format.
.agents/scripts/compare-models-helper.sh
- Updated the cmd_cross_review description to reflect the new judge scoring capability.
- Implemented _resolve_cross_review_auth() to manage Anthropic API key and OAuth authentication.
- Added _judge_cross_review() to orchestrate the judge model API call, prompt construction, and JSON response parsing for scoring.
- Modified cmd_cross_review() to parse new command-line arguments: --score, --judge, and --task-type.
- Integrated the judge scoring logic within cmd_cross_review(), including calling the judge model, saving results, displaying scores, and recording them to the model-comparisons database.
- Added logic to sync winner/loser data and normalized scores to the pattern tracker via pattern-tracker-helper.sh.
- Expanded cmd_help() with new examples and descriptions for the --score, --judge, and --task-type options.

Activity

No human activity has been recorded on this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-02-25T02:51:00Z

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between f9d555f and 2415944.

📒 Files selected for processing (3)

.agents/scripts/commands/cross-review.md
.agents/scripts/compare-models-helper.sh
todo/tasks/t1329-brief.md

Walkthrough

Introduces a multi-model cross-review workflow with judge-based scoring capabilities. The feature allows parallel model comparison, automatic diff computation, and optional scoring through a configurable judge model that normalizes results and integrates with pattern tracking and persistent storage.

Changes

Cohort / File(s)	Summary
Documentation `.agents/scripts/commands/cross-review.md`	New comprehensive guide documenting the cross-review tool, including argument parsing, usage modes (basic and judge-scored), sequence of operations, results viewing, scoring criteria, examples, and related commands.
Core Cross-Review Implementation `.agents/scripts/compare-models-helper.sh`	Added `_resolve_cross_review_auth()` for authentication handling and `_judge_cross_review()` for judge subsystem logic. Extended `cmd_cross_review()` with --score, --judge, and --task-type flags. Integrated judge-based scoring flow including API invocation, JSON result extraction, score normalization (1-10 to 1-5), pattern-tracker integration for A/B comparisons, and persistent storage to judge-scores.json. Updated help documentation with new options and examples.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CrossReview as Cross-Review<br/>(cmd_cross_review)
    participant Judge as Judge Model<br/>API
    participant PatternTracker as Pattern<br/>Tracker
    participant Storage as Judge<br/>Storage

    User->>CrossReview: Invoke with --score flag
    CrossReview->>CrossReview: Dispatch prompt to multiple models
    CrossReview->>CrossReview: Collect outputs & compute diffs
    CrossReview->>Judge: Invoke judge with prompt + outputs
    Judge->>Judge: Analyze & score models
    Judge-->>CrossReview: Return JSON scores (1-10)
    CrossReview->>CrossReview: Normalize scores (1-10 → 1-5)
    CrossReview->>CrossReview: Update model-scoring display
    CrossReview->>PatternTracker: Feed A/B comparison data
    PatternTracker-->>CrossReview: Pattern tracking complete
    CrossReview->>Storage: Persist judge outputs
    Storage-->>CrossReview: Stored to judge-scores.json
    CrossReview-->>User: Return results & scoring

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

feat: add comparison scoring framework to compare-models-helper (t168.3) #783 — Introduces the scoring framework (cmd_score) that the judge-based workflow directly integrates with for persistent result storage.
feat: Claude-Flow inspired features - model routing, semantic memory, pattern tracking (t102) #341 — Adds pattern-tracker helper integration for A/B comparisons, which the new scoring flow invokes with normalized judge scores.

Poem

🤖 Multiple minds convene to compare,
A judge weighs outputs with expert care,
Scores flow like streams from 1 to 10,
Then normalized, stored safe again—
Cross-review wisdom, transparent and fair. ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly identifies the main changes: implementing a cross-review judge pipeline and slash command, directly aligned with the PR objectives and file modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/t1329

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-02-25T02:51:15Z

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 58 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Wed Feb 25 02:51:10 UTC 2026: Code review monitoring started
Wed Feb 25 02:51:11 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 58

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 58
VULNERABILITIES: 0

Generated on: Wed Feb 25 02:51:13 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-25T02:52:04Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist

Code Review

This pull request introduces a significant new feature: an automated judge pipeline for cross-model reviews, complete with a new /cross-review slash command. The implementation is comprehensive, covering API authentication, parallel model execution, dispatching to a judge model, and recording scores in a database.

My review focuses on improving the script's robustness and error handling. I've identified a few places where error output is suppressed (2>/dev/null), which could hide critical issues with API calls or database writes. I've also suggested a safer way to pass variables to jq and consolidated multiple jq calls for better performance and reliability. These changes will make the new pipeline more reliable and easier to debug.

gemini-code-assist · 2026-02-25T02:52:56Z

+	response=$(curl -s --max-time 120 \
+		-H "Content-Type: application/json" \
+		-H "anthropic-version: 2023-06-01" \
+		-H "${auth_header}" \
+		"${curl_extra_args[@]}" \
+		-d "$request_body" \
+		"https://api.anthropic.com/v1/messages" 2>/dev/null) || {


Suppressing stderr for the curl command with 2>/dev/null is risky as it hides important errors related to network connectivity, authentication, or API issues. The -s flag is already used to silence progress output. Removing the redirection will allow critical error messages to be visible for debugging, which is important even with the || check that follows.

Suggested change

response=$(curl -s --max-time 120 \

-H "Content-Type: application/json" \

-H "anthropic-version: 2023-06-01" \

-H "${auth_header}" \

"${curl_extra_args[@]}" \

-d "$request_body" \

"https://api.anthropic.com/v1/messages" 2>/dev/null) || {

response=$(curl -s --max-time 120 \

-H "Content-Type: application/json" \

-H "anthropic-version: 2023-06-01" \

-H "${auth_header}" \

"${curl_extra_args[@]}" \

-d "$request_body" \

"https://api.anthropic.com/v1/messages") || {

References

Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

gemini-code-assist · 2026-02-25T02:52:57Z

+
+		# Record scores in model-comparisons DB via cmd_score
+		echo "Recording scores in model-comparisons DB..."
+		if cmd_score "${score_args[@]}" 2>/dev/null; then


Suppressing stderr from cmd_score with 2>/dev/null can hide important errors, such as database connection or write failures. These errors should be visible to diagnose problems with score recording.

Suggested change

if cmd_score "${score_args[@]}" 2>/dev/null; then

if cmd_score "${score_args[@]}"; then

References

Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

gemini-code-assist · 2026-02-25T02:52:57Z

+			m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)
+			m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)
+			m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)
+			m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)
+			m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)
+			m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)
+			m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)


These repeated jq calls have multiple issues:

Error Suppression: Suppressing stderr with 2>/dev/null violates a general project rule. If the judge model returns malformed JSON, all parsing errors will be silenced, making debugging difficult. Errors should be allowed to surface.

Variable Injection: Injecting $model_name directly into the jq filter string is not robust. It's safer to use jq --arg to pass shell variables, which prevents parsing errors if the variable contains special characters.

Efficiency: Multiple jq calls can be inefficient. Consolidating these into a single jq invocation improves performance and maintainability.

The suggested code addresses all these points by using jq --arg, removing 2>/dev/null, and consolidating the value extraction into a single pass.

Suggested change

m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)

m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)

m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)

m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)

m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)

m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)

m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)

read -r m_corr m_comp m_qual m_clar m_overall m_str m_wea <<< "$(echo "$judge_json" | jq -r --arg model_name "$model_name" '

.scores[$model_name] |

[

(.correctness // 0),

(.completeness // 0),

(.quality // 0),

(.clarity // 0),

(.overall // 0),

(.strengths // ""),

(.weaknesses // "")

] | @tsv

')"

References

In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.

In shell scripts, use jq --arg for strings and --argjson for other JSON types (like numbers) to safely pass variables into a jq filter. This avoids syntax errors if the variables contain special characters.

Consolidate multiple 'jq' calls into a single pass where possible to improve performance and script efficiency.

augmentcode · 2026-02-25T02:53:33Z

🤖 Augment PR Summary

Summary: Adds an automated “judge” scoring step to the existing cross-review pipeline and exposes it via a new /cross-review slash command.

Changes:

Extends compare-models-helper.sh cross-review with --score, --judge, and --task-type flags.
Adds auth resolution for Anthropic (env API key or OpenCode OAuth auth.json).
Implements a judge dispatch that sends the original prompt + all model outputs to a configurable judge model and expects structured JSON scores + winner.
Persists judge output to judge-scores.json in the run output directory.
Records scores into the model-comparisons SQLite DB via cmd_score.
Syncs winner/loser and normalized scores into the pattern tracker (source: cross-review-judge).
Adds documentation for the new /cross-review command and updates cmd_help() examples.

Technical Notes: Judge calls use the Anthropic Messages API (anthropic-version: 2023-06-01), default judge tier is opus, and scoring runs keep the existing diff/summary behavior while optionally adding the DB + pattern-tracker sync step.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 3 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-02-25T02:53:34Z

+	response=$(curl -s --max-time 120 \
+		-H "Content-Type: application/json" \
+		-H "anthropic-version: 2023-06-01" \
+		-H "${auth_header}" \


auth_header (API key / OAuth token) is passed to curl via -H, which typically exposes the secret in the process list (ps) on multi-user systems. Consider an approach that avoids putting credentials on the command line.

Severity: high

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-02-25T02:53:34Z

+
+	# Strip markdown code fences if present
+	local clean_json
+	clean_json=$(echo "$ai_text" | sed -n '/^{/,/^}/p' | head -100)


Using head -100 here can truncate a valid (pretty-printed/long) JSON response, causing jq validation to fail even when the judge returned correct JSON. That could make --score flaky as the number/size of model outputs grows.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-02-25T02:53:34Z

+				--completeness "$m_comp"
+				--quality "$m_qual"
+				--clarity "$m_clar"
+				--adherence "$m_overall"


cmd_score documents --adherence as instruction-following and also recomputes its own overall as an average including adherence; mapping judge overall into --adherence will skew/duplicate the overall score stored in the DB. This may make results/analytics misleading when comparing runs.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/commands/cross-review.md:
- Around line 13-15: The docs claim the review prompt can be positional but the
implementation of cmd_cross_review only accepts --prompt; either update the docs
to state --prompt is required or modify cmd_cross_review to accept a positional
prompt as well: if choosing docs, change the description to remove "positional"
and clarify "--prompt is required"; if choosing code, update cmd_cross_review's
argument parsing to treat a trailing positional argument as the prompt (fall
back to --prompt if provided) and ensure the help/usage and tests reflect the
new behavior.

In @.agents/scripts/compare-models-helper.sh:
- Around line 1090-1093: The current logic calls _judge_cross_review and on
failure logs with print_warning but then returns 0, causing a success exit;
change this to return a non-zero code so CI detects failure—specifically update
the failure branch that handles the _judge_cross_review call (the judge_json
assignment block) to return 1 (or propagate the failing exit status) instead of
return 0, preserving the existing print_warning and saved outputs behavior;
ensure this change is made in the function that contains the _judge_cross_review
invocation so --score requests fail fast when judge scoring fails.
- Around line 779-782: resolve_model_tier() can return provider-prefixed model
IDs which the Anthropic Messages API rejects; before sending judge_model_id to
Anthropic, strip any provider prefix (e.g., "anthropic/") so only the raw model
token (like "claude-opus-4-6") is used. Update the logic around judge_model_id
assignment (the resolve_model_tier call and its result) to detect and remove a
leading "<provider>/" segment (or use parameter expansion) so downstream code
that calls the Anthropic Messages API receives the raw model name; keep
resolve_model_tier and judge_model_id as the referenced symbols to locate the
change.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8dce8 and f9d555f.

📒 Files selected for processing (2)

.agents/scripts/commands/cross-review.md
.agents/scripts/compare-models-helper.sh

marcusquinn · 2026-02-25T07:27:55Z

Flagged for Human Review

Reason: PR #2269 (t1329: cross-review judge pipeline) has 1 rejection review. Needs human decision on whether to address reviewer feedback or override. Two competing PRs exist (#2269 and #2267) covering overlapping scope (t1328+t1329).

This issue has been flagged by the AI supervisor for human review. Please assess and take appropriate action.

Flagged by AI Supervisor (automated reasoning cycle)

- Strip provider prefix from judge_model_id before Anthropic API call - Return non-zero exit code (return 1) when judge scoring fails - Remove 2>/dev/null from curl, jq, and cmd_score calls for debuggability - Consolidate multiple jq calls into single passes using --arg for safety - Remove head -100 truncation that could break valid JSON responses - Add adherence criterion to judge prompt and fix score mapping - Update cross-review.md: remove incorrect 'positional' prompt docs

- Pass curl auth headers via --config process substitution instead of -H flags to prevent credential exposure in process list (Augment review) - Remove head -100 truncation on diff output that could cause flaky results with large model outputs (Augment review) - Remove unnecessary 2>/dev/null on diff command

marcusquinn · 2026-02-25T13:57:45Z

All review feedback has been addressed:

Already fixed in prior commit (2e79284):

Provider prefix stripping for Anthropic API model IDs (CodeRabbit)
Judge failure now returns non-zero exit code (CodeRabbit)
Consolidated jq calls with --arg for safe variable passing (Gemini)
Removed 2>/dev/null from curl and cmd_score calls (Gemini)
Docs updated: --prompt is required, no positional (CodeRabbit)
Dedicated adherence field from judge JSON, not mapped from overall (Augment)

Fixed in this commit (2415944):

Auth headers passed via curl --config <(...) process substitution to avoid credential exposure in ps (Augment)
Removed head -100 truncation on diff output that could cause flaky results with large model outputs (Augment)
Removed unnecessary 2>/dev/null on diff command

@coderabbitai please re-review

coderabbitai · 2026-02-25T13:58:06Z

@marcusquinn: Thank you for the comprehensive fixes! I can see you've addressed all the feedback systematically:

Security & reliability improvements:

✅ Auth headers now protected via process substitution (no credential exposure in ps)
✅ Judge failures properly return non-zero exit codes
✅ Safe variable passing with jq --arg

Output handling improvements:

✅ Removed head -100 truncation that could cause flaky results
✅ Proper error visibility (removed unnecessary 2>/dev/null)

API & documentation improvements:

✅ Provider prefix stripping for Anthropic models
✅ Dedicated adherence field handling
✅ Docs clarify --prompt as required

I'll trigger a fresh review of the changes now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

All review feedback addressed in commit 2e79284

marcusquinn · 2026-02-25T14:09:30Z

Closing: superseded by PR #2267 (merged as 7beb161), which included the t1329 cross-review judge pipeline alongside t1328. The review feedback from this PR (CodeRabbit, Gemini, Augment) was addressed in the final commits of #2267. Merge conflicts are due to the overlapping scope already landed on main.

gemini-code-assist bot reviewed Feb 25, 2026

View reviewed changes

augmentcode bot reviewed Feb 25, 2026

View reviewed changes

coderabbitai bot previously requested changes Feb 25, 2026

View reviewed changes

Comment thread .agents/scripts/commands/cross-review.md

Comment thread .agents/scripts/compare-models-helper.sh

Comment thread .agents/scripts/compare-models-helper.sh

marcusquinn mentioned this pull request Feb 25, 2026

t1329: Cross-review judge pipeline and /cross-review slash command #2262

Closed

6 tasks

marcusquinn added the needs-review Flagged for human review by AI supervisor label Feb 25, 2026

This was referenced Feb 25, 2026

t1328: Matterbridge agent for multi-platform chat bridging #2254

Closed

[Supervisor:marcusquinn] 0 queued, 0 working, 7 in review, 7 blocked at 09:17 UTC #2199

Closed

marcusquinn added 2 commits February 25, 2026 13:55

marcusquinn closed this Feb 25, 2026

	if cmd_score "${score_args[@]}" 2>/dev/null; then
	if cmd_score "${score_args[@]}"; then

-			m_corr=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].correctness // 0" 2>/dev/null)
-			m_comp=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].completeness // 0" 2>/dev/null)
-			m_qual=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].quality // 0" 2>/dev/null)
-			m_clar=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].clarity // 0" 2>/dev/null)
-			m_overall=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].overall // 0" 2>/dev/null)
-			m_str=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].strengths // \"\"" 2>/dev/null)
-			m_wea=$(echo "$judge_json" | jq -r ".scores[\"${model_name}\"].weaknesses // \"\"" 2>/dev/null)
+read -r m_corr m_comp m_qual m_clar m_overall m_str m_wea <<< "$(echo "$judge_json" | jq -r --arg model_name "$model_name" '
+        .scores[$model_name] |
+        [
+            (.correctness // 0),
+            (.completeness // 0),
+            (.quality // 0),
+            (.clarity // 0),
+            (.overall // 0),
+            (.strengths // ""),
+            (.weaknesses // "")
+        ] | @tsv
+    ')"

Conversation

marcusquinn commented Feb 25, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Usage

Summary by CodeRabbit

New Features

Uh oh!

gemini-code-assist bot commented Feb 25, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

Possibly Related PRs

Poem

Uh oh!

github-actions bot commented Feb 25, 2026

🔍 Code Quality Report

📈 Current Quality Metrics

Uh oh!

sonarqubecloud bot commented Feb 25, 2026

Quality Gate passed

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot commented Feb 25, 2026

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marcusquinn commented Feb 25, 2026

Flagged for Human Review

Uh oh!

marcusquinn commented Feb 25, 2026

Uh oh!

coderabbitai bot commented Feb 25, 2026

Uh oh!

marcusquinn commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

marcusquinn commented Feb 25, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 25, 2026 •

edited

Loading