
Add test case for anyOf union flattening bug in MCP schemas#1838

Open
aantn wants to merge 1 commit into master from claude/fix-mcp-union-flattening-G0eMq

Conversation

@aantn
Collaborator

@aantn aantn commented Mar 24, 2026

Summary

This PR adds a test fixture to verify HolmesGPT's handling of anyOf union types in MCP tool schemas. The test is designed to expose a known bug where union types are incorrectly flattened to only the first schema variant, making alternative parameter shapes unreachable.

Changes

  • mcp_routing_server.py: New MCP server implementation that exposes a query_routing tool with a selector parameter containing two distinct object shapes via anyOf:

    • Shape A: {level: string} → returns escalation policy information
    • Shape B: {group: string, active: boolean} → returns on-call team members with a verification code
    • Shape B's parameter names are intentionally omitted from descriptions to force discovery via schema parsing alone
  • test_case.yaml: Test specification that validates the LLM can:

    • Discover and use the Shape B selector variant from the anyOf schema
    • Query team-alpha's active rotation members
    • Extract the verification code ONCALL-EVAL-9r4w7z (only returned by Shape B)
    • Report the correct person (Alice Chen) and role (senior-sre)
  • toolsets.yaml: Configuration enabling the alert-routing MCP toolset for this test
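For orientation, the selector parameter described above might be declared roughly as follows. This is a sketch reconstructed from the description in this PR, not the contents of mcp_routing_server.py: the names SELECTOR_SCHEMA and matching_shape are hypothetical, and marking both group and active as required is an assumption.

```python
from typing import Any, Optional

# Hypothetical reconstruction of the `selector` schema (not the real fixture).
SELECTOR_SCHEMA: dict[str, Any] = {
    "anyOf": [
        {  # Shape A: escalation policy lookup
            "type": "object",
            "properties": {"level": {"type": "string"}},
            "required": ["level"],
        },
        {  # Shape B: on-call member lookup (assumed: both keys required)
            "type": "object",
            "properties": {
                "group": {"type": "string"},
                "active": {"type": "boolean"},
            },
            "required": ["group", "active"],
        },
    ]
}

# Maps the two JSON Schema types used above to Python types.
_JSON_TYPES = {"string": str, "boolean": bool}


def matching_shape(selector: dict[str, Any]) -> Optional[int]:
    """Return the index of the first anyOf branch the selector satisfies."""
    for index, branch in enumerate(SELECTOR_SCHEMA["anyOf"]):
        props = branch["properties"]
        # Require exactly the branch's keys, each with the declared type.
        if set(selector) != set(branch["required"]):
            continue
        if all(isinstance(selector[k], _JSON_TYPES[props[k]["type"]]) for k in props):
            return index
    return None
```

A caller that only ever sees the first branch can produce {"level": ...} but has no way to learn that {"group": ..., "active": ...} is also accepted.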

Implementation Details

  • The test is intentionally marked as RED (expected to fail) using TDD methodology
  • The verification code is only obtainable through Shape B, so a correct answer is very unlikely to be hallucinated and clearly indicates whether the LLM used the correct schema variant
  • Static test data includes realistic on-call schedules and escalation policies
  • The validate_input=False flag on the MCP handler allows testing of schema discovery without premature validation

https://claude.ai/code/session_015TV8BrsgRZXvnZjXF5g94H

Summary by CodeRabbit

  • Tests
    • Added test fixtures for MCP tool schema validation with union-type parameters
    • Includes comprehensive test cases for tool routing and parameter handling with complex schema structures

Eval 254 tests that anyOf unions in MCP tool schemas are preserved,
not flattened to the first non-null branch. The MCP server exposes a
tool with two object shapes in anyOf:
  Shape A: {level: string} -> escalation policies
  Shape B: {group: string, active: bool} -> on-call members + verification code

After flattening, only Shape A is visible. The user prompt requires Shape B,
which is undiscoverable without the anyOf schema. Verified RED: 0% pass rate
with Opus 4.6 via OpenRouter.
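The flattening described above can be illustrated with a small sketch. It is not HolmesGPT's actual conversion code; flatten_to_first_non_null and visible_properties are hypothetical helpers that only show why Shape B disappears once a union is reduced to its first non-null branch.

```python
from typing import Any


def flatten_to_first_non_null(schema: dict[str, Any]) -> dict[str, Any]:
    """Lossy conversion: keep only the first anyOf branch that is not null."""
    for branch in schema.get("anyOf", []):
        if branch.get("type") != "null":
            return branch
    return schema


def visible_properties(schema: dict[str, Any]) -> set[str]:
    """Collect property names reachable from a schema, across anyOf branches."""
    names = set(schema.get("properties", {}))
    for branch in schema.get("anyOf", []):
        names |= visible_properties(branch)
    return names


# Simplified stand-in for the tool's selector schema.
selector = {
    "anyOf": [
        {"type": "object", "properties": {"level": {"type": "string"}}},
        {
            "type": "object",
            "properties": {
                "group": {"type": "string"},
                "active": {"type": "boolean"},
            },
        },
    ]
}

before = visible_properties(selector)
after = visible_properties(flatten_to_first_non_null(selector))
```

Here before contains level, group and active, while after contains only level, which matches the symptom the eval is designed to catch.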

https://claude.ai/code/session_015TV8BrsgRZXvnZjXF5g94H
Signed-off-by: Claude <noreply@anthropic.com>

@claude claude bot left a comment


Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review.

Tip: disable this comment in your organization's Code Review settings.

@github-actions
Contributor

github-actions bot commented Mar 24, 2026

📂 Previous Runs

📜 #1 · Run @ 361162a (#23482277908) · Mar 24, 09:33 UTC

✅ Results of HolmesGPT evals

Automatically triggered by commit 361162a on branch claude/fix-mcp-union-flattening-G0eMq

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 10/10 test cases were successful, 0 regressions
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 09_crashpod | 31.0s | 4 | 9 | $0.2288 | 82,775 | 80,842 | 23,145 | 1,933 | 973 | 56,148 | 24,694 | | |
| | 101_loki_historical_logs_pod_deleted | 60.8s | 7 | 15 | $0.3322 | 170,708 | 167,439 | 28,374 | 3,269 | 903 | 138,274 | 29,165 | | |
| | 112_find_pvcs_by_uuid | 24.2s | 5 | 4 | $0.1999 | 96,123 | 94,846 | 20,961 | 1,277 | 328 | 73,872 | 20,974 | | |
| | 12_job_crashing | 34.3s | 5 | 12 | $0.2536 | 111,564 | 109,573 | 24,775 | 1,991 | 531 | 83,341 | 26,232 | | |
| | 176_network_policy_blocking_traffic_no_runbooks | 39.9s | 6 | 10 | $0.3863 | 124,116 | 121,990 | 23,434 | 2,126 | 405 | 74,347 | 47,643 | | |
| | 227_count_configmaps_per_namespace[0] | 23.5s | 4 | 9 | $0.1949 | 78,021 | 76,774 | 21,068 | 1,247 | 586 | 54,765 | 22,009 | | |
| | 243_pod_names_contain_service | 35.7s | 5 | 9 | $0.2312 | 101,999 | 99,941 | 22,552 | 2,058 | 596 | 77,376 | 22,565 | | |
| | 24_misconfigured_pvc | 36.5s | 5 | 13 | $0.2487 | 104,532 | 102,236 | 23,353 | 2,296 | 663 | 77,565 | 24,671 | | |
| | 43_current_datetime_from_prompt | 4.8s | 1 | | $0.1100 | 17,207 | 17,078 | 17,078 | 129 | 129 | 0 | 17,078 | | |
| | 61_exact_match_counting | 12.0s | 3 | 3 | $0.1399 | 53,309 | 52,931 | 18,069 | 378 | 231 | 34,851 | 18,080 | | |
| | Total | 30.3s avg | 4.5 avg | 9.3 avg | $2.3253 | 940,354 | 923,650 | 28,374 | 16,704 | 973 | 670,539 | 253,111 | | |
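
The token columns appear to be related: in the rows above, Total tokens equals Input plus Output, and Input equals Cached plus Non-cached. That relationship is inferred from the reported numbers, not from documentation of the eval harness. A quick check over three rows copied from the table (is_consistent is a hypothetical helper):

```python
# (test case, total, input, output, cached, non_cached), copied from the table.
ROWS = [
    ("09_crashpod", 82_775, 80_842, 1_933, 56_148, 24_694),
    ("43_current_datetime_from_prompt", 17_207, 17_078, 129, 0, 17_078),
    ("Total", 940_354, 923_650, 16_704, 670_539, 253_111),
]


def is_consistent(row: tuple) -> bool:
    """Check total = input + output and input = cached + non-cached."""
    _, total, input_tokens, output_tokens, cached, non_cached = row
    return (
        total == input_tokens + output_tokens
        and input_tokens == cached + non_cached
    )


all_consistent = all(is_consistent(row) for row in ROWS)
```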

Benchmark comparison unavailable: No ci-benchmark experiments found

Benchmark Comparison Details

Baseline: latest ci-benchmark experiment on master

Status: No ci-benchmark experiments found

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
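
As a rough sketch of how the thresholds listed above could be applied, the function below classifies a metric against a baseline. It is an illustration only; comparison_indicator is a hypothetical name, and the treatment of the exact 10% and 25% boundaries is an assumption, since the legend does not say which side they fall on.

```python
def comparison_indicator(baseline: float, current: float) -> tuple[str, bool]:
    """Return (label, significant) for a metric compared with its baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a percentage")
    diff_pct = (current - baseline) * 100 / baseline
    magnitude = abs(diff_pct)
    if magnitude < 10:
        # Within the noise threshold.
        return "±0%", False
    arrow = "↑" if diff_pct > 0 else "↓"
    # Assumption: exactly 25% still counts as the 10-25% band.
    return f"{arrow}{round(magnitude)}%", magnitude > 25
```

For example, a cost that moves from 100 to 120 would be labelled ↑20% and not flagged as significant, while a drop from 100 to 60 would be labelled ↓40% and flagged.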

⚠️ Eval Results (with failures)

Automatically triggered by commit 361162a on branch claude/fix-mcp-union-flattening-G0eMq (labels: evals-id-254)

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 0/1 test cases were successful, 1 regression
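| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 254_mcp_anyof_union_flattening | 21.9s | 4 | 6 | $0.1438 | 63,114 | 62,364 | 16,318 | 750 | 252 | 46,034 | 16,330 | | |
| | Total | 21.9s avg | 4.0 avg | 6.0 avg | $0.1438 | 63,114 | 62,364 | 16,318 | 750 | 252 | 46,034 | 16,330 | | |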

Benchmark comparison unavailable: No ci-benchmark experiments found


⚠️ 1 Failure Detected

📖 Legend
| Icon | Meaning |
|---|---|
| | The test was successful |
| | The test was skipped |
| ⚠️ | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
| | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually

⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect.

To test workflow changes, use the GitHub CLI or Actions UI instead:

gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-mcp-union-flattening-G0eMq -f markers=regression -f filter=

Option 1: Comment on this PR with /eval:

/eval
tags: regression

Or with more options (one per line):

/eval
model: gpt-4o
tags: regression
id: 09_crashpod
iterations: 5

Run evals on a different branch (e.g., master) for comparison:

/eval
branch: master
tags: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| tags | Pytest tags / markers (no default - runs all tests!) |
| id | Eval ID / pytest -k filter (use /list to see valid eval names) |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |
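
The comment format above (a /eval line followed by one "key: value" option per line) can be sketched as a small parser. This is not the workflow's real parsing logic, which is not shown in this PR; parse_eval_comment is a hypothetical helper, and the lower bound of 1 for iterations is an assumption.

```python
ALLOWED_OPTIONS = {"model", "tags", "id", "iterations", "branch"}


def parse_eval_comment(body: str) -> dict[str, str]:
    """Parse a '/eval' comment into a dict of option name to raw value."""
    lines = [line.strip() for line in body.strip().splitlines()]
    if not lines or lines[0] != "/eval":
        raise ValueError("comment must start with /eval")
    options: dict[str, str] = {}
    for line in lines[1:]:
        if not line:
            continue
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in ALLOWED_OPTIONS:
            raise ValueError(f"unrecognized option line: {line!r}")
        options[key] = value.strip()
    # The table caps iterations at 10; the minimum of 1 is assumed.
    if "iterations" in options and not 1 <= int(options["iterations"]) <= 10:
        raise ValueError("iterations must be between 1 and 10")
    return options
```

With the longer example above, the parser would return model gpt-4o, tags regression, id 09_crashpod and iterations 5 as strings.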

Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.

Option 2: Trigger via GitHub Actions UI → "Run workflow"

Option 3: Add PR labels to include extra evals (applies to both automatic runs and /eval comments):

| Label | Effect |
|---|---|
| evals-tag-<name> | Run tests with tag <name> alongside regression |
| evals-id-<name> | Run a specific eval by test ID |
| evals-model-<name> | Override the model (use model list name, e.g. sonnet-4.5) |

Examples: evals-tag-easy, evals-id-09_crashpod, evals-model-sonnet-4.5
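
The label prefixes above map naturally onto three buckets. The following sketch groups a PR's labels accordingly; classify_labels is a hypothetical helper and does not reflect how the GitHub Action actually reads labels.

```python
LABEL_PREFIXES = {
    "evals-tag-": "tags",
    "evals-id-": "ids",
    "evals-model-": "models",
}


def classify_labels(labels: list[str]) -> dict[str, list[str]]:
    """Group eval-related PR labels by prefix, ignoring unrelated labels."""
    result: dict[str, list[str]] = {"tags": [], "ids": [], "models": []}
    for label in labels:
        for prefix, bucket in LABEL_PREFIXES.items():
            if label.startswith(prefix):
                # Keep everything after the prefix, including inner hyphens.
                result[bucket].append(label[len(prefix):])
                break
    return result
```

This PR carries evals-id-254, which under this sketch would land in the ids bucket as "254".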

🏷️ Valid tags

benchmark, chain-of-causation, compaction, confluence, context_window, coralogix, counting, database, datadog, datetime, db-connectors, easy, elasticsearch, embeds, fast, frontend, grafana, hard, images, integration, kafka, kubernetes, leaked-information, logs, loki, mcp, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency

🤖 Valid models

deepseek-chat, deepseek-r1-reasoner, deepseek-reasoner, deepseek-v3.2-chat, gemini-3-flash-preview, gemini-3-pro-preview, gemini-3.1-pro-preview, gpt-4.1, gpt-5.2-high-reasoning, gpt-5.3-codex, gpt-5.4, haiku-4.5, kimi-2.5, kimi-2.5-openrouter, opus-4.5, opus-4.6, qwen-next-80B-instruct, qwen-next-80B-thinking, sonnet-4.5, sonnet-4.6


Commands: /eval · /rerun · /list

CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-mcp-union-flattening-G0eMq -f markers=regression -f filter=

@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

Walkthrough

Adds a new test fixture directory with an MCP stdio server that implements tool-based query routing, a test case file that validates the routing behavior, and a toolsets configuration file that registers the routing server as an enabled MCP toolset with stdio mode.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MCP Routing Server: tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py | Implements an MCP stdio server that exposes a query_routing tool. The tool accepts a selector parameter using anyOf schema with two object shapes: one with a level enum (critical/warning/info) and another with group string and active boolean. Routes incoming calls to group member lookups or level policy lookups based on selector shape, returning verification codes and role/policy details respectively. |
| Test Configuration: tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml, tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/toolsets.yaml | Defines a test case evaluating active rotation member lookup for a team with expected verification code and role identification, and registers the alert-routing MCP toolset in stdio mode alongside disabled Kubernetes and Helm toolset stubs. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • RoiGlinik
  • arikalon1
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main purpose of the changeset: adding a test case to verify handling of anyOf union flattening in MCP schemas. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions
Contributor

github-actions bot commented Mar 24, 2026

Docker images ready for 3fc6fb20 (built in 6m 55s)

⚠️ Warning: does not support ARM (ARM images are built on release only - not on every PR)

Use these tags to pull the images for testing.

📋 Copy commands

⚠️ Temporary images are deleted after 30 days. Copy to a permanent registry before using them:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:3fc6fb20
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:3fc6fb20 me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:3fc6fb20
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:3fc6fb20
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:3fc6fb20
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:3fc6fb20 me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:3fc6fb20
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:3fc6fb20

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:3fc6fb20 \
  --set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set operator.image=holmes-operator-dev:3fc6fb20

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:3fc6fb20 \
  --set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.operator.image=holmes-operator-dev:3fc6fb20

@netlify

netlify bot commented Mar 24, 2026

Deploy Preview for holmes-docs ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 361162a |
| 🔍 Latest deploy log | https://app.netlify.com/projects/holmes-docs/deploys/69c258cc063f36000847cf2b |
| 😎 Deploy Preview | https://deploy-preview-1838--holmes-docs.netlify.app |

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py (1)

88-92: Consider more specific type annotation.

For consistency with the codebase's type hint requirements, the parameter could use a more specific type.

📝 Suggested type hint refinement
-def _handle_query_routing(arguments: dict) -> str:
+def _handle_query_routing(arguments: dict[str, Any]) -> str:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py`
around lines 88 - 92, The function _handle_query_routing currently types its
parameter as a plain dict; change the signature to a more specific typing
annotation (e.g., arguments: Dict[str, Any] or arguments: Mapping[str, Any]) and
add the corresponding import from typing, then update the runtime type check
(selector retrieval remains the same but if you switch to Mapping consider
checking isinstance(selector, Mapping) or keep dict check depending on expected
concrete types) so the signature and checks match the chosen type hint (refer to
_handle_query_routing, selector variable, and the isinstance check).
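
A minimal sketch of what the suggested annotation could look like with the Mapping variant, where the runtime check matches the type hint. The body below is a placeholder for illustration; the real _handle_query_routing returns policy and on-call data that is not reproduced here, and handle_query_routing_sketch is a hypothetical name.

```python
from collections.abc import Mapping
from typing import Any


def handle_query_routing_sketch(arguments: Mapping[str, Any]) -> str:
    """Route on the selector's shape; placeholder return values only."""
    selector = arguments.get("selector")
    # Check against Mapping so the runtime test agrees with the annotation.
    if not isinstance(selector, Mapping):
        return "error: selector must be an object"
    if "group" in selector:
        return f"group lookup: {selector['group']}"
    if "level" in selector:
        return f"level lookup: {selector['level']}"
    return "error: unrecognized selector shape"
```

Using Mapping keeps the function usable with any read-only mapping, while dict[str, Any] with an isinstance(selector, dict) check is the narrower alternative named in the suggestion.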
tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml (1)

29-29: Consider whether include_tool_calls: true is necessary.

The verification code ONCALL-EVAL-9r4w7z in expected_output is already specific enough to confirm the LLM used Shape B—if the code is present, the correct tool call was made. However, keeping include_tool_calls: true is reasonable here since it aids debugging when the test is RED and helps verify the exact tool parameters used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml`
at line 29, The test currently sets include_tool_calls: true which is likely
redundant because the unique verification code in expected_output already
confirms the Shape B tool call; remove the include_tool_calls: true line from
the test YAML to reduce noise, or if you want to retain it for easier debugging
leave it but add an inline comment next to include_tool_calls (or a short
docstring in the test) explaining that it is intentionally kept for RED-run
debugging to avoid confusion; locate the flag by name (include_tool_calls) in
the test_case.yaml and apply the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d2abfc08-7a28-4e10-a5a1-844b5af2ac8d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b7d2aa and 361162a.

📒 Files selected for processing (3)
  • tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py
  • tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/toolsets.yaml
