Add test case for anyOf union flattening bug in MCP schemas#1838
Conversation
Eval 254 tests that anyOf unions in MCP tool schemas are preserved,
not flattened to the first non-null branch. The MCP server exposes a
tool with two object shapes in anyOf:
Shape A: {level: string} -> escalation policies
Shape B: {group: string, active: bool} -> on-call members + verification code
After flattening, only Shape A is visible. The user prompt requires Shape B,
which is undiscoverable without the anyOf schema. Verified RED: 0% pass rate
with Opus 4.6 via OpenRouter.
https://claude.ai/code/session_015TV8BrsgRZXvnZjXF5g94H
Signed-off-by: Claude <noreply@anthropic.com>
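The flattening behavior under test can be illustrated with a small sketch (hypothetical code, not the repository's implementation; the name `flatten_any_of` is invented for illustration, while the two shapes come from the test description):

```python
# A minimal sketch of how a naive schema "flattening" pass loses Shape B:
# it keeps only the first non-null anyOf branch, so only {level: string}
# remains discoverable by the model.
SELECTOR_SCHEMA = {
    "anyOf": [
        {   # Shape A -> escalation policies
            "type": "object",
            "properties": {"level": {"type": "string"}},
            "required": ["level"],
        },
        {   # Shape B -> on-call members + verification code
            "type": "object",
            "properties": {
                "group": {"type": "string"},
                "active": {"type": "boolean"},
            },
            "required": ["group", "active"],
        },
    ]
}


def flatten_any_of(schema: dict) -> dict:
    """Buggy transform: collapse anyOf to its first non-null branch."""
    branches = schema.get("anyOf")
    if not branches:
        return schema
    for branch in branches:
        if branch.get("type") != "null":
            return branch  # later branches (Shape B) are silently dropped
    return schema


flattened = flatten_any_of(SELECTOR_SCHEMA)
print(sorted(flattened["properties"]))  # → ['level']; Shape B's keys are gone
```

With only Shape A advertised, a prompt that requires `{group, active}` cannot be answered, which is why the eval goes RED.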
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review.
Tip: disable this comment in your organization's Code Review settings.
📂 Previous Runs
📜 #1 · Run @ 361162a (#23482277908) — Mar 24, 09:33 UTC

✅ Results of HolmesGPT evals
Automatically triggered by commit 361162a
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ❌ | 254_mcp_anyof_union_flattening | 21.9s | 4 | 6 | $0.1438 | 63,114 | 62,364 | 16,318 | 750 | 252 | 46,034 | 16,330 | — | — |
| | Total | 21.9s avg | 4.0 avg | 6.0 avg | $0.1438 | 63,114 | 62,364 | 16,318 | 750 | 252 | 46,034 | 16,330 | — | — |
Benchmark comparison unavailable: No ci-benchmark experiments found
Benchmark Comparison Details
Baseline: latest ci-benchmark experiment on master
Status: No ci-benchmark experiments found
Comparison indicators:
- ±0% — diff under 10% (within noise threshold)
- ↑N% / ↓N% — diff 10-25%
- ↑N% / ↓N% — diff over 25% (significant)
⚠️ 1 Failure Detected
📖 Legend
| Icon | Meaning |
|---|---|
| ✅ | The test was successful |
| ➖ | The test was skipped |
| | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
| ❌ | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually
⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect. To test workflow changes, use the GitHub CLI or Actions UI instead:
gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-mcp-union-flattening-G0eMq -f markers=regression -f filter=
Option 1: Comment on this PR with /eval:
/eval
tags: regression
Or with more options (one per line):
/eval
model: gpt-4o
tags: regression
id: 09_crashpod
iterations: 5
Run evals on a different branch (e.g., master) for comparison:
/eval
branch: master
tags: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| tags | Pytest tags / markers (no default - runs all tests!) |
| id | Eval ID / pytest -k filter (use /list to see valid eval names) |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |
Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.
Option 2: Trigger via GitHub Actions UI → "Run workflow"
Option 3: Add PR labels to include extra evals (applies to both automatic runs and /eval comments):
| Label | Effect |
|---|---|
| evals-tag-&lt;name&gt; | Run tests with tag &lt;name&gt; alongside regression |
| evals-id-&lt;name&gt; | Run a specific eval by test ID |
| evals-model-&lt;name&gt; | Override the model (use model list name, e.g. sonnet-4.5) |
Examples: evals-tag-easy, evals-id-09_crashpod, evals-model-sonnet-4.5
🏷️ Valid tags
benchmark, chain-of-causation, compaction, confluence, context_window, coralogix, counting, database, datadog, datetime, db-connectors, easy, elasticsearch, embeds, fast, frontend, grafana, hard, images, integration, kafka, kubernetes, leaked-information, logs, loki, mcp, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency
🤖 Valid models
deepseek-chat, deepseek-r1-reasoner, deepseek-reasoner, deepseek-v3.2-chat, gemini-3-flash-preview, gemini-3-pro-preview, gemini-3.1-pro-preview, gpt-4.1, gpt-5.2-high-reasoning, gpt-5.3-codex, gpt-5.4, haiku-4.5, kimi-2.5, kimi-2.5-openrouter, opus-4.5, opus-4.6, qwen-next-80B-instruct, qwen-next-80B-thinking, sonnet-4.5, sonnet-4.6
Commands: /eval · /rerun · /list
CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-mcp-union-flattening-G0eMq -f markers=regression -f filter=
Walkthrough
Adds a new test fixture directory with an MCP stdio server that implements tool-based query routing, a test case file that validates the routing behavior, and a toolsets configuration file that registers the routing server as an enabled MCP toolset with stdio mode.

Changes

Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✅ Docker images ready for
Use these tags to pull the images for testing.

📋 Copy commands

```shell
gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:3fc6fb20
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:3fc6fb20 me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:3fc6fb20
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:3fc6fb20
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:3fc6fb20
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:3fc6fb20 me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:3fc6fb20
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:3fc6fb20
```

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

```shell
helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:3fc6fb20 \
  --set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set operator.image=holmes-operator-dev:3fc6fb20
```

Robusta wrapper chart:

```shell
helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:3fc6fb20 \
  --set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.operator.image=holmes-operator-dev:3fc6fb20
```
✅ Deploy Preview for holmes-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration.
🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py (1)
88-92: Consider more specific type annotation.

For consistency with the codebase's type hint requirements, the parameter could use a more specific type.

📝 Suggested type hint refinement

```diff
-def _handle_query_routing(arguments: dict) -> str:
+def _handle_query_routing(arguments: dict[str, Any]) -> str:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py` around lines 88-92, the function _handle_query_routing currently types its parameter as a plain dict; change the signature to a more specific typing annotation (e.g., arguments: Dict[str, Any] or arguments: Mapping[str, Any]) and add the corresponding import from typing, then update the runtime type check (selector retrieval remains the same, but if you switch to Mapping consider checking isinstance(selector, Mapping), or keep the dict check depending on expected concrete types) so the signature and checks match the chosen type hint (refer to _handle_query_routing, the selector variable, and the isinstance check).

tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml (1)
29-29: Consider whether `include_tool_calls: true` is necessary.

The verification code `ONCALL-EVAL-9r4w7z` in `expected_output` is already specific enough to confirm the LLM used Shape B: if the code is present, the correct tool call was made. However, keeping `include_tool_calls: true` is reasonable here since it aids debugging when the test is RED and helps verify the exact tool parameters used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml` at line 29, The test currently sets include_tool_calls: true which is likely redundant because the unique verification code in expected_output already confirms the Shape B tool call; remove the include_tool_calls: true line from the test YAML to reduce noise, or if you want to retain it for easier debugging leave it but add an inline comment next to include_tool_calls (or a short docstring in the test) explaining that it is intentionally kept for RED-run debugging to avoid confusion; locate the flag by name (include_tool_calls) in the test_case.yaml and apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py`:
- Around line 88-92: The function _handle_query_routing currently types its
parameter as a plain dict; change the signature to a more specific typing
annotation (e.g., arguments: Dict[str, Any] or arguments: Mapping[str, Any]) and
add the corresponding import from typing, then update the runtime type check
(selector retrieval remains the same but if you switch to Mapping consider
checking isinstance(selector, Mapping) or keep dict check depending on expected
concrete types) so the signature and checks match the chosen type hint (refer to
_handle_query_routing, selector variable, and the isinstance check).
In
`@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml`:
- Line 29: The test currently sets include_tool_calls: true which is likely
redundant because the unique verification code in expected_output already
confirms the Shape B tool call; remove the include_tool_calls: true line from
the test YAML to reduce noise, or if you want to retain it for easier debugging
leave it but add an inline comment next to include_tool_calls (or a short
docstring in the test) explaining that it is intentionally kept for RED-run
debugging to avoid confusion; locate the flag by name (include_tool_calls) in
the test_case.yaml and apply the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d2abfc08-7a28-4e10-a5a1-844b5af2ac8d
📒 Files selected for processing (3)
- tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py
- tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/toolsets.yaml
Summary
This PR adds a test fixture to verify HolmesGPT's handling of `anyOf` union types in MCP tool schemas. The test is designed to expose a known bug where union types are incorrectly flattened to only the first schema variant, making alternative parameter shapes unreachable.

Changes
- `mcp_routing_server.py`: New MCP server implementation that exposes a `query_routing` tool with a `selector` parameter containing two distinct object shapes via `anyOf`:
  - `{level: string}` → returns escalation policy information
  - `{group: string, active: boolean}` → returns on-call team members with a verification code
- `test_case.yaml`: Test specification that validates the LLM can discover Shape B from the `anyOf` schema and return the verification code `ONCALL-EVAL-9r4w7z` (only returned by Shape B)
- `toolsets.yaml`: Configuration enabling the alert-routing MCP toolset for this test
Implementation Details
- The `validate_input=False` flag on the MCP handler allows testing of schema discovery without premature validation

https://claude.ai/code/session_015TV8BrsgRZXvnZjXF5g94H
Summary by CodeRabbit