Add test case for anyOf union flattening bug in MCP schemas#1838
Conversation
Eval 254 tests that anyOf unions in MCP tool schemas are preserved,
not flattened to the first non-null branch. The MCP server exposes a
tool with two object shapes in anyOf:
Shape A: {level: string} -> escalation policies
Shape B: {group: string, active: bool} -> on-call members + verification code
After flattening, only Shape A is visible. The user prompt requires Shape B,
which is undiscoverable without the anyOf schema. Verified RED: 0% pass rate
with Opus 4.6 via OpenRouter.
https://claude.ai/code/session_015TV8BrsgRZXvnZjXF5g94H
Signed-off-by: Claude <noreply@anthropic.com>
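The flattening behavior under test can be illustrated with a small sketch (hypothetical code, not the repository's implementation; the name `flatten_any_of` is invented for illustration, while the two shapes come from the test description):

```python
# A minimal sketch of how a naive schema "flattening" pass loses Shape B:
# it keeps only the first non-null anyOf branch, so only {level: string}
# remains discoverable by the model.
SELECTOR_SCHEMA = {
    "anyOf": [
        {   # Shape A -> escalation policies
            "type": "object",
            "properties": {"level": {"type": "string"}},
            "required": ["level"],
        },
        {   # Shape B -> on-call members + verification code
            "type": "object",
            "properties": {
                "group": {"type": "string"},
                "active": {"type": "boolean"},
            },
            "required": ["group", "active"],
        },
    ]
}


def flatten_any_of(schema: dict) -> dict:
    """Buggy transform: collapse anyOf to its first non-null branch."""
    branches = schema.get("anyOf")
    if not branches:
        return schema
    for branch in branches:
        if branch.get("type") != "null":
            return branch  # later branches (Shape B) are silently dropped
    return schema


flattened = flatten_any_of(SELECTOR_SCHEMA)
print(sorted(flattened["properties"]))  # → ['level']; Shape B's keys are gone
```

With only Shape A advertised, a prompt that requires `{group, active}` cannot be answered, which is why the eval goes RED.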
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review.
Tip: disable this comment in your organization's Code Review settings.
📂 Previous Runs
📜 #1 · Run @ 361162a (#23482277908) — Mar 24, 09:33 UTC

✅ Results of HolmesGPT evals
Automatically triggered by commit 361162a
| Status | Test case | Time | Turns | Tools | Cost | Total tokens | Input | Max input | Output | Max output | Cached | Non-cached | Reasoning | Compactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ❌ | 254_mcp_anyof_union_flattening | 21.9s | 4 | 6 | $0.1438 | 63,114 | 62,364 | 16,318 | 750 | 252 | 46,034 | 16,330 | — | — |
| | Total | 21.9s avg | 4.0 avg | 6.0 avg | $0.1438 | 63,114 | 62,364 | 16,318 | 750 | 252 | 46,034 | 16,330 | — | — |
Benchmark comparison unavailable: No ci-benchmark experiments found
Benchmark Comparison Details
Baseline: latest ci-benchmark experiment on master
Status: No ci-benchmark experiments found
Comparison indicators:
- ±0% — diff under 10% (within noise threshold)
- ↑N% / ↓N% — diff 10-25%
- ↑N% / ↓N% — diff over 25% (significant)
⚠️ 1 Failure Detected
📖 Legend
| Icon | Meaning |
|---|---|
| ✅ | The test was successful |
| ➖ | The test was skipped |
| | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
| ❌ | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually
⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect. To test workflow changes, use the GitHub CLI or Actions UI instead:
gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-mcp-union-flattening-G0eMq -f markers=regression -f filter=
Option 1: Comment on this PR with /eval:
/eval
tags: regression
Or with more options (one per line):
/eval
model: gpt-4o
tags: regression
id: 09_crashpod
iterations: 5
Run evals on a different branch (e.g., master) for comparison:
/eval
branch: master
tags: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| tags | Pytest tags / markers (no default - runs all tests!) |
| id | Eval ID / pytest -k filter (use /list to see valid eval names) |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |
Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.
Option 2: Trigger via GitHub Actions UI → "Run workflow"
Option 3: Add PR labels to include extra evals (applies to both automatic runs and /eval comments):
| Label | Effect |
|---|---|
| evals-tag-&lt;name&gt; | Run tests with tag &lt;name&gt; alongside regression |
| evals-id-&lt;name&gt; | Run a specific eval by test ID |
| evals-model-&lt;name&gt; | Override the model (use model list name, e.g. sonnet-4.5) |
Examples: evals-tag-easy, evals-id-09_crashpod, evals-model-sonnet-4.5
🏷️ Valid tags
benchmark, chain-of-causation, compaction, confluence, context_window, coralogix, counting, database, datadog, datetime, db-connectors, easy, elasticsearch, embeds, fast, frontend, grafana, hard, images, integration, kafka, kubernetes, leaked-information, logs, loki, mcp, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency
🤖 Valid models
deepseek-chat, deepseek-r1-reasoner, deepseek-reasoner, deepseek-v3.2-chat, gemini-3-flash-preview, gemini-3-pro-preview, gemini-3.1-pro-preview, gpt-4.1, gpt-5.2-high-reasoning, gpt-5.3-codex, gpt-5.4, haiku-4.5, kimi-2.5, kimi-2.5-openrouter, opus-4.5, opus-4.6, qwen-next-80B-instruct, qwen-next-80B-thinking, sonnet-4.5, sonnet-4.6
Commands: /eval · /rerun · /list
CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/fix-mcp-union-flattening-G0eMq -f markers=regression -f filter=
Walkthrough
Adds a new test fixture directory with an MCP stdio server that implements tool-based query routing, a test case file that validates the routing behavior, and a toolsets configuration file that registers the routing server as an enabled MCP toolset with stdio mode.

Changes

Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✅ Docker images ready for
Use these tags to pull the images for testing.

📋 Copy commands

```shell
gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:3fc6fb20
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:3fc6fb20 me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:3fc6fb20
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:3fc6fb20
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:3fc6fb20
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:3fc6fb20 me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:3fc6fb20
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:3fc6fb20
```

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

```shell
helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:3fc6fb20 \
  --set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set operator.image=holmes-operator-dev:3fc6fb20
```

Robusta wrapper chart:

```shell
helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:3fc6fb20 \
  --set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.operator.image=holmes-operator-dev:3fc6fb20
```
✅ Deploy Preview for holmes-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration.
🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py (1)
88-92: Consider more specific type annotation.

For consistency with the codebase's type hint requirements, the parameter could use a more specific type.

📝 Suggested type hint refinement

```diff
-def _handle_query_routing(arguments: dict) -> str:
+def _handle_query_routing(arguments: dict[str, Any]) -> str:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py` around lines 88-92, the function _handle_query_routing currently types its parameter as a plain dict; change the signature to a more specific typing annotation (e.g., arguments: Dict[str, Any] or arguments: Mapping[str, Any]) and add the corresponding import from typing, then update the runtime type check (selector retrieval remains the same, but if you switch to Mapping consider checking isinstance(selector, Mapping), or keep the dict check depending on expected concrete types) so the signature and checks match the chosen type hint (refer to _handle_query_routing, the selector variable, and the isinstance check).

tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml (1)
29-29: Consider whether `include_tool_calls: true` is necessary.

The verification code `ONCALL-EVAL-9r4w7z` in `expected_output` is already specific enough to confirm the LLM used Shape B: if the code is present, the correct tool call was made. However, keeping `include_tool_calls: true` is reasonable here since it aids debugging when the test is RED and helps verify the exact tool parameters used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml` at line 29, The test currently sets include_tool_calls: true which is likely redundant because the unique verification code in expected_output already confirms the Shape B tool call; remove the include_tool_calls: true line from the test YAML to reduce noise, or if you want to retain it for easier debugging leave it but add an inline comment next to include_tool_calls (or a short docstring in the test) explaining that it is intentionally kept for RED-run debugging to avoid confusion; locate the flag by name (include_tool_calls) in the test_case.yaml and apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py`:
- Around line 88-92: The function _handle_query_routing currently types its
parameter as a plain dict; change the signature to a more specific typing
annotation (e.g., arguments: Dict[str, Any] or arguments: Mapping[str, Any]) and
add the corresponding import from typing, then update the runtime type check
(selector retrieval remains the same but if you switch to Mapping consider
checking isinstance(selector, Mapping) or keep dict check depending on expected
concrete types) so the signature and checks match the chosen type hint (refer to
_handle_query_routing, selector variable, and the isinstance check).
In
`@tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml`:
- Line 29: The test currently sets include_tool_calls: true which is likely
redundant because the unique verification code in expected_output already
confirms the Shape B tool call; remove the include_tool_calls: true line from
the test YAML to reduce noise, or if you want to retain it for easier debugging
leave it but add an inline comment next to include_tool_calls (or a short
docstring in the test) explaining that it is intentionally kept for RED-run
debugging to avoid confusion; locate the flag by name (include_tool_calls) in
the test_case.yaml and apply the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d2abfc08-7a28-4e10-a5a1-844b5af2ac8d
📒 Files selected for processing (3)
- tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/mcp_routing_server.py
- tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/254_mcp_anyof_union_flattening/toolsets.yaml
Summary
This PR adds a test fixture to verify HolmesGPT's handling of `anyOf` union types in MCP tool schemas. The test is designed to expose a known bug where union types are incorrectly flattened to only the first schema variant, making alternative parameter shapes unreachable.

Changes
- `mcp_routing_server.py`: New MCP server implementation that exposes a `query_routing` tool with a `selector` parameter containing two distinct object shapes via `anyOf`:
  - `{level: string}` → returns escalation policy information
  - `{group: string, active: boolean}` → returns on-call team members with a verification code
- `test_case.yaml`: Test specification that validates the LLM can discover Shape B from the `anyOf` schema and return the verification code `ONCALL-EVAL-9r4w7z` (only returned by Shape B)
- `toolsets.yaml`: Configuration enabling the alert-routing MCP toolset for this test
Implementation Details
- The `validate_input=False` flag on the MCP handler allows testing of schema discovery without premature validation

https://claude.ai/code/session_015TV8BrsgRZXvnZjXF5g94H
Summary by CodeRabbit