Skip to content

Comments

[tool] fix: add timeout for MCP tool calls#5361

Open
denismegerle wants to merge 1 commit intoverl-project:mainfrom
denismegerle:public/mcp-client-timeout
Open

[tool] fix: add timeout for MCP tool calls#5361
denismegerle wants to merge 1 commit intoverl-project:mainfrom
denismegerle:public/mcp-client-timeout

Conversation

@denismegerle
Copy link
Contributor

@denismegerle denismegerle commented Feb 20, 2026

What does this PR do?

Prevent hangs when calling MCP tools by enforcing a timeout around call_tool_mcp in MCPClientManager.call_tool().

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: https://github.com/verl-project/verl/pulls?q=is%3Apr+McpClientManager+wait_for+call_tool_mcp+timeout
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always (local) passed.

API and Usage Example

No API change beyond existing timeout parameter behavior. On timeout, asyncio.TimeoutError will be raised.

# Example usage
result = await ClientManager.call_tool(
    tool_name="some_tool",
    parameters={"foo": "bar"},
    timeout=30,
)

Design & Code Changes

  • Wrap client.call_tool_mcp(...) in asyncio.wait_for(..., timeout=timeout).

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • (not needed) Add / Update the documentation.
  • (not needed) Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: the change is a one-line timeout wrapper around a networked MCP tool call; a meaningful test would need a hermetic fastmcp transport/mock or an integration harness, which isn’t currently present here.
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
  • (not relevant) If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

@denismegerle denismegerle marked this pull request as ready for review February 20, 2026 15:23
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces a timeout for MCP tool calls using asyncio.wait_for, which is a necessary improvement to prevent indefinite hangs. However, the current implementation only wraps the tool execution itself. To fully address the risk of hangs, the timeout should ideally encompass the connection phase and the rate-limiting wait, as these stages can also block indefinitely in networked or congested environments.

client = self.get_client_with_tool_name(tool_name)
async with client:
return await client.call_tool_mcp(tool_name, parameters)
return await asyncio.wait_for(client.call_tool_mcp(tool_name, parameters), timeout=timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The timeout is currently applied only to the client.call_tool_mcp execution. It does not cover the async with client block, which handles the connection to the MCP server. If the connection itself hangs (e.g., due to network issues or server unresponsiveness during the handshake), the tool call will still block indefinitely, failing the PR's objective. Additionally, the rate-limiting wait (lines 60-61) is also outside the timeout. Consider wrapping the entire operation in a way that the timeout covers both connection and execution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant