[tool, rollout] refactor: cache tool list initialization with MCP timeout #5426
denismegerle wants to merge 3 commits into verl-project:main
Conversation
- Updated the tool initialization process in `ToolAgentLoop` to use `ToolListCache`, improving efficiency and reducing redundancy.
- Enhanced `initialize_tools_from_config` to support timeout handling during tool initialization, improving robustness.
- Introduced a caching mechanism to store initialized tools, minimizing repeated initialization and optimizing performance.
Add a small unit test that validates ToolListCache caching semantics, deep-copy isolation, and retry/backoff behavior. Made-with: Cursor
Denis Megerle does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you already signed the CLA but the status is still pending? Let us recheck it.
Code Review
This pull request introduces a process-local cache for tool list initialization and adds a timeout for MCP tool initialization, which are valuable improvements for performance and robustness. The implementation is mostly solid, with good test coverage for the new caching and retry logic. However, I've identified a significant performance bottleneck in the cache implementation related to lock contention. My review includes a detailed comment and a code suggestion to address this issue by refactoring the locking strategy to avoid holding a global lock during long-running I/O operations.
```python
def get_tool_list(cls, tools_config_file: str | None) -> list:
    if not tools_config_file:
        return []

    with cls._lock:
        cached = cls._cache.get(tools_config_file)
        if cached is None:
            retries, timeout_s, backoff_base_s, backoff_max_s = cls._get_init_policy()
            logger.debug(
                "[ToolListCache] cache miss for %s, initializing tools (retries=%s, timeout_s=%.2f)",
                tools_config_file,
                retries,
                timeout_s,
            )
            last_error = None
            for attempt in range(1, retries + 1):
                try:
                    cached = initialize_tools_from_config(tools_config_file, mcp_init_timeout_s=timeout_s)
                    break
                except Exception as e:
                    last_error = e
                    if attempt >= retries:
                        raise RuntimeError(
                            f"[ToolListCache] failed to initialize tools for {tools_config_file} "
                            f"after {retries} attempts"
                        ) from e

                    backoff_s = min(backoff_max_s, backoff_base_s * (2 ** (attempt - 1)))
                    logger.warning(
                        "[ToolListCache] init attempt %s/%s failed for %s: %s. Retrying in %.2fs",
                        attempt,
                        retries,
                        tools_config_file,
                        repr(e),
                        backoff_s,
                    )
                    if backoff_s > 0:
                        time.sleep(backoff_s)

            if cached is None:
                # Defensive fallback: should be unreachable due to RuntimeError above.
                raise RuntimeError(
                    f"[ToolListCache] initialization produced no tool list for {tools_config_file}"
                ) from last_error
            cls._cache[tools_config_file] = cached
        else:
            logger.debug(f"[ToolListCache] cache hit for {tools_config_file}")

    return copy.deepcopy(cached)
```
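The capped exponential backoff used above can be sketched numerically. The base and cap values below are illustrative assumptions, not the actual values returned by `_get_init_policy()`:

```python
# Illustrative backoff schedule for min(backoff_max_s, backoff_base_s * 2**(attempt - 1)).
# base 0.5s and cap 4.0s are example assumptions, not the policy defaults.
backoff_base_s, backoff_max_s = 0.5, 4.0

delays = [
    min(backoff_max_s, backoff_base_s * (2 ** (attempt - 1)))
    for attempt in range(1, 6)
]
print(delays)  # → [0.5, 1.0, 2.0, 4.0, 4.0]: doubles each attempt until hitting the cap
```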
The current implementation of get_tool_list holds a global lock (cls._lock) for the entire duration of tool initialization. This includes potentially long-running operations like network calls in initialize_tools_from_config and time.sleep() during backoff. This creates a major contention point, as any thread trying to access the cache will be blocked, even if it's for a different and already-cached tool list. This can severely degrade performance under concurrent load.
To resolve this, the lock should only be held for brief, critical sections that access or modify the shared cache dictionary. The expensive initialization logic should be executed outside of the lock. A common and robust pattern to handle this and prevent the "thundering herd" problem (multiple threads trying to initialize the same resource) is to use a concurrent.futures.Future object as a placeholder in the cache while initialization is in progress.
I'm suggesting a refactoring of this method to implement this pattern, which will significantly improve concurrency and performance.
```python
@classmethod
def get_tool_list(cls, tools_config_file: str | None) -> list:
    if not tools_config_file:
        return []

    with cls._lock:
        cached_item = cls._cache.get(tools_config_file)
        if cached_item is None:
            # Cache miss: we become the initializer. Place a Future in the cache
            # as a placeholder so concurrent callers wait on it instead of
            # re-running the expensive initialization ("thundering herd").
            future = concurrent.futures.Future()
            cls._cache[tools_config_file] = future
        else:
            future = None

    if future is None:
        # Another thread owns (or owned) this entry. Wait outside the lock so
        # lookups for other, already-cached config files are not blocked.
        if isinstance(cached_item, concurrent.futures.Future):
            logger.debug(f"[ToolListCache] waiting for initialization of {tools_config_file}")
            return copy.deepcopy(cached_item.result())  # Re-raises initializer exceptions.
        logger.debug(f"[ToolListCache] cache hit for {tools_config_file}")
        return copy.deepcopy(cached_item)

    # Perform initialization outside the lock.
    try:
        retries, timeout_s, backoff_base_s, backoff_max_s = cls._get_init_policy()
        logger.debug(
            "[ToolListCache] cache miss for %s, initializing tools (retries=%s, timeout_s=%.2f)",
            tools_config_file,
            retries,
            timeout_s,
        )
        for attempt in range(1, retries + 1):
            try:
                initialized_tools = initialize_tools_from_config(tools_config_file, mcp_init_timeout_s=timeout_s)
                future.set_result(initialized_tools)
                with cls._lock:
                    cls._cache[tools_config_file] = initialized_tools
                return copy.deepcopy(initialized_tools)
            except Exception as e:
                if attempt >= retries:
                    raise RuntimeError(
                        f"[ToolListCache] failed to initialize tools for {tools_config_file} "
                        f"after {retries} attempts"
                    ) from e
                backoff_s = min(backoff_max_s, backoff_base_s * (2 ** (attempt - 1)))
                logger.warning(
                    "[ToolListCache] init attempt %s/%s failed for %s: %s. Retrying in %.2fs",
                    attempt,
                    retries,
                    tools_config_file,
                    repr(e),
                    backoff_s,
                )
                if backoff_s > 0:
                    time.sleep(backoff_s)
        # Defensive fallback: unreachable, since the loop above either returns or raises.
        raise RuntimeError(
            f"[ToolListCache] initialization produced no tool list for {tools_config_file}"
        )
    except Exception as e:
        # On failure, propagate the error to waiters and remove the placeholder
        # so later callers can retry.
        future.set_exception(e)
        with cls._lock:
            if cls._cache.get(tools_config_file) is future:
                del cls._cache[tools_config_file]
        raise
```

---

(From verl-project#1105) If we use SGLang as the rollout engine, we should export `SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK` to avoid errors from unbalanced memory capacity; please refer to [verl-project#5426 in sglang](sgl-project/sglang#5426).

# Why should we export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK when using SGLang as the rollout engine in verl?

1. verl initializes an SGLangRollout module during rollout, which is used to evaluate/generate samples.
2. SGLangRollout initializes VerlEngine, which in turn initializes a `torch.distributed.DeviceMesh` to support TP.
3. `DeviceMesh` initialization internally checks the free GPU memory of all participating devices; if the difference is too large (more than about 10%), it raises an error directly, to prevent initialization failures or communication deadlock.

# Why might GPU memory be inconsistent?

## Ray distributed actors load the model at different times

verl uses Ray for multi-process, multi-GPU concurrent training, and each `WorkerDict` may call `self.rollout = SGLangRollout(...)` at a different time. Workers that initialize the model at different times show different memory usage.

## Delayed initialization causes memory bias

Some workers enter the model loading/inference path earlier than others, e.g. via `generate_sequences()` or `compute_log_prob()`. GPU memory on early-loading workers has already been consumed by the model, while memory on late-loading workers is still empty, so the gap is large.

## verl+SGLang's TP initialization broadcasts across all devices, with no uniform release timing

SGLangRollout only needs the GPUs used by the rollout workers, but its VerlEngine initialization calls `torch.distributed.init_process_group()` and broadcasts a set of weights. As a result, non-rollout GPUs also participate in the communication; DeviceMesh is then initialized, and the "inconsistent memory" error is reported.
## Different loading modes of FSDP/TP models also cause deviations

If the following parameters are set:

```
actor.fsdp_config.param_offload=True
ref.fsdp_config.param_offload=True
```

some workers' parameters are on the CPU, while other workers' parameters are already sharded onto the GPU. This also creates an asymmetric distribution of GPU memory.

---------

Co-authored-by: ocss884 <ocss.lin@gmail.com>
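As a minimal sketch, the workaround described above amounts to exporting the variable before launching the rollout; the variable name comes from the text, while the surrounding commands are illustrative:

```shell
# Disable SGLang's TP memory-imbalance check before starting training.
# Only the variable name is from the PR discussion; the echo is illustrative.
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1
echo "SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=${SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK}"
```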
What does this PR do?
Currently, in high-concurrency situations the downstream MCP servers might congest, which can leave verl hanging indefinitely in some tool agent loop instances. To avoid this, this PR adds 1) retries for tool schema retrieval and 2) assuming fixed tool schemas, a caching mechanism per config file.
Checklist Before Starting
The PR title is in the format `[{modules}] {type}: {description}` (this will be checked by the CI).

- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
- If the change is breaking, add `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Added `tests/utils/test_tool_list_cache_on_cpu.py` for cache semantics (deep-copy isolation + retry/backoff).

API and Usage Example
Design & Code Changes
- Added `ToolListCache` in `verl/tools/utils/tool_registry.py`, keyed by tool config path.
- Added `mcp_init_timeout_s` to `initialize_tools_from_config(...)` and enforced it on MCP init.
- Updated `ToolAgentLoop` to use `ToolListCache.get_tool_list(...)`.

Notes / trade-offs:
- `ToolListCache` currently returns `copy.deepcopy(...)` of cached tool objects to avoid shared mutable state across loops. If any tool implementation is not deepcopy-safe (e.g. holds non-copyable resources), we can revise the cache to store schemas/configs instead of tool instances.

Checklist Before Submitting
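The deep-copy isolation the cache relies on can be shown in a few lines; `ToolListCache` itself is not reproduced here, only the copy semantics it depends on, with a hypothetical tool entry as the cached value:

```python
import copy

# Simulate a cached tool list: mutable objects shared via a process-local cache.
# The "search" entry is a hypothetical example, not a real verl tool.
cached = [{"name": "search", "params": {"top_k": 5}}]

# Returning a deepcopy (as ToolListCache does) isolates callers from each other.
caller_view = copy.deepcopy(cached)
caller_view[0]["params"]["top_k"] = 99  # caller mutates its own copy only

print(cached[0]["params"]["top_k"])  # → 5: the cached original is untouched
```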
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI through the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.