Skip to content

Fix oom on first iteration in resharding#1

Open
tianzhencong wants to merge 1 commit intomainfrom
cursor/fix-oom-on-first-iteration-in-resharding-674d
Open

Fix oom on first iteration in resharding#1
tianzhencong wants to merge 1 commit intomainfrom
cursor/fix-oom-on-first-iteration-in-resharding-674d

Conversation

@tianzhencong
Copy link
Owner

What does this PR do?

This PR fixes an Out of Memory (OOM) issue occurring during multi-turn interaction processing, specifically when using multi_turn_interaction (related to verl-project#2189). The bug was caused by insufficient GPU memory during SGLang's memory operations, similar to the OOM issue fixed in verl-project#1911.

The fix re-introduces strategic get_torch_device().empty_cache() calls around memory-intensive operations within the SGLangRollout worker to ensure GPU cache is cleared before memory allocation, preventing OOM crashes.

Checklist Before Starting

Test

This fix allows the multi_turn_interaction scenarios that previously crashed due to OOM to run successfully.

API and Usage Example

No API changes were made. The fix is internal to memory management.

# No API changes or new usage examples.
# The existing multi-turn interaction examples should now run without OOM.

Design & Code Changes

High-Level Design:
The core design principle is to proactively clear the GPU memory cache (torch.cuda.empty_cache()) before and after critical memory-intensive operations within the SGLang rollout worker. This ensures that sufficient memory is available for SGLang's internal memory management (e.g., resume_memory_occupation) and prevents memory accumulation that leads to OOM.

Specific Changes:
Added get_torch_device().empty_cache() calls in verl/workers/rollout/sglang_rollout/sglang_rollout.py at the following points:

  1. Before and after interaction.generate_response: To clear cache around the main interaction processing loop.
  2. Before and after tool processing: Specifically around asyncio.gather(*tool_reward_tasks).
  3. Before and after tool creation: Specifically around asyncio.gather(*tool_creation_coroutines) in _handle_pending_state.
  4. Before and after interaction initialization: Specifically around interaction.start_interaction in _handle_pending_state.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Open in Cursor Open in Web

Learn more about Cursor Agents

Co-authored-by: t.ianzhencong <t.ianzhencong@gmail.com>
@tianzhencong tianzhencong marked this pull request as ready for review August 4, 2025 07:02
@MetalMedicine
Copy link

Is this commit missing this?:
from verl.utils.device import get_torch_device

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants