Fix oom on first iteration in resharding by tianzhencong · Pull Request #1 · tianzhencong/verl

tianzhencong · 2025-08-02T03:56:01Z

What does this PR do?

This PR fixes an Out of Memory (OOM) error that occurs during multi-turn interaction processing when using multi_turn_interaction (related to verl-project#2189). The error was caused by insufficient free GPU memory during SGLang's memory operations, similar to the OOM issue fixed in verl-project#1911.

The fix re-introduces strategic get_torch_device().empty_cache() calls around memory-intensive operations within the SGLangRollout worker to ensure GPU cache is cleared before memory allocation, preventing OOM crashes.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: sglang multiturn problem: the actor died after every first epoch trained verl-project/verl#2189
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- [sglang, worker] fix: OOM in multi-turn interaction

Test

This fix allows the multi_turn_interaction scenarios that previously crashed due to OOM to run successfully.

API and Usage Example

No API changes were made. The fix is internal to memory management.

# No API changes or new usage examples.
# The existing multi-turn interaction examples should now run without OOM.

Design & Code Changes

High-Level Design:
The fix clears the GPU memory cache (get_torch_device().empty_cache(), which corresponds to torch.cuda.empty_cache() on CUDA devices) before and after the memory-intensive operations in the SGLang rollout worker. This leaves enough free memory for SGLang's internal memory management (e.g., resume_memory_occupation) and prevents the buildup of cached memory that leads to OOM.

Specific Changes:
Added get_torch_device().empty_cache() calls in verl/workers/rollout/sglang_rollout/sglang_rollout.py at the following points:

- Before and after interaction.generate_response: clears the cache around the main interaction processing loop.
- Before and after tool processing: around asyncio.gather(*tool_reward_tasks).
- Before and after tool creation: around asyncio.gather(*tool_creation_coroutines) in _handle_pending_state.
- Before and after interaction initialization: around interaction.start_interaction in _handle_pending_state.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)


Co-authored-by: t.ianzhencong <t.ianzhencong@gmail.com>

MetalMedicine · 2025-08-08T22:23:15Z

Is this commit missing this?:
from verl.utils.device import get_torch_device

Add memory cache clearing to prevent OOM in multi-turn interactions

97bde1c

Co-authored-by: t.ianzhencong <t.ianzhencong@gmail.com>

tianzhencong marked this pull request as ready for review August 4, 2025 07:02

tianzhencong wants to merge 1 commit into main from cursor/fix-oom-on-first-iteration-in-resharding-674d
