Refit latest update based on Ahmad's refactor #1993
The PR adds a new markdown document (161 lines):
# NCCL Timeout During CUDA Graph Warmup in MoE RL Training

## Symptom

After several successful GRPO training steps (anywhere from step 5 to step 20+), the job crashes with NCCL collective operation timeouts during the generation phase. The errors look like:

```
(MegatronPolicyWorker[rank=1] pid=151407) [rank1]:[E228 00:10:27.202494521 ProcessGroupNCCL.cpp:2057] [PG ID 12 PG GUID 99(EXPERT_MODEL_PARALLEL_GROUP) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1145527, OpType=ALLTOALL_BASE, NumelIn=206438400, NumelOut=206438400, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
(MegatronPolicyWorker[rank=1] pid=151407)
(MegatronPolicyWorker[rank=1] pid=151407) [2026-02-28 00:10:27,258 E 151407 153013] logging.cc:118: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG ID 12 PG GUID 99(EXPERT_MODEL_PARALLEL_GROUP) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1145527, OpType=ALLTOALL_BASE, NumelIn=206438400, NumelOut=206438400, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
(MegatronPolicyWorker[rank=7] pid=151429) [rank7]:[E228 00:10:27.168785312 ProcessGroupNCCL.cpp:2057] [PG ID 5 PG GUID 36(TENSOR_MODEL_PARALLEL_GROUP) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2036074, OpType=_REDUCE_SCATTER_BASE, NumelIn=3225600, NumelOut=1612800, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
(MegatronPolicyWorker[rank=7] pid=151429)
(MegatronPolicyWorker[rank=7] pid=151429) [2026-02-28 00:10:27,224 E 151429 152962] logging.cc:118: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG ID 5 PG GUID 36(TENSOR_MODEL_PARALLEL_GROUP) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2036074, OpType=_REDUCE_SCATTER_BASE, NumelIn=3225600, NumelOut=1612800, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
```

Key signatures:

- Different ranks report **different NCCL operation types** (ALLTOALL_BASE, REDUCE_SCATTER_BASE, ALLREDUCE, COALESCED) -- a collective mismatch
- The crash always happens during `cuda graph warmup` at the start of generation
- A new NCCL communicator is lazily initialized at the failing step (`NCCL version 2.27.5+cuda12.9` printed mid-warmup)
- Steps 1 through N-1 complete normally; the crash is non-deterministic
| ## Background: The Training-Inference Cycle | ||
|
|
||
| In the RL training loop, each step does: | ||
|
|
||
| ``` | ||
| generate() { | ||
| _wake() // resume inference engine (realloc KV cache, rebuild CUDA graphs) | ||
| <submit requests, run generation> | ||
| _sleep() // suspend inference engine (dealloc KV cache, delete CUDA graphs) | ||
| } | ||
| ``` | ||
|
|
||
| With `static_kv_memory_pointers=false` and `kv_cache_management_mode=recompute`, every suspend/resume cycle **destroys and recreates CUDA graphs**. Graph warmup runs forward passes through the model, which for MoE models includes NCCL alltoall collectives across expert-parallel (EP) ranks. All EP/TP ranks must execute the same sequence of NCCL operations in lockstep during this warmup. | ||
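
To make "lockstep" concrete, here is a minimal sketch (illustrative, not code from this PR; `moe_dispatch` and `ep_group` are hypothetical names) of the kind of collective each warmup forward pass issues. If any EP rank never reaches the call, the remaining ranks block inside NCCL until the watchdog fires:

```python
import torch
import torch.distributed as dist

def moe_dispatch(tokens: torch.Tensor, ep_group: dist.ProcessGroup) -> torch.Tensor:
    """Hypothetical stand-in for MoE token dispatch inside graph warmup."""
    out = torch.empty_like(tokens)
    # Collective call: every rank in ep_group must execute it, in the same
    # order, or the ranks that did call it hang until the NCCL watchdog
    # (Timeout(ms)=600000 in the logs above) kills the process.
    dist.all_to_all_single(out, tokens, group=ep_group)
    return out
```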

## Architecture: Two Communication Systems on One Event Loop

The `DynamicInferenceEngine` runs an async engine loop on a dedicated event loop thread. This single event loop handles two different communication systems:

| System | Purpose | Mechanism |
|--------|---------|-----------|
| **EP consensus** (`_ep_group_has_work`) | Coordinate EP ranks on work availability | Async ZMQ all-reduce |
| **CUDA graph warmup** (inside `resume()`) | Capture model forward passes into graphs | Blocking NCCL collectives |

Both run on the **same event loop thread**. This is the root of the problem.

## Root Cause

The engine loop has this structure (simplified from `run_engine_with_coordinator`):

```python
while True:
    self.schedule_requests()                                # read ZMQ messages
    ep_group_has_work = await self._ep_group_has_work(...)  # ZMQ all-reduce across EP ranks
    if not ep_group_has_work:
        if self.suspend_signal:
            self.suspend()  # no-op when already suspended
        else:
            self.resume()   # CUDA graph warmup -- blocks with NCCL!
    await asyncio.sleep(0.02)
```

When the coordinator sends `RESUME + UNPAUSE` to all engines, the signals arrive asynchronously. EP ranks process them at different times depending on ZMQ delivery and event loop scheduling. This leads to a **divergence**:

```
Rank A (received RESUME):  suspend_signal=False --> calls resume() --> NCCL alltoall BLOCKS event loop
Rank B (not yet received): suspend_signal=True  --> calls suspend() (no-op) --> sleeps 20ms
```

On the next iteration, Rank B calls `_ep_group_has_work()`, which does an async ZMQ all-reduce. This requires Rank A to respond. But Rank A's event loop is **blocked inside NCCL** (graph warmup forward pass). Rank A can never respond to ZMQ while NCCL is blocking its event loop.

**Deadlock: Rank A waits for Rank B in NCCL. Rank B waits for Rank A in ZMQ.**

After 10 minutes, the NCCL watchdog times out and kills the process.
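
The starvation half of the deadlock can be reproduced without NCCL or ZMQ at all. A minimal standalone sketch (all names hypothetical): any synchronous blocking call made on the event loop thread prevents every coroutine, including a would-be ZMQ responder, from being scheduled:

```python
import asyncio
import time

async def ep_consensus_responder():
    # Stands in for the async ZMQ all-reduce handler: it only runs when
    # the event loop is free to schedule coroutines.
    while True:
        print("answering EP consensus")
        await asyncio.sleep(0.02)

def blocking_warmup():
    # Stands in for engine.resume(): a synchronous call that holds the
    # event loop thread for its entire duration.
    time.sleep(3)  # no coroutine runs during these 3 seconds

async def engine_loop():
    asyncio.ensure_future(ep_consensus_responder())
    await asyncio.sleep(0.1)  # responder prints a few times here
    blocking_warmup()         # responder is starved: no output for 3 s
    await asyncio.sleep(0.1)  # responder only resumes now

asyncio.run(engine_loop())
```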

### Why it's non-deterministic

The deadlock only occurs when at least one EP rank enters `resume()` before all other EP ranks have received the `RESUME` signal. When all ranks happen to process the signals within the same ~20 ms engine loop cycle, they all enter `resume()` together and the warmup succeeds. This timing depends on ZMQ delivery, event loop scheduling, and OS thread scheduling -- hence the non-determinism.

### Implementation 1 (causes a ~25% slowdown)

```python
def _wake(self):
    # Phase 1: Unpause the engine loop (async, event loop stays free for ZMQ)
    asyncio.run_coroutine_threadsafe(self._unpause_engine(), self._inference_loop).result()

    # Phase 2: Synchronized resume on the main thread
    self._synchronized_resume()

async def _unpause_engine(self):
    # Send only UNPAUSE (not RESUME) -- keeps suspend_signal=True so the engine
    # loop never calls resume() on its own
    if torch.distributed.get_rank() == 0:
        self.inference_client.unpause_engines()
    await self.dynamic_inference_engine.running.wait()

def _synchronized_resume(self):
    engine = self.dynamic_inference_engine

    # Guard: replace suspend() with a no-op while we resume
    original_suspend = engine.suspend
    engine.suspend = lambda: None

    try:
        torch.distributed.barrier()    # all ranks ready
        engine.resume()                # CUDA graph warmup (NCCL collectives)
        engine.suspend_signal = False  # let engine loop transition to normal mode
        torch.distributed.barrier()    # all ranks done
    finally:
        engine.suspend = original_suspend
```

### Why this works

**No event-loop blocking.** The NCCL barriers and `resume()` run on the main thread. The event loop thread continues running the engine loop, freely handling ZMQ communication for EP consensus. No rank's ZMQ is ever starved.

**No RESUME signal divergence.** We never send the `RESUME` header to the coordinator. Instead, we send only `UNPAUSE` (which restarts the engine loop) and keep `suspend_signal=True`. The engine loop sees `suspend_signal=True`, calls `suspend()` (a no-op since it is already suspended), and idles. It never calls `resume()` on its own. We control exactly when `resume()` happens -- after the barrier on the main thread.

**No thread-safety race.** When `resume()` runs on the main thread, it sets `is_suspended=False`. Without the guard, the engine loop's next `suspend()` call (on the event loop thread) would read `is_suspended=False`, enter the suspend body, and **deallocate buffers while the main thread is still creating CUDA graphs**. The `suspend()` guard (replacing it with `lambda: None`) prevents this. The guard is removed only after `suspend_signal=False` is set, so the engine loop transitions to calling `resume()` (a no-op since we already resumed) instead of `suspend()`.
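
For illustration, a standalone sketch (hypothetical `Engine` class, not PR code) of the race and the guard: the poller thread plays the engine loop, and monkey-patching `suspend` keeps it from deallocating while `resume()` is mid-capture:

```python
import threading
import time

class Engine:
    """Hypothetical stand-in for the engine's suspend/resume state."""

    def __init__(self):
        self.is_suspended = True

    def suspend(self):
        # Race window: resume() on the main thread flips is_suspended
        # before warmup finishes, so an unguarded call here would tear
        # down buffers the other thread is still using.
        if not self.is_suspended:
            print("BUG: deallocating buffers while warmup is running!")
            self.is_suspended = True

    def resume(self):
        self.is_suspended = False
        time.sleep(0.1)  # stands in for CUDA graph capture
        print("resume(): graphs created")

engine = Engine()

def engine_loop_poller():
    # Stands in for the event loop thread calling suspend() every cycle.
    for _ in range(10):
        engine.suspend()
        time.sleep(0.02)

original_suspend = engine.suspend
engine.suspend = lambda: None  # install the guard: poller hits a no-op

poller = threading.Thread(target=engine_loop_poller)
poller.start()
try:
    engine.resume()  # safe: concurrent suspend() calls cannot dealloc
finally:
    engine.suspend = original_suspend
poller.join()
```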

### Thread interaction timeline

```
Main Thread                            Event Loop Thread (engine loop)
-----------                            --------------------------------
_unpause_engine() --sends UNPAUSE-->
                                       schedule_requests(): reads UNPAUSE
                                       _ep_group_has_work(): ZMQ all-reduce
                                       suspend_signal=True -> suspend() [no-op]
                                       asyncio.sleep(0.02)

barrier() ......all ranks sync......   (ZMQ continues running freely)

engine.suspend = no-op                 suspend() -> no-op [guarded]
engine.resume()                        (ZMQ continues, no GPU conflict)
  -> reinitialize buffers
  -> create_cuda_graphs() [NCCL]
engine.suspend_signal = False          (ZMQ continues)

barrier() ......all ranks done......

engine.suspend = original              suspend_signal=False -> resume() [no-op]
                                       (engine is ready for requests)
```

## Affected Configuration

This bug affects MoE models using the megatron generation backend with:

- `moe_token_dispatcher_type=alltoall` (NCCL alltoall inside CUDA graph warmup)
- `static_kv_memory_pointers=false` (CUDA graphs deleted/recreated each cycle)
- `kv_cache_management_mode=recompute` (full dealloc on suspend)
- `num_cuda_graphs > 0`
- Expert parallelism (EP) > 1

Dense models, or configs with `static_kv_memory_pointers=true`, are not affected because CUDA graphs are not recreated on resume.
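
For illustration, the affected combination collected in one place. The key names come from the list above, but the example values, the flat layout, and the `expert_model_parallel_size` key name are assumptions, not the actual NeMo RL config schema:

```python
# Hypothetical flat view of an affected configuration; real configs nest
# these keys under their respective generation/parallelism sections.
affected_config = {
    "moe_token_dispatcher_type": "alltoall",  # NCCL alltoall inside warmup
    "static_kv_memory_pointers": False,       # graphs destroyed/recreated each cycle
    "kv_cache_management_mode": "recompute",  # full dealloc on suspend
    "num_cuda_graphs": 8,                     # any value > 0
    "expert_model_parallel_size": 2,          # any EP > 1 (key name assumed)
}
```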

### Implementation 2

In `dynamic_engine.py`, set the engine loop's `asyncio.sleep(0.02)` to `asyncio.sleep(0)`. This also works, presumably because without the 20 ms idle window all EP ranks observe the coordinator's signals within the same (now very short) loop cycle and enter `resume()` together, closing the divergence window described above.
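
A standalone sketch (not PR code) of the difference: `await asyncio.sleep(0)` still yields control, so other coroutines and pending callbacks run every cycle, but the loop notices new signals immediately instead of up to 20 ms later:

```python
import asyncio
import time

async def poll_signals(delay: float) -> float:
    # Measures how long the polling loop takes to notice a flag that
    # another task sets (a stand-in for ZMQ delivering RESUME).
    state = {"resume": False}

    async def deliver_resume():
        await asyncio.sleep(0.001)
        state["resume"] = True

    asyncio.ensure_future(deliver_resume())
    start = time.perf_counter()
    while not state["resume"]:
        await asyncio.sleep(delay)  # 0.02 -> up to a full cycle of latency; 0 -> next pass
    return time.perf_counter() - start

async def main():
    slow = await poll_signals(0.02)
    fast = await poll_signals(0)
    print(f"sleep(0.02): signal noticed after {slow * 1e3:.2f} ms")
    print(f"sleep(0):    signal noticed after {fast * 1e3:.2f} ms")

asyncio.run(main())
```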
The PR also adds a new Python file (19 lines):
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_rl.models.generation.megatron.megatron_generation import (
    MegatronGeneration,
)

__all__ = ["MegatronGeneration"]
```

**Review comment (Contributor), on lines +1 to +13:** Update the copyright year to 2026. This new file should use the current year per repo rules.

Proposed fix:

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines: `{src/,examples/,nemo_rl/**}/*.{py,sh}: Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests`.
**Review comment:** Non-colocated Megatron path is blocked by an earlier assert. This PR adds a non-colocated Megatron initialization path here, but the earlier guard in the non-colocated cluster setup still asserts `backend != "megatron"`, so this branch is unreachable. Remove or relax that guard to enable the new feature. (A proposed fix removing the legacy guard was attached.)