feat: Support Ray Compiled Graph and NCCL-optimized data transfer for SFT #1612
What does this PR do?
Adds Ray Compiled Graph support and NCCL-optimized data transfer for SFT
TODO:
This PR introduces significant performance optimizations and infrastructure improvements to the training pipeline:
Key Features
🚀 Ray Compiled Graph Support
Implements Ray's compiled DAG execution for distributed training with support for:
Enable with:
export NEMO_RL_USE_COMPILED_GRAPH=1
Reduces data transfer overhead by:
Enable with:
export NEMO_RL_OPTIMIZE_DATA_TRANSFER=1
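As a minimal sketch of how such opt-in flags could be read on the Python side: the helper name `flag_enabled` and the "set to `1` means enabled" convention are assumptions for illustration, not the PR's actual parsing code.

```python
import os


def flag_enabled(name, environ=None):
    """Return True when the variable is set to "1".

    Hypothetical helper: the PR may parse these variables differently.
    """
    env = os.environ if environ is None else environ
    return env.get(name, "0") == "1"


# Both features are treated as off unless explicitly exported as 1.
use_compiled_graph = flag_enabled("NEMO_RL_USE_COMPILED_GRAPH")
optimize_data_transfer = flag_enabled("NEMO_RL_OPTIMIZE_DATA_TRANSFER")
print(use_compiled_graph, optimize_data_transfer)
```

Passing an explicit mapping as `environ` makes the check easy to exercise in tests without touching the real process environment.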
Technical Changes
Architecture Changes
Wrapper Pattern for Ray Actors:
@ray.remote decoration on worker classes
NeMoRayWorkerWrapper class (vLLM-style architecture)
worker.execute_method.remote(method_name, *args, **kwargs)
Compiled Graph Execution:
CompiledGraphExecutor: Single DP shard DAG execution
MultiDPCompiledGraphExecutor: Multiple independent DAGs (one per DP shard)
CompiledGraphWorkerGroup: Drop-in replacement wrapper for RayWorkerGroup
Worker Implementations
Updated all policy workers with NCCL broadcast support:
MegatronPolicyWorker: Added tensor reconstruction and NCCL broadcast
DTensorPolicyWorker: Added NCCL broadcast via DeviceMesh groups
DTensorPolicyWorkerV2: Added NCCL broadcast support
Additional Improvements
Warmup Support: Compile graphs with max sequence length before training
export NEMO_RL_WARMUP_COMPILED_GRAPH=1
export NEMO_RL_WARMUP_SEQ_LEN=8192 # Optional override
Nsys Profile Naming: Windows-compatible safe filenames (replaces : with _)
Ray Log Sync: Enabled by default with 30s frequency
Improved Shutdown: Graceful compiled DAG teardown with suppressed Ray logging
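The wrapper pattern listed under Architecture Changes can be illustrated with a small Ray-free sketch. The class names `ToyPolicyWorker` and `WorkerWrapper` are hypothetical stand-ins (the PR's class is `NeMoRayWorkerWrapper`, whose real implementation is not shown here). The sketch assumes the wrapper, rather than the worker class, would be the Ray actor, so a call would go through `execute_method.remote(...)` instead of the direct call below.

```python
class ToyPolicyWorker:
    """Plain worker class with no @ray.remote decoration (hypothetical)."""

    def __init__(self, rank):
        self.rank = rank

    def train_step(self, batch, scale=1.0):
        # Toy "loss": scaled mean of the batch values.
        return {"rank": self.rank, "loss": scale * sum(batch) / len(batch)}


class WorkerWrapper:
    """Owns a worker instance and dispatches calls by method name.

    Assumption: in a Ray deployment this class would be wrapped with
    ray.remote, and callers would use execute_method.remote(...).
    """

    def __init__(self, worker_cls, *args, **kwargs):
        self.worker = worker_cls(*args, **kwargs)

    def execute_method(self, method_name, *args, **kwargs):
        method = getattr(self.worker, method_name, None)
        if not callable(method):
            raise AttributeError(f"worker has no callable method {method_name!r}")
        return method(*args, **kwargs)


wrapper = WorkerWrapper(ToyPolicyWorker, rank=0)
result = wrapper.execute_method("train_step", [1.0, 2.0, 3.0], scale=2.0)
print(result)
```

Routing every call through a single entry point keeps the worker classes free of Ray-specific decoration, which is presumably what makes it possible to bind the same entry point into a compiled DAG.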
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information