
Commit 0f7e8e8

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into ashors/gpt-oss-tot
2 parents: aaa1c12 + 8762f57

46 files changed: 1011 additions & 340 deletions


.gitmodules

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,12 +1,12 @@
 [submodule "3rdparty/Megatron-LM"]
     path = 3rdparty/Megatron-LM-workspace/Megatron-LM
-    url = https://github.com/ashors1/Megatron-LM.git
-    branch = gpt-oss-tot2
+    url = https://github.com/terrykong/Megatron-LM.git
+    branch = ashors/dev-with-gpt-oss
     shallow = true
 [submodule "3rdparty/Megatron-Bridge"]
     path = 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
     url = https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
-    branch = main
+    branch = ashors/gpt-oss-tot
     shallow = true
 [submodule "3rdparty/Automodel-workspace/Automodel"]
     path = 3rdparty/Automodel-workspace/Automodel
Submodule Megatron-Bridge updated 70 files
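Since this commit re-points both submodule URLs and tracked branches, an existing checkout needs a resync before the new pointers take effect. A minimal sketch using standard git commands, run from the repository root:

```bash
# Propagate the updated .gitmodules URLs into .git/config, then check out
# the submodule commits recorded by this merge.
git submodule sync --recursive
git submodule update --init --recursive
```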

3rdparty/Megatron-Bridge-workspace/setup.py

Lines changed: 3 additions & 4 deletions
```diff
@@ -37,13 +37,12 @@
     "pyyaml>=6.0.2",
     "tqdm>=4.67.1",
     "hydra-core>1.3,<=1.3.2",
+    "megatron-core[dev,mlm]>=0.15.0a0,<0.16.0",
     "qwen-vl-utils",
-    "causal-conv1d",
+    "transformer-engine[pytorch]>=2.9.0a0,<2.10.0",
     "mamba-ssm",
-    "megatron-core[dev,mlm]>=0.15.0a0,<0.16.0",
     "nvidia-resiliency-ext",
-    "transformer-engine[pytorch]>=2.9.0a0,<2.10.0",
-    "transformers>=4.57.1",
+    "causal-conv1d",
 ]

 # If the bridge source exists, compare cached dependencies with the submodule's pyproject
```
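The surviving context line above hints at a consistency check between these cached pins and the submodule's own metadata. A minimal sketch of what such a comparison could look like (paths, names, and logic are illustrative assumptions, not the repository's actual code):

```python
import tomllib  # stdlib TOML parser, Python 3.11+
from pathlib import Path

# Hypothetical subset of the pins cached in setup.py above.
CACHED_DEPS = {
    "megatron-core[dev,mlm]>=0.15.0a0,<0.16.0",
    "transformer-engine[pytorch]>=2.9.0a0,<2.10.0",
}

bridge_pyproject = Path(
    "3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/pyproject.toml"
)
if bridge_pyproject.exists():
    project = tomllib.loads(bridge_pyproject.read_text())["project"]
    drift = CACHED_DEPS - set(project.get("dependencies", []))
    if drift:
        print(f"Cached pins not found in the submodule's pyproject: {drift}")
```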

README.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -8,7 +8,7 @@
 DAPO extends GRPO with **Clip-Higher**, **Dynamic Sampling**, **Token-Level Policy Gradient Loss**, and **Overlong Reward Shaping** for more stable and efficient RL training. See the [DAPO guide](docs/guides/dapo.md) for more details.
 * [9/30/2025] [Accelerated RL on GCP with NeMo RL!](https://discuss.google.dev/t/accelerating-reinforcement-learning-on-google-cloud-using-nvidia-nemo-rl/269579/4)
 * [9/27/2025] [FP8 Quantization in NeMo RL](https://github.com/NVIDIA-NeMo/RL/discussions/1216)
-* [9/25/2025] On-policy Distillation (Qwen3-style)
+* [9/25/2025] On-policy Distillation
   * Student generates on-policy sequences and aligns logits to a larger teacher via KL, achieving near-larger-model quality at lower cost than RL. See [On-policy Distillation](#on-policy-distillation).

 <details>
@@ -71,12 +71,12 @@ For detailed information on backend selection, configuration, and examples, see
 - 🔜 **Megatron Bridge Integration** - Integrate Megatron Bridge to enable training features from Megatron Core.
 - 🔜 **NeMo Automodel Integration** - Integrate NeMo Automodel to power our DTensor path.
 - 🔜 **New Models** - gpt-oss.
-- 🔜 **Expand Algorithms** - DAPO, GSPO, On-policy Distillation.
+- 🔜 **Expand Algorithms** - DAPO, GSPO.
 - 🔜 **GB200** - Add container support for GB200.
 - ✅ **Distributed Training** - Ray-based infrastructure.
 - ✅ **Environment Support and Isolation** - Support for multi-environment training and dependency isolation between components.
 - ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
-- ✅ **Learning Algorithms** - GRPO/GSPO, SFT, and DPO.
+- ✅ **Learning Algorithms** - GRPO/GSPO, SFT, DPO, and On-policy distillation.
 - ✅ **Multi-Turn RL** - Multi-turn generation and training for RL with tool use, games, etc.
 - ✅ **Advanced Parallelism with DTensor** - PyTorch FSDP2, TP, CP, and SP for efficient training.
 - ✅ **Larger Model Support with Longer Sequences** - Performant parallelisms with Megatron Core (TP/PP/CP/SP/EP/FSDP).
```

docs/guides/async-grpo.md

Lines changed: 9 additions & 0 deletions
````diff
@@ -41,6 +41,8 @@ grpo:
   async_grpo:
     enabled: true
     max_trajectory_age_steps: 1 # Maximum age, in training steps, for trajectories
+    in_flight_weight_updates: false # Enable for faster weight synchronization
+    recompute_kv_cache_after_weight_updates: false # Invalidates the KV cache after in-flight weight updates
 ```

 ### Complete Example Config
@@ -65,6 +67,8 @@ grpo:
   async_grpo:
     enabled: true
     max_trajectory_age_steps: 1
+    in_flight_weight_updates: false # Enable for faster weight synchronization
+    recompute_kv_cache_after_weight_updates: false # Invalidates the KV cache after in-flight weight updates

 cluster:
   num_nodes: 2
@@ -158,6 +162,11 @@ sequenceDiagram

 3. **Resource Allocation**: Ensure sufficient GPU memory for both the training and generation clusters

+4. **In-Flight Weight Updates**: Enable `in_flight_weight_updates: true` when using `async_engine: true` to update the vLLM engine's weights during generation. This avoids stalling the training pipeline until the longest generation finishes and provides significant performance benefits.
+
+5. **Recompute KV Cache After Weight Updates**: When using in-flight weight updates, you can choose whether to recompute KV caches after a weight update via the `recompute_kv_cache_after_weight_updates` setting.
+
 ## Why Importance Sampling Correction Is Required for Async

 ### The GRPO Objective
````
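Taken together, tips 4 and 5 suggest a configuration like the sketch below when the async vLLM engine is in use (only the `grpo.async_grpo` keys come from the docs above; `async_engine` is enabled in the generation config, whose exact key path is not shown here):

```yaml
grpo:
  async_grpo:
    enabled: true
    max_trajectory_age_steps: 1
    in_flight_weight_updates: true                 # push fresh weights into vLLM mid-generation
    recompute_kv_cache_after_weight_updates: true  # drop KV entries computed with stale weights
```

Leaving `recompute_kv_cache_after_weight_updates: false` keeps cached prefixes that were computed with the pre-update weights, avoiding the recompute cost at the price of some staleness.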

docs/guides/grpo.md

Lines changed: 4 additions & 5 deletions
```diff
@@ -28,7 +28,7 @@ In this guide, we'll walk through how we handle:

 We support training with multiple RL "Environments" at the same time.

-An [Environment](../../nemo_rl/environments/interfaces.py) is an object that accepts a state/action history and returns an update state and rewards for the step. They run as Ray Remote Actors. Example [MathEnvironment](../../nemo_rl/environments/math_environment.py).
+An [Environment](../../nemo_rl/environments/interfaces.py) is an object that accepts a state/action history and returns an updated state and rewards for the step. They run as Ray Remote Actors. Example [MathEnvironment](../../nemo_rl/environments/math_environment.py).

 To support this, we need to know:
```

```diff
@@ -163,9 +163,8 @@ L(\theta) = E_t \Big[ \max \Big( \min \big(r_t(\theta) A_t, \text{clip}(r_t(\the
 $$

 where:
-- c is the dual-clip parameter (ratio_clip_c), which must be greater than 1 and is
-usually set as 3 empirically
-- $r_t(\theta)$ is the ratio $\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$ that measures how much the policy has change
+- c is the dual-clip parameter (ratio_clip_c), which must be greater than 1 and is usually set as 3 empirically
+- $r_t(\theta)$ is the ratio $\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$ that measures how much the policy has changed

 ### Improvements to the GRPO Loss Formulation for Stability and Accuracy
```
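For readers who parse objectives more easily as code, here is a minimal per-token sketch of the dual-clip objective above, following the standard convention that the extra `max` with $c A_t$ applies only when the advantage is negative (tensor names and default values are illustrative, not NeMo RL's actual implementation):

```python
import torch

def dual_clip_objective(logprobs, old_logprobs, advantages,
                        eps=0.2, ratio_clip_c=3.0):
    """Per-token dual-clip objective; higher is better (negate for a loss)."""
    ratio = torch.exp(logprobs - old_logprobs)  # r_t(theta)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Standard PPO clipped surrogate.
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    # Dual clip: for negative advantages, floor the surrogate at c * A_t so an
    # exploding ratio cannot produce an unbounded update (requires c > 1).
    return torch.where(
        advantages < 0,
        torch.max(surrogate, ratio_clip_c * advantages),
        surrogate,
    )
```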

````diff
@@ -279,7 +278,7 @@ We observed a case where vLLM assigned a disproportionately high probability to
 logp_gen (from vLLM): -5.xxx
 logp_policy (from Mcore): -15.xxx
 ```
-Assuming other tokens have near-zero divergence, this single token's metrics are:
+Assuming other tokens have near-zero divergence, this single token's metrics with `kl_type=k3` are:

 * `gen_kl_error`: exp(-15 + 5) - (-15 + 5) - 1 ≈ 9 (moderate mismatch)
 * `policy_kl_error`: exp(-5 + 15) - (-5 + 15) - 1 ≈ 22,015 (severe mismatch dominating the metric)
````
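Both bullets are the k3 estimator term `exp(Δ) - Δ - 1` with the log-probability gap taken in opposite directions; a quick sketch reproduces the numbers (values from the example above):

```python
import math

def k3_term(delta: float) -> float:
    # Per-token k3 KL estimator term: exp(delta) - delta - 1 >= 0.
    return math.exp(delta) - delta - 1.0

logp_gen, logp_policy = -5.0, -15.0     # the single mismatched token above
print(k3_term(logp_policy - logp_gen))  # gen_kl_error    ~ 9.0
print(k3_term(logp_gen - logp_policy))  # policy_kl_error ~ 22015.5
```

The exponential term is what lets a single badly mismatched token dominate `policy_kl_error`.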

docs/nsys-profiling.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -100,7 +100,8 @@ To analyze the generated profile files, load the `.nsys-rep` files into the NVID

 Nsight Systems supports [multi-report view](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#viewing-multiple-reports-in-the-same-timeline) functionality. If you open the profiles from different workers (e.g., `*policy_worker*.nsys-rep` and `*generation_worker*.nsys-rep`) in a single multi-report view, you can analyze the behavior of the end-to-end RL loop on the same timeline.

-<img src="assets/nsys-multi-report-view.png" alt="Pretraining loss curves" width="1000"/>
+
+![Nsys multi report view](./assets/nsys-multi-report-view.png)

 ## How We Patched Nsight Support in Ray
```
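As a small aid for assembling such a multi-report session, the per-worker reports can be located with a glob before opening them together in the GUI (output paths are illustrative and depend on your profiling setup):

```bash
# Find the per-role reports produced by a profiled run, then open them
# together in the Nsight Systems GUI to place them on one timeline.
ls ./profile_output/*policy_worker*.nsys-rep \
   ./profile_output/*generation_worker*.nsys-rep
```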

examples/configs/distillation_math.yaml

Lines changed: 0 additions & 1 deletion
```diff
@@ -145,7 +145,6 @@ policy: &POLICY_BASE
       grad_reduce_in_fp32: false
       overlap_grad_reduce: true
       overlap_param_gather: true
-      average_in_collective: true
       use_custom_fsdp: false
       data_parallel_sharding_strategy: "optim_grads_params"

```

examples/configs/distillation_math_megatron.yaml

Lines changed: 0 additions & 1 deletion
```diff
@@ -99,7 +99,6 @@ policy: &POLICY_BASE
       grad_reduce_in_fp32: false
       overlap_grad_reduce: true
       overlap_param_gather: true
-      average_in_collective: true
       use_custom_fsdp: false
       data_parallel_sharding_strategy: "optim_grads_params"

```
