Conversation

@wenscarl wenscarl (Collaborator) commented Nov 12, 2025

Motivation

Part of #12866 (addressed in this PR). The root cause is that w13_input_scale (and likewise w2_input_scale) should not depend on logical_to_all_physical_map. In this fix, we read in every physical expert's input scale regardless of its logical expert id.
cc @kaixih @Fridge003

In more detail: w13_input_scale has shape [288, ...] (one row per physical expert), but if logical_to_all_physical_map is:

[
[286],
[287],
[2],
[3],
...,
[254],
[255]
]

then physical experts 0 and 1 never appear anywhere in the map, so their rows are never written; since w13_input_scale is initialized with torch.empty, those values are undefined.
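
Below is a minimal, self-contained sketch of this failure mode (illustrative names and stand-in values, not the actual sglang loader code): scales are written only at physical indices that appear in the map, so any unmapped slot keeps whatever garbage torch.empty left behind.

```python
import torch

num_physical_experts = 288
num_logical_experts = 256

# torch.empty returns uninitialized memory: any slot not explicitly
# written afterwards holds arbitrary garbage.
w13_input_scale = torch.empty(num_physical_experts)

# Stand-in for the per-logical-expert scales read from the checkpoint.
checkpoint_scales = torch.rand(num_logical_experts)

# The map from the example above: physical ids 0 and 1 never appear.
logical_to_all_physical_map = [[286], [287]] + [[i] for i in range(2, 256)]

# Pre-fix behavior: scales are written only at physical ids reachable
# through the logical -> physical map.
for logical_id, physical_ids in enumerate(logical_to_all_physical_map):
    for physical_id in physical_ids:
        w13_input_scale[physical_id] = checkpoint_scales[logical_id]

# Slots 0 and 1 (and every other unmapped physical id) were never
# written, so they hold undefined values that can surface as NaNs.
print(w13_input_scale[:2])
```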

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@kaixih kaixih (Collaborator) left a comment

Just had a sync with @wenscarl.

The root cause of the NaN issue wasn’t that the loader failed to map physical to logical experts correctly; rather, it was that the mapping must include all physical expert indices when sglang_require_global_experts=True. This is because each layer’s w13_input_scale is allocated based on the total number of physical experts.

For example, if there are 256 logical experts and 2 redundant experts, an incorrect logical→physical map might look like:

logical -> physical
0 -> 256
1 -> 257
2 -> 2
...
255 -> 255

The input_scales tensor would have shape (258,), and after loading it would look like this:

physical
0 -> loading nothing (wrong!)
1 -> loading nothing (wrong!)
2 -> loading logical 2
...
255 -> loading logical 255
256 -> loading logical 0
257 -> loading logical 1

After this PR, we ensure that all 258 physical experts correctly load something, which resolves the NaN issue.
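
A sketch of the fix described above, under the same assumptions (illustrative names, not the exact PR diff): iterate over every physical expert and look up the logical expert it replicates, so all 258 slots receive a defined scale. The completed physical-to-logical assignment below is one plausible arrangement consistent with the example map.

```python
import torch

num_logical, num_redundant = 256, 2
num_physical = num_logical + num_redundant  # 258

# Stand-in per-logical-expert scales from the checkpoint.
checkpoint_scales = torch.rand(num_logical)

# One complete physical -> logical assignment consistent with the map
# above: physical i replicates logical i, and the two redundant slots
# 256 and 257 replicate logical experts 0 and 1.
physical_to_logical = list(range(num_logical)) + [0, 1]

# Post-fix behavior: every physical slot loads something, so no row of
# the torch.empty-allocated tensor is left uninitialized.
input_scales = torch.empty(num_physical)
for physical_id in range(num_physical):
    input_scales[physical_id] = checkpoint_scales[physical_to_logical[physical_id]]

assert not torch.isnan(input_scales).any()  # all 258 slots are defined
```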

@wenscarl wenscarl (Collaborator, Author)

> Just had a sync with @wenscarl. […]

That's exactly my understanding.

@Fridge003 (Collaborator)

@kaixih @wenscarl Have you verified how this PR works on GB200?

@wenscarl wenscarl (Collaborator, Author)

> @kaixih @wenscarl Have you verified how this PR works on GB200?

Verified that it fixes the NaN issue when running a PD disaggregation workload.

@Fridge003 Fridge003 merged commit 7aa4439 into sgl-project:main Nov 12, 2025
141 of 154 checks passed
Fridge003 added a commit that referenced this pull request Nov 15, 2025
Qiaolin-Yu added a commit to Qiaolin-Yu/sglang that referenced this pull request Nov 15, 2025
Kangyan-Zhou added a commit that referenced this pull request Nov 16, 2025
wenscarl added a commit to wenscarl/sglang that referenced this pull request Nov 18, 2025