Conversation

@wenscarl wenscarl (Collaborator) commented Nov 12, 2025

Motivation

Part of #12866 (addressed in this PR). The root cause is that w13_input_scale (and likewise w2_input_scale) should not depend on logical_to_all_physical_map. In this fix, we read in every physical expert's input scale regardless of its logical expert id.
cc @kaixih @Fridge003

In more detail: w13_input_scale has shape [288, ...] (one row per physical expert), but if logical_to_all_physical_map is:

[
[286],
[287],
[2],
[3],
...,
[254],
[255]
]

then physical experts 0 and 1 never appear anywhere in the map, so their rows are never written; since w13_input_scale is initialized with torch.empty, those values are undefined.
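
Below is a minimal, self-contained sketch of this failure mode (illustrative names and stand-in values, not the actual sglang loader code): scales are written only at physical indices that appear in the map, so any unmapped slot keeps whatever garbage torch.empty left behind.

```python
import torch

num_physical_experts = 288
num_logical_experts = 256

# torch.empty returns uninitialized memory: any slot not explicitly
# written afterwards holds arbitrary garbage.
w13_input_scale = torch.empty(num_physical_experts)

# Stand-in for the per-logical-expert scales read from the checkpoint.
checkpoint_scales = torch.rand(num_logical_experts)

# The map from the example above: physical ids 0 and 1 never appear.
logical_to_all_physical_map = [[286], [287]] + [[i] for i in range(2, 256)]

# Pre-fix behavior: scales are written only at physical ids reachable
# through the logical -> physical map.
for logical_id, physical_ids in enumerate(logical_to_all_physical_map):
    for physical_id in physical_ids:
        w13_input_scale[physical_id] = checkpoint_scales[logical_id]

# Slots 0 and 1 (and every other unmapped physical id) were never
# written, so they hold undefined values that can surface as NaNs.
print(w13_input_scale[:2])
```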

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@kaixih kaixih (Collaborator) left a comment

Just had a sync with @wenscarl.

The root cause of the NaN issue wasn’t that the loader failed to map physical to logical experts correctly; rather, it was that the mapping must include all physical expert indices when sglang_require_global_experts=True. This is because each layer’s w13_input_scale is allocated based on the total number of physical experts.

For example, if there are 256 logical experts and 2 redundant experts, an incorrect logical→physical map might look like:

logical -> physical
0 -> 256
1 -> 257
2 -> 2
...
255 -> 255

The input_scales tensor would have shape (258,), and after loading it would look like this:

physical
0 -> loading nothing (wrong!)
1 -> loading nothing (wrong!)
2 -> loading logical 2
...
255 -> loading logical 255
256 -> loading logical 0
257 -> loading logical 1

After this PR, we ensure that all 258 physical experts correctly load something, which resolves the NaN issue.
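
A sketch of the fix described above, under the same assumptions (illustrative names, not the exact PR diff): iterate over every physical expert and look up the logical expert it replicates, so all 258 slots receive a defined scale. The completed physical-to-logical assignment below is one plausible arrangement consistent with the example map.

```python
import torch

num_logical, num_redundant = 256, 2
num_physical = num_logical + num_redundant  # 258

# Stand-in per-logical-expert scales from the checkpoint.
checkpoint_scales = torch.rand(num_logical)

# One complete physical -> logical assignment consistent with the map
# above: physical i replicates logical i, and the two redundant slots
# 256 and 257 replicate logical experts 0 and 1.
physical_to_logical = list(range(num_logical)) + [0, 1]

# Post-fix behavior: every physical slot loads something, so no row of
# the torch.empty-allocated tensor is left uninitialized.
input_scales = torch.empty(num_physical)
for physical_id in range(num_physical):
    input_scales[physical_id] = checkpoint_scales[physical_to_logical[physical_id]]

assert not torch.isnan(input_scales).any()  # all 258 slots are defined
```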

@wenscarl wenscarl (Collaborator, Author)

> Just had a sync with @wenscarl. […]

That's exactly my understanding.

@Fridge003 (Collaborator)

@kaixih @wenscarl Have you verified how this PR works on GB200?

@wenscarl wenscarl (Collaborator, Author)

> @kaixih @wenscarl Have you verified how this PR works on GB200?

Verified that it fixes the NaN issue when running a PD disaggregation workload.

@Fridge003 Fridge003 merged commit 7aa4439 into sgl-project:main Nov 12, 2025
141 of 154 checks passed
Fridge003 added a commit that referenced this pull request Nov 15, 2025
Qiaolin-Yu added a commit to Qiaolin-Yu/sglang that referenced this pull request Nov 15, 2025
Kangyan-Zhou added a commit that referenced this pull request Nov 16, 2025
wenscarl added a commit to wenscarl/sglang that referenced this pull request Nov 18, 2025