Fix nan in global scaling factor for large scale nvfp4 EP #13162
Conversation
This reverts commit 99e2580.
Just had a sync with @wenscarl.
The root cause of the NaN issue wasn’t that the loader failed to map physical to logical experts correctly; rather, it was that the mapping must include all physical expert indices when sglang_require_global_experts=True. This is because each layer’s w13_input_scale is allocated based on the total number of physical experts.
For example, if there are 256 logical experts and 2 redundant experts, an incorrect logical→physical map might look like:
logical -> physical
0 -> 256
1 -> 257
2 -> 2
...
255 -> 255
The input_scales tensor would have shape (258,), and after loading it would look like this:
physical
0 -> loading nothing (wrong!)
1 -> loading nothing (wrong!)
2 -> loading logical 2
...
255 -> loading logical 255
256 -> loading logical 0
257 -> loading logical 1
After this PR, we ensure that all 258 physical experts correctly load something, which resolves the NaN issue.
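Below is a minimal, self-contained Python sketch of the failure mode described above. The names (`num_physical_experts`, `logical_to_physical`, `checkpoint_scales`) and the max-based reduction are illustrative assumptions, not sglang's actual loader code.

```python
import torch

num_logical_experts = 256
num_physical_experts = 258                      # 256 logical + 2 redundant slots
# A map that sends logical 0/1 to the redundant slots and everything else to itself,
# leaving physical slots 0 and 1 unreachable (as in the example above).
logical_to_physical = {0: 256, 1: 257, **{i: i for i in range(2, 256)}}

# Per-physical-expert buffer, allocated with torch.empty like w13_input_scale.
input_scales = torch.empty(num_physical_experts)

# Buggy loading: only slots reachable through the logical map are written,
# so physical slots 0 and 1 keep whatever bits torch.empty left there.
checkpoint_scales = torch.rand(num_logical_experts) + 0.5   # stand-in for real scales
for logical_id, physical_id in logical_to_physical.items():
    input_scales[physical_id] = checkpoint_scales[logical_id]

# Any global statistic over the buffer (e.g., a max used to build a global
# scaling factor) can now pick up garbage, inf, or NaN from the two
# uninitialized entries.
print(input_scales[:2], input_scales.max())

# Behavior after the fix: every physical slot loads something, e.g. by resolving
# the redundant slots back to a logical expert, so nothing stays undefined.
physical_to_logical = {p: l for l, p in logical_to_physical.items()}
physical_to_logical.update({0: 0, 1: 1})        # redundant copies of logical 0 and 1
for physical_id in range(num_physical_experts):
    input_scales[physical_id] = checkpoint_scales[physical_to_logical[physical_id]]
print(torch.isfinite(input_scales).all())        # tensor(True)
```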
That's exactly my understanding.
…-project#13162) Co-authored-by: Kangyan-Zhou <[email protected]>
… and sgl-project#13341) (sgl-project#13348)" This reverts commit 78a4b44.
Motivation
Part of #12866 (addressed in this PR). The root cause is that `w13[2]_input_scale` should not depend on `logical_to_all_physical_map`. In this fix, we read in all physical experts' input scales regardless of their logical expert ids. cc @kaixih @Fridge003
The root cause is:
The `w13_input_scale` is of shape [288, ...], but with a `logical_to_all_physical_map` that does not cover every physical index, the 0th and 1st rows are never written. Since `w13_input_scale` is initialized by `torch.empty`, the corresponding values are undefined.
Modifications
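A rough sketch of the change in loading logic described in the motivation, using a hypothetical `load_scale_for` stand-in and an illustrative map; it shows the shape of the fix, not the actual diff.

```python
import torch

num_physical_experts = 288
# Hypothetical map that leaves physical slots 0 and 1 uncovered.
logical_to_all_physical_map = [[i] for i in range(2, num_physical_experts)]

def load_scale_for(expert_id: int) -> float:
    """Stand-in for reading one expert's input scale from the checkpoint."""
    return 1.0

w13_input_scale = torch.empty(num_physical_experts)

# Before (sketch): write only the physical ids reachable from the logical map;
# physical slots 0 and 1 are never touched and stay undefined after torch.empty.
for physical_ids in logical_to_all_physical_map:
    for physical_id in physical_ids:
        w13_input_scale[physical_id] = load_scale_for(physical_id)

# After (sketch of this PR's behavior): read a scale for every physical expert,
# with no dependence on logical_to_all_physical_map, so no entry is left undefined.
for physical_id in range(num_physical_experts):
    w13_input_scale[physical_id] = load_scale_for(physical_id)
```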
Accuracy Tests
Benchmarking and Profiling
Checklist