Closed
Description
Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
This issue tracks the ongoing work originating from NVIDIA.
- The current EPLB implementation is broken because the first dimension of the global scaling factor does not match num_local_expert. As a temporary workaround, see "Fix EPLB + FP4 Quantization Compatibility Issue" (#13715). @shifangx @wenscarl
- NVIDIA has proposed a Linear Programming (LP)-based expert-parallelism algorithm. The integration will proceed in two stages:
a) Integrate the LP kernel into FlashInfer. @feliang-git
b) Integrate the FlashInfer operator into sglang.
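To make the shape mismatch from the EPLB item above concrete, here is a minimal sketch of one way such a mismatch can arise. It assumes the scaling factors are stored per logical expert while each rank hosts a set of physical experts that may include redundant replicas. All names (`gather_local_scales`, `physical_to_logical`, `local_physical_ids`) and all values are hypothetical and are not taken from sglang, FlashInfer, or #13715.

```python
# Hypothetical illustration only; these names are not sglang or FlashInfer APIs.

def gather_local_scales(global_scales, physical_to_logical, local_physical_ids):
    """Select one scale per local physical expert from a per-logical-expert list."""
    return [global_scales[physical_to_logical[p]] for p in local_physical_ids]

# Assumption: 4 logical experts, each with one made-up global scaling factor.
global_scales = [0.5, 0.25, 0.125, 1.0]

# Assumption: an EPLB-style placement adds 2 redundant replicas,
# giving 6 physical experts that map back to logical expert ids.
physical_to_logical = [0, 1, 2, 3, 0, 2]

# Assumption: this rank hosts physical experts 3, 4 and 5.
local_physical_ids = [3, 4, 5]
num_local_expert = len(local_physical_ids)

# A tensor indexed by logical expert has first dimension 4, not 3.
assert len(global_scales) != num_local_expert

# Remapping through the placement yields one scale per local expert.
local_scales = gather_local_scales(global_scales, physical_to_logical, local_physical_ids)
assert len(local_scales) == num_local_expert
print(local_scales)
```

Under these assumptions, remapping through the placement table gives a list whose length equals num_local_expert. Whether the actual workaround takes this approach is not stated in this issue.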
Related resources
No response