
Conversation

@michelemarzollo
Contributor

@michelemarzollo michelemarzollo commented Nov 19, 2025

Motivation

In PD disaggregation, the buffers used to move the draft model's hidden states are currently allocated based on the target model's hidden state size. This only works if the target and draft models have the same hidden size.
For Llama3-8B the sizes match (so the bug is invisible), but if I try to run https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct with https://huggingface.co/yuhuili/EAGLE3-LLaMA3.3-Instruct-70B I get the following output:

[2025-11-19 14:49:50 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2746, in run_scheduler_process
    scheduler.event_loop_normal_disagg_prefill()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 343, in event_loop_normal_disagg_prefill
    self.process_batch_result_disagg_prefill(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 460, in process_batch_result_disagg_prefill
    self.send_kv_chunk(req, last_chunk=True)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 652, in send_kv_chunk
    self.disagg_metadata_buffers.set_buf(req)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/utils.py", line 222, in set_buf
    self.output_hidden_states[req.metadata_buffer_index].copy_(
RuntimeError: The size of tensor a (8192) must match the size of tensor b (6144) at non-singleton dimension 0

[2025-11-19 14:49:50] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
[2025-11-19 14:49:50 TP1] Scheduler hit an exception: (same traceback as TP0)

Here 8192 is the target model's hidden size and 6144 is the draft model's hidden size.
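
The failure is easy to reproduce in isolation. A minimal sketch (the sizes come from the log above; the variable names are hypothetical):

    import torch

    target_hidden, draft_hidden = 8192, 6144

    # Current behavior: the metadata buffer slot is allocated from the
    # target model's hidden size, but the EAGLE3 draft emits 6144-wide states.
    buffer_slot = torch.zeros(target_hidden)
    draft_state = torch.randn(draft_hidden)
    buffer_slot.copy_(draft_state)  # RuntimeError: size of tensor a (8192) must match b (6144)

    # Sizing the slot from the draft model's hidden size makes the copy succeed.
    buffer_slot = torch.zeros(draft_hidden)
    buffer_slot.copy_(draft_state)  # OK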

Modifications

I just pick the hidden size from the draft model's config. I am not sure whether the scheduler is the right place to hold the draft model's config, as I did; I am open to suggestions!

I also changed the MetadataBuffers to avoid allocating them when they are not needed. In my understanding they are only needed for EAGLE, but I am not familiar with this part of the codebase, so please have a look! (I tested with NGRAM and it works, but I was not able to run standalone speculation.)
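
In pseudocode, the intent of the change is roughly the following (attribute names are assumptions based on the review discussion below; the merged version later replaces the 0 with a small padded width):

    # Size the hidden-state transfer buffers from the draft model's config
    # rather than the target model's, and skip them entirely for algorithms
    # that never transfer hidden states (e.g. NGRAM).
    if drafter_config is not None:  # an EAGLE/EAGLE3 draft model is loaded
        hidden_size = drafter_config.hf_text_config.hidden_size
    else:
        hidden_size = 0  # nothing to transfer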

Accuracy Tests

No need. Before, it was crashing; now it works.

Benchmarking and Profiling

No need. Before, it was crashing; now it works.
To reproduce, I run the following script to start the servers:

#!/usr/bin/env bash
set -e


# EAGLE3

CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
    --model /storage/datasets/huggingface/models/Llama-3.3-70B-Instruct  \
    --cuda-graph-max-bs 64 \
    --max-running-requests 64 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path /storage/datasets/huggingface/models/sglang-EAGLE3-LLaMA3.3-Instruct-70B \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl \
    --tp 2 &


CUDA_VISIBLE_DEVICES=2,3 python3 -m sglang.launch_server \
    --model /storage/datasets/huggingface/models/Llama-3.3-70B-Instruct  \
    --cuda-graph-max-bs 2 \
    --max-running-requests 2 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path /storage/datasets/huggingface/models/sglang-EAGLE3-LLaMA3.3-Instruct-70B \
    --disaggregation-mode decode \
    --port 30001 \
    --disaggregation-transfer-backend nixl \
    --tp 2 &


# TO RUN FOR ALL

python3 -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill http://127.0.0.1:30000 \
    --decode http://127.0.0.1:30001 \
    --host 0.0.0.0 \
    --port 8000

# Done when it says "Starting server on 0.0.0.0:8000"

Once the servers are running, you can test with:

python3 -m sglang.bench_serving \
    --model /storage/datasets/huggingface/models/Llama-3.3-70B-Instruct  \
    --base-url "http://127.0.0.1:8000" \
    --dataset-name "sharegpt" \
    --num-prompts 10 \
    --sharegpt-output-len 100 \
    --disable-ignore-eos \
    --max-concurrency 4 \
    --pd-separated

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @michelemarzollo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the PD disaggregation mechanism, specifically within the EAGLE/EAGLE3 speculative decoding algorithms. Previously, the system incorrectly allocated hidden state buffers based on the target model's configuration, leading to crashes when the draft and target models had differing hidden state dimensions. The fix ensures that these buffers are now correctly sized according to the draft model's specifications, enabling robust operation with diverse model architectures.

Highlights

  • Bug Fix for Hidden State Size Mismatch: Addresses a RuntimeError occurring in PD disaggregation when the target and draft models have different hidden state sizes, specifically when allocating buffers for the draft model's hidden states. This previously led to crashes.
  • Corrected Buffer Allocation Logic: Modifies the buffer allocation for hidden states in PD disaggregation to correctly use the hidden_size and dtype from the draft model's configuration (drafter_config) instead of the target model's configuration (model_config).
  • Introduction of drafter_config: A new drafter_config attribute is introduced in the scheduler to store the ModelConfig of the speculative draft model, ensuring its properties are correctly referenced for operations like buffer allocation.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a bug where the disaggregation buffers for EAGLE were allocated using the target model's hidden state size, causing a crash when it differed from the draft model's size. The fix involves introducing drafter_config to hold the draft model's configuration and using it to get the correct hidden state size.

Your approach of storing drafter_config on the scheduler is reasonable. It's the central component for managing configurations, so it's a logical place for it.

I have a few suggestions to improve the robustness and correctness of the implementation:

  1. The initialization of drafter_config should be more comprehensive, including other relevant server arguments rather than just the model path.
  2. The usage of drafter_config in init_disaggregation should safely handle cases where disaggregation is used with a non-EAGLE speculative algorithm, to prevent crashes.

Details are in the line comments. Overall, this is a good fix for the reported issue.

Comment on lines 344 to 346
self.drafter_config = ModelConfig(
    self.server_args.speculative_draft_model_path
)
Contributor

critical

The ModelConfig for the draft model is initialized with only the model path. This is incorrect as it misses other important configurations from server_args like trust_remote_code, revision, dtype, quantization, etc. This can lead to incorrect behavior or crashes. You should use ModelConfig.from_server_args to ensure all relevant arguments are passed, similar to how it's done for the draft worker in TpModelWorker.

Suggested change
-self.drafter_config = ModelConfig(
-    self.server_args.speculative_draft_model_path
-)
+self.drafter_config = ModelConfig.from_server_args(
+    self.server_args,
+    model_path=self.server_args.speculative_draft_model_path,
+    model_revision=self.server_args.speculative_draft_model_revision,
+    is_draft_model=True,
+    override_config_file=self.server_args.decrypted_draft_config_file,
+)

Contributor Author

Done

Comment on lines 871 to 872
hidden_size=self.drafter_config.hf_text_config.hidden_size,
hidden_states_dtype=self.drafter_config.dtype,
Contributor

high

This will raise an AttributeError if disaggregation is enabled for a non-EAGLE speculative algorithm, because self.drafter_config would be None. To make this more robust, you should fall back to self.model_config if self.drafter_config is not available. This maintains the previous behavior for non-EAGLE cases while fixing the bug for EAGLE.

Suggested change
-hidden_size=self.drafter_config.hf_text_config.hidden_size,
-hidden_states_dtype=self.drafter_config.dtype,
+hidden_size=(self.drafter_config or self.model_config).hf_text_config.hidden_size,
+hidden_states_dtype=(self.drafter_config or self.model_config).dtype,

Contributor Author

No need to pass a dimension if we don't have any states to move: let's pass 0.

Comment on lines 916 to 917
hidden_size=self.drafter_config.hf_text_config.hidden_size,
hidden_states_dtype=self.drafter_config.dtype,
Contributor

high

Similar to the issue in the DisaggregationMode.DECODE branch, this will crash if disaggregation is enabled for a non-EAGLE speculative algorithm. You should fall back to self.model_config to prevent this.

Suggested change
-hidden_size=self.drafter_config.hf_text_config.hidden_size,
-hidden_states_dtype=self.drafter_config.dtype,
+hidden_size=(self.drafter_config or self.model_config).hf_text_config.hidden_size,
+hidden_states_dtype=(self.drafter_config or self.model_config).dtype,

Contributor Author

Same as above.

@michelemarzollo michelemarzollo marked this pull request as draft November 19, 2025 15:16
@michelemarzollo michelemarzollo marked this pull request as ready for review November 19, 2025 17:03
@ZeldaHuang
Contributor

hi @michelemarzollo , you can try to change

hidden_size=self.model_config.hf_text_config.hidden_size,
and
hidden_size=self.model_config.hf_text_config.hidden_size,

to self.draft_worker.model_config.hidden_size when using eagle3

@ShangmingCai
Collaborator

@michelemarzollo Thx for the bug report and the PR, will review this PR today.

Comment on lines 125 to 148
-        # For PD + spec decode
-        self.output_topk_p = torch.zeros(
-            (size, 16), dtype=torch.float32, device=device
-        )
-        self.output_topk_index = torch.zeros(
-            (size, 16), dtype=torch.int64, device=device
-        )
-        self.output_hidden_states = torch.zeros(
-            (size, hidden_size), dtype=hidden_states_dtype, device=device
-        )
+        self.require_hidden_states = require_hidden_states
+        if self.require_hidden_states:
+            # For PD + spec decode
+            self.output_topk_p = torch.zeros(
+                (size, 16), dtype=torch.float32, device=device
+            )
+            self.output_topk_index = torch.zeros(
+                (size, 16), dtype=torch.int64, device=device
+            )
+            self.output_hidden_states = torch.zeros(
+                (size, hidden_size), dtype=hidden_states_dtype, device=device
+            )
+        else:
+            # Other methods don't need hidden states
+            self.output_topk_p = torch.empty(
+                (size, 0), dtype=torch.float32, device=device
+            )
+            self.output_topk_index = torch.empty(
+                (size, 0), dtype=torch.int64, device=device
+            )
+            self.output_hidden_states = torch.empty(
+                (size, 0), dtype=hidden_states_dtype, device=device
+            )
Collaborator

Do these changes affect the correctness of get_buf_infos, get_buf as well? I see you only modified the set_buf interface.

@ShangmingCai
Collaborator

@michelemarzollo Can you try @ZeldaHuang's method? I think we should minimize the changes. We don't need to filter out the hidden states, since the metadata buffer slots are actually tiny compared to the KV cache. But feel free to run some experiments to verify the overhead; it would be great to keep the code clean while avoiding unnecessary overhead.

@michelemarzollo
Contributor Author

Thanks a lot to both of you for the comments! I agree with all of them. I now get the hidden state size from the drafter, and otherwise pass 0 (I don't see any reason to pass the target model's hidden size; I find it even misleading). I no longer modify the utils file (the overhead is indeed tiny, especially since the size is 0 when not using EAGLE). Let me know if there is still something you don't like!
I ran throughput tests for EAGLE, NGRAM, and autoregressive decoding, and things look as expected.

@ShangmingCai
Collaborator

I am not quite sure whether passing 0 is safe in all scenarios; a padding value like 64 (the RDMA granularity) sounds safer to me.

@michelemarzollo
Contributor Author

Sure, 64 looks good to me. I don't know this part well enough to tell whether there could be issues, so I trust you. I can even keep the target's hidden size if you prefer; I just couldn't see why :)

@ShangmingCai
Collaborator

/tag-and-rerun-ci

Collaborator

@ShangmingCai ShangmingCai left a comment

The only failed CI test is irrelevant.

This PR LGTM, and I think it is ready to merge now. I will bypass CI after a small comment fix.

@ShangmingCai
Collaborator

Minor modifications:

  • Since float32 is 4 bytes, padding the unused hidden size to 16 elements (16 × 4 B = 64 B, one RDMA granule) is enough.
  • Add @ZeldaHuang as co-author.
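
Putting the thread's conclusion together, a hedged sketch of the merged sizing rule (the helper and names are hypothetical, not the actual sglang code):

    # EAGLE-style speculation sizes the buffer from the draft model; other
    # algorithms keep a small RDMA-aligned width instead of 0.
    UNUSED_HIDDEN_PAD = 16  # 16 elements * 4 B (float32) = 64 B, one RDMA granule

    def metadata_hidden_size(drafter_config) -> int:
        if drafter_config is not None:  # EAGLE / EAGLE3
            return drafter_config.hf_text_config.hidden_size
        return UNUSED_HIDDEN_PAD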

@ShangmingCai ShangmingCai merged commit b30f63c into sgl-project:main Nov 21, 2025
26 of 53 checks passed
yukavio pushed a commit to yukavio/sglang that referenced this pull request Nov 25, 2025
