[Feature] Spec-Overlap supporting DP-ATTN; PD-Disaggregation; npugraph mode #12443
hnyls2002 merged 18 commits into sgl-project:main
Conversation
- Upload a torch profile to show that the overlap actually happens and there is no CPU overhead.
- Rule: if you change any logic in the (overlap) scheduler, attach a torch profile showing that the overlap actually happens.
- Compare the speed and acceptance length of overlap vs. non-overlap.
- Get someone to verify on GPU.
- Add a test case.
So is it ready for review now? I will find some time this week or maybe next week to finish the PD part.

Yes, it's ready. I will finish the description today; you can review it first. Thank you.
new_batch.process_prebuilt_extend(self.server_args, self.model_config)

if self.spec_algorithm.is_eagle() and self.enable_overlap:
    new_batch.spec_info.future_indices = self.future_map.alloc_future_indices(
Move this assignment into `new_batch.process_prebuilt_extend()`, passing `self.future_map` in as an argument.
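A minimal sketch of the suggested refactor, assuming the names from the diff: the future-index allocation moves out of the scheduler into the batch's own prebuilt-extend step, with the future map passed in. The class bodies here are stand-ins, not the real SGLang implementation.

```python
# Hypothetical sketch of the suggested refactor. FutureMap, SpecInfo, and
# Batch mirror names from the diff, but their bodies are stand-ins.

class FutureMap:
    def __init__(self):
        self._next = 0

    def alloc_future_indices(self, n: int) -> list[int]:
        # Stand-in allocator: hand out n fresh slot indices.
        out = list(range(self._next, self._next + n))
        self._next += n
        return out

class SpecInfo:
    def __init__(self):
        self.future_indices = None

class Batch:
    def __init__(self, num_reqs: int):
        self.num_reqs = num_reqs
        self.spec_info = SpecInfo()

    def process_prebuilt_extend(self, server_args, model_config, future_map=None):
        # ... existing prebuilt-extend work ...
        # With future_map passed in, the batch allocates its own slots,
        # so the scheduler no longer assigns spec_info.future_indices afterwards.
        if future_map is not None:
            self.spec_info.future_indices = future_map.alloc_future_indices(self.num_reqs)

batch = Batch(num_reqs=4)
batch.process_prebuilt_extend(server_args=None, model_config=None, future_map=FutureMap())
print(batch.spec_info.future_indices)  # [0, 1, 2, 3]
```

The scheduler call site then shrinks to a single `process_prebuilt_extend(...)` call, keeping the eagle/overlap branching inside the batch.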
):
    intv = future_indices.interval
    if self.spec_algo.is_eagle():
        if self.is_empty_slice(intv):
Extract this logic into a separate function.
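A sketch of what the suggested extraction could look like, assuming `is_empty_slice` tests whether the interval selects no elements. The helper names are hypothetical, not taken from the PR.

```python
# Hypothetical sketch: the nested eagle / empty-slice checks become one
# small helper. is_empty_slice and should_skip_eagle_resolve are stand-ins.

def is_empty_slice(intv: slice) -> bool:
    # A slice is "empty" when it selects no elements.
    start = intv.start or 0
    stop = intv.stop if intv.stop is not None else start
    return stop <= start

def should_skip_eagle_resolve(spec_algo_is_eagle: bool, intv: slice) -> bool:
    # Extracted logic: only eagle runs need the empty-slice early-out.
    return spec_algo_is_eagle and is_empty_slice(intv)

print(should_skip_eagle_resolve(True, slice(3, 3)))  # True
print(should_skip_eagle_resolve(True, slice(0, 4)))  # False
```

Pulling the condition out gives the branch a name and keeps the caller to one `if`, which is what the review comment is asking for.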
verified_id=self.output_ids,
new_seq_lens=self.seq_lens,
allocate_lens=self.seq_lens,
num_tokens_per_batch=1,
Add comments here explaining what each field means.
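One way the requested comments might read, sketched with a stand-in container. The field semantics here are my reading of the diff (accepted ids, post-step lengths, KV allocation lengths, tokens per request), not confirmed by the PR.

```python
# Hypothetical sketch of the requested field comments. DraftInput is a
# stand-in; the real SGLang type and semantics may differ.
from dataclasses import dataclass

@dataclass
class DraftInput:
    verified_id: list          # token ids accepted by the verify step
    new_seq_lens: list         # per-request sequence lengths after this step
    allocate_lens: list        # lengths used for KV-cache allocation
    num_tokens_per_batch: int  # tokens emitted per request (1 in plain decode)

inp = DraftInput(
    verified_id=[101, 102],
    new_seq_lens=[8, 12],
    allocate_lens=[8, 12],
    num_tokens_per_batch=1,
)
print(inp.num_tokens_per_batch)  # 1
```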
(
    (
        self.max_bs * self.num_tokens_per_bs
        if not hasattr(eagle_worker, "model_runner")
Use `self.forward_mode = ForwardMode.DRAFT_EXTEND_V2` instead.
if spec_info.future_indices is not None:
    self.future_indices = FutureIndices(
        indices=torch.cat(
            [self.future_indices.indices, spec_info.future_indices.indices]
verified_id=next_token_ids,
new_seq_lens=batch.seq_lens,
allocate_lens=batch.seq_lens,
num_tokens_per_batch=1,
I see, I copied this test case from … Update: copied a case from …
# Conflicts:
#   python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py
#   python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
I noticed that the draft_extend_for_decode stage in EagleDraftWorker currently only supports NPUGraph. When will CUDA graph support be added, and what challenges are being encountered? If it's blocked by other higher-priority tasks, I can help contribute the support.

if self.draft_extend_attn_backend and _is_npu:
    tic = time.perf_counter()
    before_mem = get_available_gpu_memory(self.device, self.gpu_id)
    logger.info(
        f"Capture draft extend cuda graph begin. This can take up to several minutes. avail mem={before_mem:.2f} GB"
    )
Motivation
SGLang already implements speculative decoding with an overlapped scheduler; this PR extends that feature to support DP-Attention with MoE models and PD Disaggregation.
Moreover, this PR also introduces NPUGraph support for draft worker v2, which heavily reduces launch overheads.
Modifications
Accuracy Tests
Benchmarking and Profiling
Benchmarking on NPU:

Profiling (CUDA profilings are reduced to only 4 layers):
https://drive.google.com/drive/folders/1qRrbQTO-2ia-23N4yevHVKk2Jr2ZrGDX?usp=sharing
Checklist