[Feature] Spec-Overlap supporting DP-ATTN; PD-Disaggregation; npugraph mode#12443

Merged
hnyls2002 merged 18 commits into sgl-project:main from iforgetmyname:feature/mtp_1
Nov 15, 2025

@iforgetmyname
Collaborator

@iforgetmyname iforgetmyname commented Oct 31, 2025

Motivation

Since SGLang already implements speculative decoding with the overlap scheduler, we would like to extend that feature to support DP-Attention with MoE models and PD disaggregation.
Moreover, this PR also introduces NPUGraph support for draft worker v2, which heavily reduces launch overhead.

Modifications

  • Handle padding issues and idle batches when dp-attention is enabled
  • Pre-allocate enough space under PD disaggregation, assuming all draft tokens will be accepted
  • Add NPUGraph support for the draft worker, extracting abstract functions from the cudagraph implementation
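
The first two items can be illustrated with a minimal sketch. This is not the PR's code: `padded_token_counts` and `prealloc_kv_len` are hypothetical helpers, and it assumes that each DP rank pads its batch up to the largest rank's token count and that the worst case for pre-allocation is every draft token being accepted in one verify step.

```python
from typing import List


def padded_token_counts(local_batch_sizes: List[int], tokens_per_req: int) -> List[int]:
    """Return how many padding tokens each DP rank adds to match the largest rank.

    A rank with batch size 0 is an idle batch: it still runs a forward pass
    made entirely of padding so that collective ops stay in step.
    (Hypothetical helper, for illustration only.)
    """
    target = max(local_batch_sizes) * tokens_per_req
    return [target - bs * tokens_per_req for bs in local_batch_sizes]


def prealloc_kv_len(seq_len: int, num_draft_tokens: int) -> int:
    """Worst-case KV length after one verify step: all draft tokens accepted.

    (Hypothetical helper, for illustration only.)
    """
    return seq_len + num_draft_tokens


if __name__ == "__main__":
    # Three DP ranks with 3, 0 (idle) and 1 requests, 4 tokens per request.
    print(padded_token_counts([3, 0, 1], tokens_per_req=4))  # [0, 12, 8]
    print(prealloc_kv_len(seq_len=128, num_draft_tokens=4))  # 132
```

Reserving for the worst case up front trades some memory for never having to grow an allocation in the middle of a step.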

Accuracy Tests

Benchmarking and Profiling

Benchmarking on NPU:
image

Profiling (CUDA profiles are reduced to only 4 layers):
https://drive.google.com/drive/folders/1qRrbQTO-2ia-23N4yevHVKk2Jr2ZrGDX?usp=sharing

Checklist

Member

@sglang-bot sglang-bot left a comment

  1. Upload a torch profile to show that the overlap actually happens and there is no CPU overhead.
    • Rule: If you change any logic in the (overlap) scheduler, attach a torch profile to show the overlap actually happens.
  2. Compare the speed and acceptance length of overlap vs. non-overlap.
  3. Get someone to verify on GPU.
  4. Add a test case.
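
For item 2, a comparison could be summarized along these lines. This is only a sketch: `summarize` and `speedup` are hypothetical helpers, the per-step token counts are assumed to come from your own benchmark logs, and the numbers in the usage example are placeholders, not measurements from this PR.

```python
from typing import Dict, Sequence


def summarize(tokens_per_step: Sequence[int], elapsed_s: float) -> Dict[str, float]:
    """Summarize one run: mean tokens emitted per verify step and throughput."""
    total = sum(tokens_per_step)
    return {
        "accept_len": total / len(tokens_per_step),
        "tok_per_s": total / elapsed_s,
    }


def speedup(overlap: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Throughput ratio of the overlap run over the non-overlap run."""
    return overlap["tok_per_s"] / baseline["tok_per_s"]


if __name__ == "__main__":
    # Placeholder inputs only; substitute values from real runs.
    baseline = summarize([3, 2, 4, 3], elapsed_s=2.0)
    overlap = summarize([3, 2, 4, 3], elapsed_s=1.5)
    print(baseline["accept_len"], overlap["accept_len"])
    print(speedup(overlap, baseline))
```

If overlap only hides scheduling overhead, the acceptance length would be expected to stay roughly the same between the two runs while throughput changes, so a large gap in acceptance length is worth investigating.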

@ShangmingCai
Collaborator

So is it ready for review now? Will find some time this week or maybe next week to finish the PD part.

@ping1jing2
Collaborator

> So is it ready for review now? Will find some time this week or maybe next week to finish the PD part.

Yes, it's OK. I will finish the description today; you can review it first. Thank you.

new_batch.process_prebuilt_extend(self.server_args, self.model_config)

if self.spec_algorithm.is_eagle() and self.enable_overlap:
    new_batch.spec_info.future_indices = self.future_map.alloc_future_indices(
Collaborator Author

Move this assignment into new_batch.process_prebuilt_extend() and pass self.future_map in as an argument.
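
The suggested refactor might look roughly like the sketch below. Everything here is a hypothetical stand-in: `FutureMapStub`, `SpecInfoStub`, `BatchStub`, the `future_map` keyword, and the argument passed to `alloc_future_indices` are assumptions for illustration, not SGLang's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FutureMapStub:
    """Hypothetical stand-in that hands out consecutive slot indices."""
    next_slot: int = 0

    def alloc_future_indices(self, n: int) -> List[int]:
        start = self.next_slot
        self.next_slot += n
        return list(range(start, start + n))


@dataclass
class SpecInfoStub:
    future_indices: Optional[List[int]] = None


@dataclass
class BatchStub:
    batch_size: int
    spec_info: SpecInfoStub = field(default_factory=SpecInfoStub)

    def process_prebuilt_extend(self, future_map: Optional[FutureMapStub] = None) -> None:
        # ... the existing prebuilt-extend bookkeeping would go here ...
        # The allocation now lives inside the method; callers pass the map
        # only when speculative overlap is enabled.
        if future_map is not None:
            self.spec_info.future_indices = future_map.alloc_future_indices(self.batch_size)


if __name__ == "__main__":
    future_map = FutureMapStub()
    batch = BatchStub(batch_size=3)
    batch.process_prebuilt_extend(future_map=future_map)
    print(batch.spec_info.future_indices)  # [0, 1, 2]
```

Passing the map as an optional argument keeps the call site to a single line and leaves the decision about whether to allocate with the caller.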

):
    intv = future_indices.interval
    if self.spec_algo.is_eagle():
        if self.is_empty_slice(intv):
Collaborator Author

Extract this logic into a separate function.

verified_id=self.output_ids,
new_seq_lens=self.seq_lens,
allocate_lens=self.seq_lens,
num_tokens_per_batch=1,
Collaborator Author

add comments here

(
    (
        self.max_bs * self.num_tokens_per_bs
        if not hasattr(eagle_worker, "model_runner")
Collaborator Author

use self.forward_mode = ForwardMode.DRAFT_EXTEND_V2

self.future_indices = FutureIndices(
indices=torch.cat(
[self.future_indices.indices, spec_info.future_indices.indices]
if spec_info.future_indices is not None:
Collaborator Author

revert this

verified_id=next_token_ids,
new_seq_lens=batch.seq_lens,
allocate_lens=batch.seq_lens,
num_tokens_per_batch=1,
Collaborator Author

add comments

@iforgetmyname iforgetmyname changed the title from "[NPU] support mtp(beta) pd disaggregation and dp attention" to "[Feature] Spec-Overlap supporting DP-ATTN; PD-Disaggregation; npugraph mode" on Nov 8, 2025
@hnyls2002
Collaborator

@iforgetmyname
Collaborator Author

iforgetmyname commented Nov 12, 2025

> CI is broken https://github.com/sgl-project/sglang/actions/runs/19293114439/job/55183519002?pr=12443#step:5:3666

I see. I copied this test case from sglang/test/srt/test_eagle_dp_attention.py and just found that some of its specs are not supported by spec-overlap yet. I should pick another passing case and modify it to support dp-attention; any suggestions for such a case?

Update: copied a case from test_deepseek_v3_fp4_4gpu.py, which already works with spec-overlap and whose model supports dp-attention.

@hnyls2002 hnyls2002 self-assigned this Nov 13, 2025
# Conflicts:
#	python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py
#	python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
@hnyls2002 hnyls2002 merged commit 2aec8b6 into sgl-project:main Nov 15, 2025
173 of 210 checks passed
@iforgetmyname iforgetmyname deleted the feature/mtp_1 branch November 15, 2025 13:52
@attack204
Contributor

I noticed that the draft_extend_for_decode stage in EagleDraftWorker currently only supports NPUGraph. I'd like to ask when cudagraph support will be added, or what challenges are being encountered? If it's due to other higher priority tasks, I can help contribute to the support.

@iforgetmyname

    if self.draft_extend_attn_backend and _is_npu:  
        tic = time.perf_counter()  
        before_mem = get_available_gpu_memory(self.device, self.gpu_id)  
        logger.info(  
            f"Capture draft extend cuda graph begin. This can take up to several minutes. avail mem={before_mem:.2f} GB"  
        )  
