[Feature] Spec-Overlap supporting DP-ATTN; PD-Disaggregation; npugraph mode #12443
hnyls2002 merged 18 commits into sgl-project:main
Conversation
- Upload a torch profile to show that the overlap actually happens and there is no CPU overhead.
- Rule: if you change any logic in the (overlap) scheduler, attach a torch profile showing that the overlap actually happens.
- Compare the speed and acceptance length of overlap vs. non-overlap.
- Get someone to verify on GPU.
- Add a test case.
So is it ready for review now? I will find some time this week or maybe next week to finish the PD part.

Yes, it's ready. I will finish the description today; you can review it first. Thank you.
new_batch.process_prebuilt_extend(self.server_args, self.model_config)

if self.spec_algorithm.is_eagle() and self.enable_overlap:
    new_batch.spec_info.future_indices = self.future_map.alloc_future_indices(
Move this assignment into `new_batch.process_prebuilt_extend()`, passing `self.future_map` in as an argument.
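A minimal sketch of the suggested refactor, assuming the names from the diff: the future-index allocation moves out of the scheduler into the batch's own prebuilt-extend step, with the future map passed in. The class bodies here are stand-ins, not the real SGLang implementation.

```python
# Hypothetical sketch of the suggested refactor. FutureMap, SpecInfo, and
# Batch mirror names from the diff, but their bodies are stand-ins.

class FutureMap:
    def __init__(self):
        self._next = 0

    def alloc_future_indices(self, n: int) -> list[int]:
        # Stand-in allocator: hand out n fresh slot indices.
        out = list(range(self._next, self._next + n))
        self._next += n
        return out

class SpecInfo:
    def __init__(self):
        self.future_indices = None

class Batch:
    def __init__(self, num_reqs: int):
        self.num_reqs = num_reqs
        self.spec_info = SpecInfo()

    def process_prebuilt_extend(self, server_args, model_config, future_map=None):
        # ... existing prebuilt-extend work ...
        # With future_map passed in, the batch allocates its own slots,
        # so the scheduler no longer assigns spec_info.future_indices afterwards.
        if future_map is not None:
            self.spec_info.future_indices = future_map.alloc_future_indices(self.num_reqs)

batch = Batch(num_reqs=4)
batch.process_prebuilt_extend(server_args=None, model_config=None, future_map=FutureMap())
print(batch.spec_info.future_indices)  # [0, 1, 2, 3]
```

The scheduler call site then shrinks to a single `process_prebuilt_extend(...)` call, keeping the eagle/overlap branching inside the batch.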
):
    intv = future_indices.interval
    if self.spec_algo.is_eagle():
        if self.is_empty_slice(intv):
Extract this logic into a separate function.
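A sketch of what the suggested extraction could look like, assuming `is_empty_slice` tests whether the interval selects no elements. The helper names are hypothetical, not taken from the PR.

```python
# Hypothetical sketch: the nested eagle / empty-slice checks become one
# small helper. is_empty_slice and should_skip_eagle_resolve are stand-ins.

def is_empty_slice(intv: slice) -> bool:
    # A slice is "empty" when it selects no elements.
    start = intv.start or 0
    stop = intv.stop if intv.stop is not None else start
    return stop <= start

def should_skip_eagle_resolve(spec_algo_is_eagle: bool, intv: slice) -> bool:
    # Extracted logic: only eagle runs need the empty-slice early-out.
    return spec_algo_is_eagle and is_empty_slice(intv)

print(should_skip_eagle_resolve(True, slice(3, 3)))  # True
print(should_skip_eagle_resolve(True, slice(0, 4)))  # False
```

Pulling the condition out gives the branch a name and keeps the caller to one `if`, which is what the review comment is asking for.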
verified_id=self.output_ids,
new_seq_lens=self.seq_lens,
allocate_lens=self.seq_lens,
num_tokens_per_batch=1,
Add comments here explaining what each field means.
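One way the requested comments might read, sketched with a stand-in container. The field semantics here are my reading of the diff (accepted ids, post-step lengths, KV allocation lengths, tokens per request), not confirmed by the PR.

```python
# Hypothetical sketch of the requested field comments. DraftInput is a
# stand-in; the real SGLang type and semantics may differ.
from dataclasses import dataclass

@dataclass
class DraftInput:
    verified_id: list          # token ids accepted by the verify step
    new_seq_lens: list         # per-request sequence lengths after this step
    allocate_lens: list        # lengths used for KV-cache allocation
    num_tokens_per_batch: int  # tokens emitted per request (1 in plain decode)

inp = DraftInput(
    verified_id=[101, 102],
    new_seq_lens=[8, 12],
    allocate_lens=[8, 12],
    num_tokens_per_batch=1,
)
print(inp.num_tokens_per_batch)  # 1
```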
(
    (
        self.max_bs * self.num_tokens_per_bs
        if not hasattr(eagle_worker, "model_runner")
Use `self.forward_mode = ForwardMode.DRAFT_EXTEND_V2` instead.
if spec_info.future_indices is not None:
    self.future_indices = FutureIndices(
        indices=torch.cat(
            [self.future_indices.indices, spec_info.future_indices.indices]
verified_id=next_token_ids,
new_seq_lens=batch.seq_lens,
allocate_lens=batch.seq_lens,
num_tokens_per_batch=1,
I see, I copied this test case from … Update: copied a case from …
# Conflicts:
#   python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py
#   python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
I noticed that the draft_extend_for_decode stage in EagleDraftWorker currently only supports NPUGraph. When will CUDA graph support be added, and what challenges are being encountered? If it's blocked by other higher-priority tasks, I can help contribute the support.

if self.draft_extend_attn_backend and _is_npu:
    tic = time.perf_counter()
    before_mem = get_available_gpu_memory(self.device, self.gpu_id)
    logger.info(
        f"Capture draft extend cuda graph begin. This can take up to several minutes. avail mem={before_mem:.2f} GB"
    )
Motivation
SGLang already implements speculative decoding with an overlapped scheduler; this PR extends that feature to support DP-Attention with MoE models and PD Disaggregation.
Moreover, this PR also introduces NPUGraph support for draft worker v2, which heavily reduces launch overheads.
Modifications
Accuracy Tests
Benchmarking and Profiling
Benchmarking on NPU:

Profiling (CUDA profilings are reduced to only 4 layers):
https://drive.google.com/drive/folders/1qRrbQTO-2ia-23N4yevHVKk2Jr2ZrGDX?usp=sharing
Checklist