[feature] implement dcp for deepseek_v2 #14194
staugust wants to merge 22 commits into sgl-project:main
Conversation
Summary of Changes: Hello @staugust, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request introduces Decode Context Parallel (DCP) capabilities, primarily targeting the deepseek_v2 model.
Code Review
This pull request introduces Decode Context Parallelism (DCP) for the deepseek_v2 model, which is a significant enhancement for distributed inference. The implementation correctly integrates DCP by interleaving KV cache storage across ranks and adjusting attention calculations. Key changes include modifications to parallel state management, a new DcpTokenToKVPoolAllocator for interleaved KV cache, and updates to attention backends (flashinfer_mla_backend.py, flashmla_backend.py, deepseek_v2.py) to handle DCP-specific logic for query components and attention output. A new test file test_dcp_interleaved_storage.py has been added, providing good coverage for the new allocator logic. Overall, the changes are well-structured and address the requirements for DCP.
# the input_tensor contiguous. Possible bug in reduce_scatter_tensor?
input_tensor = input_.movedim(0, dim).contiguous()

assert input_tensor.shape[0] % world_size == 0
The assertion input_tensor.shape[0] % world_size == 0 implies a strict requirement that the input tensor's first dimension must be divisible by the world size. This is a critical constraint for the reduce_scatter_along_dim operation. Please add a docstring to the function clearly stating this precondition, or consider how to handle cases where this might not hold true (e.g., padding or a more flexible splitting strategy).
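For illustration, here is a minimal sketch of what such a documented precondition could look like; the signature, the padding advice, and the `movedim` direction are assumptions and simplifications, not the PR's exact implementation.

```python
import torch
import torch.distributed as dist

def reduce_scatter_along_dim(input_: torch.Tensor, dim: int, group=None) -> torch.Tensor:
    """Reduce-scatter `input_` along `dim` across the ranks of `group`.

    Precondition: input_.shape[dim] must be divisible by the group's world
    size; callers are expected to pad (e.g. to a multiple of dcp_world_size)
    before calling this helper.
    """
    world_size = dist.get_world_size(group=group)
    # Bring the scatter dim to the front and make the buffer contiguous;
    # reduce_scatter_tensor works on the flat contiguous buffer.
    input_tensor = input_.movedim(dim, 0).contiguous()
    assert input_tensor.shape[0] % world_size == 0, (
        f"size {input_tensor.shape[0]} along dim {dim} is not divisible by "
        f"world_size {world_size}"
    )
    output = torch.empty(
        (input_tensor.shape[0] // world_size, *input_tensor.shape[1:]),
        dtype=input_tensor.dtype,
        device=input_tensor.device,
    )
    dist.reduce_scatter_tensor(output, input_tensor, group=group)
    return output.movedim(0, dim)
```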
assert (
    self.dcp_world_size == 1
), "FlashMLA does not support DCP for FP8 kv cache"
# Note: This will produce an incorrect answer if we don't make
# the input_tensor contiguous. Possible bug in reduce_scatter_tensor?
This comment highlights a potential issue with reduce_scatter_tensor requiring contiguous input. If this is a known PyTorch bug, it would be beneficial to include a reference to the relevant PyTorch issue tracker. If it's a potential bug in the current implementation, further investigation might be warranted.
)
group_ranks.append(ranks)

# message queue broadcaster is only used in tensor model parallel group
The comment "message queue broadcaster is only used in tensor model parallel group" is directly above the initialization of the _DCP group. If the use_message_queue_broadcaster argument is also relevant for DCP, this comment might be misleading. Please clarify if DCP also utilizes the message queue broadcaster, or if this argument is redundant for DCP initialization.
# Compute local lengths following the same formula as filter_seq_indices.
kv_len_arr_cpu = ((kv_len_arr_cpu - dcp_rank - 1) // dcp_world_size) + 1
The comment "Compute local lengths following the same formula as filter_seq_indices" refers to filter_seq_indices, which is defined later in FlashInferMLAIndicesUpdaterDecode. For better code organization and to avoid potential inconsistencies, consider defining filter_seq_indices as a standalone helper function or a static method that can be easily reused and referenced.
return self.real_allocator.free(free_index)

def filter_local_indices(self, indices):
    # TODO write a triton kernel to make this faster
else:
    self.rotary_emb = None

# TODO(augusto.yjh) the logic here needs to change: local_heads are all heads, and lse must also be returned to correct attn_out
layer_id=self.layer_id,
)

# TODO(augusto.yjh) all_gather q_pe and q_nope_out here; taking tp8 as an example, [1, 8, 64] and [1, 8, 512] become [1, 64, 64] and [1, 64, 512] after all_gather, k_pe is [1, 1, 64] and k_nope is [1, 1, 512], going from local heads to all heads
}

attn_output = self.attn_mqa(
# TODO(augusto.yjh) return lse, correct attn_output
# TODO(augusto.yjh) all_gather lse to correct attn_output
# TODO(augusto.yjh) run reduce_scatter: first reduce to get the correct attn_output, then scatter attn_output by local_num_heads
Force-pushed from 1bc6763 to 89b515d
Force-pushed from 6ed92f2 to cde92cc
When I try to run the code I got:

File "/root/sglang-ant/python/sglang/srt/model_executor/model_runner.py", line 2068
    else:
    ^^^^
SyntaxError: invalid syntax

Wondering if this else block is aligned correctly @staugust
I will check the modifications; maybe something went wrong when rebasing to the main branch.
@Sophie8 Fixed. I'll run both performance and speed benchmarks and paste the results later.
Really need this feature. vLLM TP+DP+DCP has performance issues.

Thank you. Could you share your usage scenario? For example, details about the model architecture, GPU type, and parameters such as TP/DP/DCP? Things get complicated when enabling TP+DP+DCP.

Sure. Our usage scenario is 128K long context or huge batch sizes on 64GB-HBM GPUs (not NVIDIA, but compatible). Without DCP, the KV cache cost is unaffordable. We'd like to tune performance with 16 GPUs using dp8tp2dcp2 (or dp2tp8dcp8, MoE tp16). PS: Is there any performance data for long context on H20?

@heroes999 Got it, I've updated the code; maybe you can give it a try to see whether it works. For now, the extra communication operations introduced by DCP work, but they are not tuned. I have just rebased the code, and I'll post a speed benchmark later.
# build decode context parallel groups
decode_context_model_parallel_size = get_dcp_size_from_env()
if decode_context_model_parallel_size > 1:
Maybe we can move the check to the argument-parsing part to give users better guidance about how to enable DCP and the constraints on enabling it?
Sure, defining a new CLI parameter in server_args.py would be better.
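A hypothetical sketch of such a server argument; the flag name, the SGLANG_DCP fallback, and the divisibility constraint are assumptions for illustration, not sglang's actual server_args API.

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--tp-size", type=int, default=1)
parser.add_argument(
    "--dcp-size",
    type=int,
    default=int(os.environ.get("SGLANG_DCP", "1")),  # env var used in this PR's launch command
    help="Decode context parallel size; assumed to divide --tp-size.",
)
args = parser.parse_args()

# Surface the constraint at argument-parsing time instead of deep in group init.
if args.dcp_size > 1 and args.tp_size % args.dcp_size != 0:
    parser.error("--dcp-size must evenly divide --tp-size when DCP is enabled")
```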
grid = (B, H, 1)

regular_args = (
    out,
Maybe a dumb question, I'm just confused here: both the pointers outputs_ptr and new_output_ptr are `out`; is this intended?
What about tp 4?
The model has H attention heads. TP splits heads; DCP splits tokens.

Final mapping:
- TP0 + DP0:
- TP1 + DP1:
- TP2 + DP0:
- TP3 + DP1:
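To make the head/token split concrete, here is an illustrative sketch for tp_size=4 with two DCP groups; the tp_rank-to-DCP-rank mapping and head counts are assumptions for the example, not necessarily the exact scheme in this PR.

```python
def partition(num_heads: int, num_tokens: int, tp_rank: int,
              tp_size: int = 4, dcp_size: int = 2):
    # TP splits heads: each rank owns a contiguous block of heads.
    heads_per_rank = num_heads // tp_size
    my_heads = list(range(tp_rank * heads_per_rank, (tp_rank + 1) * heads_per_rank))
    # DCP splits tokens: rank r keeps tokens whose position % dcp_size == r.
    dcp_rank = tp_rank % dcp_size  # assumed mapping: TP0/TP2 -> 0, TP1/TP3 -> 1
    my_tokens = [t for t in range(num_tokens) if t % dcp_size == dcp_rank]
    return my_heads, my_tokens

heads, tokens = partition(num_heads=16, num_tokens=8, tp_rank=2)
# heads  -> [8, 9, 10, 11]
# tokens -> [0, 2, 4, 6]
```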
/tag-and-rerun-ci
- correct params for forward_extend in flashinfer_mla
- fix bugs in set dcp_kv_indptr
- make chunked req align with dcp_world_size
- fix bugs in compute attn for deepseek_v2
- estimate pages for dcp with page_size * dcp_world_size
- re-org kv indices with same order

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
- also gather k_pe
- calculate real kv indices
- make prefix_chunk_len align to dcp_world_size
- all gather prefix cache kv which is aligned to dcp_world_size
- fix bugs in fetch extend prefix kv cache
- fix bugs in gather kv for mha one shot
- fix bugs when rebasing to main branch
- fix pre-commit ast errors
- return attn_out and lse when forward_batch is decode, otherwise return attn_out without lse
- only return lse for dcp mla
- correct conditions to return lse for decode

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
…n is incompatible with TP group symm-mem. Modifications will be made after the resolution of the multi-group symmetric memory coexistence issue.)
- misc: remove unneeded code after rebase
- fix: fix ar coredump when dcp uses symmetric memory
- fea: add symm-mem unit perf test
Force-pushed from 71b13d3 to ea0a21a
Here's a performance benchmark with model moonshotai/Kimi-K2-Instruct-0905.

Launch command:

MODEL=/home/models/moonshotai/Kimi-K2-Instruct-0905
SGLANG_DCP_SYMM_ONLY=true \
SGLANG_DCP=8 \
NCCL_DEBUG=WARN \
PYTHONUNBUFFERED=1 \
TORCHINDUCTOR_FX_GRAPH_CACHE=1 \
TORCHINDUCTOR_AUTOGRAD_CACHE=1 \
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
TORCHINDUCTOR_CACHE_DIR=/home/admin/inductor_root_cache \
nohup python3 -m sglang.launch_server \
--model-path ${MODEL} \
--host 0.0.0.0 \
--port 8188 \
--trust-remote-code \
--enable-cache-report \
--log-level info \
--tp-size 8 \
--max-running-requests 48 \
--mem-fraction-static 0.90 \
--chunked-prefill-size 32768 \
--context-length 262144 \
--attention-backend flashinfer \
--disable-radix-cache \
--enable-symm-mem \
  &>> /home/local/workspace/scripts/sglang.out 2>&1 &

Bench Result:
- dcp8+tp8, max concurrency 48
- dcp8+tp8, max concurrency 32
- dcp8+tp8, max concurrency 8
- tp8, max concurrency 8
- attention dp8+tp8, max concurrency 48 (mem-fraction-static updated to 0.94; otherwise there is no CUDA memory left for the KV cache)
Is dcp+tp with full graph supported?

@heroes999 Yes, full CUDA graph is supported.

@heroes999 Each TP rank has to keep the compressed KV cache for models with MLA attention; with DCP, each DCP rank only keeps part of the full compressed KV cache.
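As a rough illustration of why this matters for memory, here is a back-of-the-envelope calculation; kv_lora_rank, rope dim, layer count, and dtype size are assumed example values, not read from any particular checkpoint.

```python
def mla_kv_bytes_per_rank(num_tokens: int, dcp_world_size: int = 1,
                          kv_lora_rank: int = 512, rope_dim: int = 64,
                          num_layers: int = 61, bytes_per_elem: int = 2) -> int:
    """Approximate compressed-KV bytes one rank must hold for `num_tokens`."""
    tokens_per_rank = -(-num_tokens // dcp_world_size)  # ceil division
    per_token = (kv_lora_rank + rope_dim) * bytes_per_elem * num_layers
    return tokens_per_rank * per_token

gib = 1024 ** 3
print(mla_kv_bytes_per_rank(128 * 1024) / gib)                    # ~8.6 GiB, replicated on every TP rank
print(mla_kv_bytes_per_rank(128 * 1024, dcp_world_size=8) / gib)  # ~1.1 GiB per DCP rank
```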
@heroes999 I have made some additions to the document, hoping they will help you understand. cc @staugust
Is tp+dcp+mtp supported? Or are there any plans to support it?

@Yangxinhub For now, this PR does not support tp+dcp+mtp. We'll support tp+dcp+mtp after this PR is reviewed and merged into the main branch.

PP support?

I've enabled TP+DCP+PP before, and it worked without any issues. You can give it a try too.


Motivation
Here's the first step toward fully implementing #12196 to support much longer context with TP 8 on 8xH20.
Currently, it only works with the flashinfer attention backend. It's compatible with chunked prefill and decode CUDA graph. It doesn't support radix cache, PD disaggregation, or MTP.
Update 2025-12-04 21:30: prefix cache supported.
Modifications
Details in Decode-Context Parallelism (DCP) for DeepSeek-v2.
Modifications in computing attn_output
With DCP, the KV cache is split across DCP ranks.
- Each TP rank computes `absorbed_q` with `num_tp_local_q_heads * dcp_size` heads against its partial KV cache. Hence, an all_gather is introduced so that every TP rank inside a DCP group holds the total `absorbed_q`.
- Each rank produces `attn_output` and `lse` for the total `absorbed_q` and its partial KV; the `lse` values are gathered over the DCP group to correct `attn_output` with a computed scaling factor.
- `corrected_attn_output` lets each TP rank keep the final `attn_output` for its `tp_local_q_heads` over the full KV, just like pure TP does.

Here is a simple computation workflow for deepseek_v2 with tp+dcp:

(figure: computation workflow for deepseek_v2 with tp+dcp)

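As a hedged sketch of the lse-based correction step described above (tensor names and shapes are illustrative, not the PR's exact code):

```python
import torch

def merge_partial_attention(attn_out_parts, lse_parts):
    """Combine per-DCP-rank partial attention outputs using their log-sum-exp.

    attn_out_parts: list of [tokens, heads, head_dim] outputs, each computed
        over one rank's slice of the KV cache.
    lse_parts: list of [tokens, heads] log-sum-exp values for those partial
        softmaxes (as gathered over the DCP group).
    """
    lse = torch.stack(lse_parts)                      # [dcp, tokens, heads]
    out = torch.stack(attn_out_parts)                 # [dcp, tokens, heads, dim]
    global_lse = torch.logsumexp(lse, dim=0)          # [tokens, heads]
    scale = torch.exp(lse - global_lse.unsqueeze(0))  # per-part scaling factor
    return (out * scale.unsqueeze(-1)).sum(dim=0)     # [tokens, heads, dim]
```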
Cache Management
To minimize changes to SGLang's core logic, we implemented a new `DCPTokenToKVPoolAllocator` and let `TokenToKVPool` keep `real_kv_size` unchanged. The allocatable KV size is `real_kv_size * dcp_world_size`.
When `ScheduleBatch` checks free space or allocates the KV buffer, `DCPTokenToKVPoolAllocator` behaves as if it can allocate `real_kv_size * dcp_world_size` KV caches. It allocates one KV buffer index for each token in the request and aligns each request to `dcp_world_size * original_page_size` as the alignment unit. This ensures that, for the token position of the corresponding KV cache, `token_idx % dcp_world_size` equals `out_cache_loc % dcp_world_size`, which simplifies the mapping between `out_cache_loc` and its real index in `TokenToKVPool`.

Here's an example with page_size=1, dcp_world_size=4, and two requests (r1: green, r2: yellow):

(figure: interleaved allocation of r1 and r2 indices across the four DCP ranks)

● r1 indices: [0, 1, 2, 3, 8, 9]
● r2 indices: [4, 5, 6]
All ranks see consistent `out_cache_loc` values but keep different KV caches.
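A toy allocator that reproduces the indices in the figure (page_size=1, dcp_world_size=4); it is a sketch of the alignment idea only, not the real `DCPTokenToKVPoolAllocator`.

```python
class ToyDcpAllocator:
    def __init__(self, real_kv_size: int, dcp_world_size: int):
        self.dcp_world_size = dcp_world_size
        # Advertised capacity is real_kv_size * dcp_world_size slots.
        self.capacity = real_kv_size * dcp_world_size
        self.next_free = 0

    def alloc(self, num_tokens: int):
        # Round each allocation up to a multiple of dcp_world_size so that
        # out_cache_loc % dcp_world_size matches the token position % dcp_world_size.
        aligned = -(-num_tokens // self.dcp_world_size) * self.dcp_world_size
        assert self.next_free + aligned <= self.capacity, "out of KV slots"
        start = self.next_free
        self.next_free += aligned
        return list(range(start, start + num_tokens))

alloc = ToyDcpAllocator(real_kv_size=4, dcp_world_size=4)
r1_prefill = alloc.alloc(4)  # [0, 1, 2, 3]
r2 = alloc.alloc(3)          # [4, 5, 6]   (slot 7 left as alignment padding)
r1_decode = alloc.alloc(2)   # [8, 9]
```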
When reading/writing the KV cache from/into `kv_buffer`, the code looks like the following.
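A minimal sketch of this read/write mapping, with hypothetical helper names and based on the `token_idx % dcp_world_size` invariant above (not the PR's actual code):

```python
import torch

def dcp_local_slots(out_cache_loc: torch.Tensor, dcp_rank: int, dcp_world_size: int):
    """Select the locations this rank stores and map them into the real pool.

    A location belongs to this rank iff out_cache_loc % dcp_world_size == dcp_rank;
    its index in the real TokenToKVPool is out_cache_loc // dcp_world_size.
    """
    mask = (out_cache_loc % dcp_world_size) == dcp_rank
    local_idx = out_cache_loc[mask] // dcp_world_size
    return mask, local_idx

def write_kv(kv_buffer: torch.Tensor, out_cache_loc: torch.Tensor,
             new_kv: torch.Tensor, dcp_rank: int, dcp_world_size: int):
    # kv_buffer has real_kv_size slots; only this rank's share of tokens is written.
    mask, local_idx = dcp_local_slots(out_cache_loc, dcp_rank, dcp_world_size)
    kv_buffer[local_idx] = new_kv[mask]
```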
Accuracy Tests
benchmark/gsm8k/bench_sglang.py with dcp8 and chunked-prefill, radix cache enabled.
benchmark/gsm8k/bench_sglang.py with tp8, sglang commit: e8ba5a6
Benchmarking and Profiling
Checklist