[NOT FINAL] add wip DSv4 aggregate and disaggregate recipes #85

Merged
ishandhanani merged 21 commits into main from recipes/dsv4-agg-disagg on Apr 28, 2026
Conversation


@ishandhanani ishandhanani commented Apr 26, 2026

These are a work in progress; don't grab them and try to run them out of the box.

Summary

Creates an upstream NVIDIA/srt-slurm branch that contains the already-merged aggregate DeepSeek-V4 recipe work plus the disaggregated recipe work from fork PR #75.

Because #70 has already merged the aggregate recipes into main, this PR diff is intentionally focused on the remaining disaggregated additions:

  • Adds five GB300 DeepSeek-V4-Pro disaggregated recipes under recipes/gb300-fp4/1k1k-dsv4/.
  • Extends the GB300 DSv4 README with disaggregated topology documentation, NIXL state-buffer caveat, XPYD node semantics, and measured throughput table.
  • Keeps the branch in the upstream repo so follow-up review and iteration no longer depends on the fork branch.

This is intended to supersede the fork-based disagg PR #75. The aggregate recipe portion is already in main via #70, so it is present on this branch but does not reappear in the diff.

Validation

  • uv run srtctl dry-run -f recipes/gb300-fp4/1k1k-dsv4/disagg-*.yaml
  • make check

YAMY1234 and others added 3 commits April 24, 2026 18:04
Adds the dynamo + NIXL disaggregated counterpart to the existing
`gb300-fp4/1k1k-dsv4/agg-*` recipes: 1 prefill node + 1 decode node, each
TP=4 on its own GB300 node, MXFP4 MoE kernels, chunked prefill 4096. Same
DSv4-Pro checkpoint and `dsv4-grace-blackwell` container as the agg
recipes; the nginx fan-in container is pulled from Docker Hub via enroot.

`benchmark.type` is `manual` so the recipe brings the disagg server up
and stops there — pair with sa-bench (custom_tokenizer
`sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` + chat
template) once the server is healthy.

README updated with a `Disaggregated` table to keep the existing agg
matrix intact.

Made-with: Cursor
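For orientation, a minimal sketch of what a recipe with this topology might look like. The field names below are illustrative assumptions, not the actual srt-slurm recipe schema; only the topology and settings (TP=4 per role, chunked prefill 4096, manual benchmark, shared checkpoint and container) come from the description above.

```yaml
# Hypothetical sketch only: field names are assumptions, not the real recipe schema.
model: deepseek-v4-pro              # same DSv4-Pro checkpoint as the agg recipes
container: dsv4-grace-blackwell     # shared container image
frontend: dynamo                    # dynamo + NIXL disaggregated serving path
prefill:
  nodes: 1                          # one GB300 node
  tensor_parallel: 4                # TP=4
  chunked_prefill_size: 4096
decode:
  nodes: 1                          # one GB300 node
  tensor_parallel: 4                # TP=4
benchmark:
  type: manual                      # bring the disagg server up and stop; drive it with sa-bench
```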
Builds on the existing 1P1D TP=4 disagg recipe by adding four more
points along the disagg topology curve, all sharing the same dynamo +
NIXL frontend and the `dsv4-grace-blackwell` container:

- disagg-1p1d-dep4-mega-moe.yaml         (2 nodes,  8 GPU; both DEP=4)
- disagg-1p2d-dep4-to-dep8-mega-moe.yaml (3 nodes, 12 GPU; P DEP=4, D DEP=8)
- disagg-2p2d-dep8-mega-moe.yaml         (4 nodes, 16 GPU; both DEP=8)
- disagg-2p2d-tp8-mxfp4.yaml             (4 nodes, 16 GPU; both TP=8, MXFP4)

DEP recipes use TP+DP+DP-attention+DeepEP (mega_moe / DeepGEMM),
mirroring the agg-balanced-tep / agg-max-tpt-tep topology but split
across prefill and decode roles. Multi-node decode recipes intentionally
do NOT set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 because CAR_V2 is
single-node only and silently corrupts results across nodes.
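As a sketch of the CAR_V2 placement described above (field names assumed for illustration), an asymmetric recipe where prefill is single-node and decode spans two nodes might shape its env like this:

```yaml
# Hypothetical sketch only: field names are assumptions, not the real recipe schema.
prefill:
  env:
    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"   # single-node side, so CAR_V2 is safe here
decode:
  env: {}   # multi-node side: CAR_V2 deliberately omitted
            # (single-node only; silently corrupts results across nodes)
```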

Also tightens the existing disagg-1p1d-tp4-mxfp4.yaml: switches from
`benchmark.type: manual` to a low-latency sa-bench sweep (conc 4..128)
and adds the same mrr / cgmb / mfs knobs as the new recipes for
reproducibility.
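A sketch of what the sa-bench sweep stanza might look like, with field names assumed for illustration; only the 4..128 concurrency range, the 1k/1k workload, and the tokenizer class are taken from this PR:

```yaml
# Hypothetical sketch only: field names are assumptions, not the real recipe schema.
benchmark:
  type: sa-bench                    # replaces the previous `type: manual`
  isl: 1024                         # 1k/1k workload, matching the recipe directory
  osl: 1024
  concurrency: [4, 8, 16, 32, 64, 128]   # the low-latency "conc 4..128" sweep
  custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer
```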

README gains:
- a prominent NIXL state-buffer-fix prerequisite warning (upstream
  sglang PR pending) so reviewers know what container behaviour the
  recipes assume,
- an XPYD = nodes (not instances) clarification,
- a verified-throughput table from sa-bench runs at isl=osl=1024.

Headline: the asymmetric 1P2D DEP4->DEP8 config delivers the highest
per-GPU total token throughput (5,572 TPS/GPU at conc=2048) because at
1k/1k the workload is decode-heavy, so doubling the decode EP domain
(4 -> 8 GPUs) buys far more than scaling prefill.
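(For scale: the 1P2D DEP4->DEP8 point runs on 12 GPUs across 3 nodes, so if that per-GPU number is normalized over all 12 GPUs it corresponds to roughly 5,572 × 12 ≈ 66.9k total tokens/s for the deployment at conc=2048.)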

Recipes are intentionally free of local mounts / debug paths; pick up
the required nixl/conn.py state-buffer-transfer fix via the container
build process until the upstream sglang fix lands.

Made-with: Cursor
@ishandhanani ishandhanani marked this pull request as ready for review April 26, 2026 22:26
@ishandhanani ishandhanani changed the title from "[codex] add DSv4 aggregate and disaggregate recipes branch" to "[wip] add DSv4 aggregate and disaggregate recipes branch" on Apr 26, 2026
ishandhanani and others added 11 commits April 26, 2026 18:17
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0 only works when
SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 is also set. Without it, the DeepEP
buffer is too small for cuda-graph-max-bs=1024/2048 and graph capture
hits the deep_ep.cpp:1233 assertion.

Add the full mega_moe env block to all three *-mega-moe.yaml,
plus SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 only on single-node sides.
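A sketch of the env pairing this commit describes; the values come from the commit message, while the surrounding layout is assumed:

```yaml
# Hypothetical sketch only: layout is an assumption, values are from the commit message.
env:
  SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1"                 # must be set for the override below to work
  SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0"   # without mega_moe the DeepEP buffer is too
                                                        # small for cuda-graph-max-bs=1024/2048
```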
* Add DSV4 Pro GB300 high-throughput recipe

* fix wideep oom

* optimize perf
@ishandhanani ishandhanani changed the title from "[wip] add DSv4 aggregate and disaggregate recipes branch" to "[NOT FINAL] add wip DSv4 aggregate and disaggregate recipes" on Apr 28, 2026
@ishandhanani ishandhanani merged commit 1d665f8 into main Apr 28, 2026
6 checks passed