[NOT FINAL] add wip DSv4 aggregate and disaggregate recipes #85
Merged
ishandhanani merged 21 commits into main on Apr 28, 2026
Conversation
Adds the dynamo + NIXL disaggregated counterpart to the existing `gb300-fp4/1k1k-dsv4/agg-*` recipes: 1 prefill node + 1 decode node, both TP=4 on a single GB300, MXFP4 MoE kernels, chunked prefill at 4096. Uses the same DSv4-Pro checkpoint and `dsv4-grace-blackwell` container as the agg recipes; the nginx fan-in container is pulled from Docker Hub via enroot.

`benchmark.type` is `manual`, so the recipe brings the disagg server up and stops there. Once the server is healthy, pair it with sa-bench (custom_tokenizer `sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` plus the chat template). The README is updated with a `Disaggregated` table to keep the existing agg matrix intact.

Made-with: Cursor
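For orientation, a minimal sketch of the shape such a recipe might take. Apart from `benchmark.type: manual` (stated above), every key name here is illustrative, not the actual srtctl schema:

```yaml
# Hypothetical sketch of the 1P1D disagg recipe; only benchmark.type is
# confirmed by this PR -- the other key names are assumptions.
container: dsv4-grace-blackwell
topology:
  prefill:
    nodes: 1
    tp: 4                   # TP=4 on a single GB300
  decode:
    nodes: 1
    tp: 4
engine:
  moe_kernels: mxfp4        # MXFP4 MoE kernels
  chunked_prefill_size: 4096
benchmark:
  type: manual              # bring the server up and stop; drive it with sa-bench
```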
Builds on the existing 1P1D TP=4 disagg recipe by adding four more points along the disagg topology curve, all sharing the same dynamo + NIXL frontend and the `dsv4-grace-blackwell` container:

- disagg-1p1d-dep4-mega-moe.yaml (2 nodes, 8 GPUs; both DEP=4)
- disagg-1p2d-dep4-to-dep8-mega-moe.yaml (3 nodes, 12 GPUs; P DEP=4, D DEP=8)
- disagg-2p2d-dep8-mega-moe.yaml (4 nodes, 16 GPUs; both DEP=8)
- disagg-2p2d-tp8-mxfp4.yaml (4 nodes, 16 GPUs; both TP=8, MXFP4)

The DEP recipes use TP + DP + DP-attention + DeepEP (mega_moe / DeepGEMM), mirroring the agg-balanced-tep / agg-max-tpt-tep topology but split across prefill and decode roles. The multi-node decode recipes intentionally do NOT set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2, because CAR_V2 is single-node only and silently corrupts results across nodes.

Also tightens the existing disagg-1p1d-tp4-mxfp4.yaml: switches from `benchmark.type: manual` to a low-latency sa-bench sweep (concurrency 4..128) and adds the same mrr / cgmb / mfs knobs as the new recipes for reproducibility.

The README gains:

- a prominent NIXL state-buffer-fix prerequisite warning (upstream sglang PR pending) so reviewers know what container behaviour the recipes assume,
- an XPYD = nodes (not instances) clarification,
- a verified-throughput table from sa-bench runs at isl=osl=1024.

Headline: the asymmetric 1P2D DEP4->DEP8 config delivers the highest per-GPU total token throughput (5,572 TPS/GPU at conc=2048). At 1k/1k the workload is decode-heavy, so doubling the decode EP domain (4 -> 8 GPUs) buys far more than scaling prefill.

The recipes are intentionally clean of local mounts / debug paths; pick up the required nixl/conn.py state-buffer-transfer fix via the container build process until the upstream sglang fix lands.

Made-with: Cursor
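As a rough illustration of why the asymmetric config wins, a hypothetical sketch of the 1P2D DEP4->DEP8 topology (the key names are made up; only the shape follows from the description above):

```yaml
# Hypothetical topology sketch for disagg-1p2d-dep4-to-dep8-mega-moe.yaml.
prefill:
  nodes: 1
  dep: 4          # a 4-GPU expert-parallel domain is enough at 1k/1k
decode:
  nodes: 2
  dep: 8          # decode-heavy workload: doubling the EP domain here
                  # buys far more throughput than scaling prefill
# Deliberately absent on the multi-node decode side:
#   SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 -- CAR_V2 is single-node only
#   and silently corrupts results across nodes.
```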
… into recipes/dsv4-agg-disagg
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0 only works when SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 is also set. Without it, the DeepEP buffer is too small for cuda-graph-max-bs=1024/2048 and graph capture hits the deep_ep.cpp:1233 assertion. This adds the full mega_moe env block to all three *-mega-moe.yaml files, plus SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 on single-node sides only.
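A sketch of the env coupling this describes, assuming a flat `envs:` map in the recipe (the two SGLANG variables and their pairing come from the comment above; the surrounding layout is illustrative):

```yaml
# mega_moe env block: the dispatch-tokens override is only valid when the
# DeepGEMM mega_moe path is enabled; otherwise the DeepEP buffer is sized
# too small for cuda-graph-max-bs=1024/2048 and capture asserts.
envs:
  SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1"
  SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0"
  # Single-node sides only; never set this on multi-node decode:
  SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"
```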
* Add DSV4 Pro GB300 high-throughput recipe
* fix wideep oom
* optimize perf
These are a work in progress; don't grab them and try to run them out of the box... that would be stupid.
Summary
Creates an upstream NVIDIA/srt-slurm branch that contains the already-merged aggregate DeepSeek-V4 recipe work plus the disaggregated recipe work from fork PR #75.

Because #70 has already merged the aggregate recipes into main, this PR diff is intentionally focused on the remaining disaggregated additions: recipes/gb300-fp4/1k1k-dsv4/.

This is intended to supersede the fork-based disagg PR #75. The aggregate recipe portion is already in main via #70, so it is present on this branch but does not reappear in the diff.

Validation

- `uv run srtctl dry-run -f recipes/gb300-fp4/1k1k-dsv4/disagg-*.yaml`
- `make check`