
adapt bench_one_batch on dp-attention#7169

Open
zyksir wants to merge 1 commit intosgl-project:mainfrom
zyksir:fix_bench_one_batch_on_dp_attention

Conversation


@zyksir zyksir commented Jun 14, 2025

Motivation

Previously, a PR made bench_one_batch support enable_dp_attention in the correctness test. Recently, bench_one_batch developed some issues with dp-attention:

  1. The API of prepare_dp_attn_batch_raw changed, but the call site in bench_one_batch.py was not updated. This causes an error if we run bench_one_batch.py with dp-attention. cc @f
  2. If we run the latency test, decode does not use CUDA graphs. This is because we set cuda_graph_max_bs=max(batch_size), while the actual batch size is dp_size * batch_size, since we run dp_size batches at the same time. This makes the actual batch size larger than cuda_graph_max_bs.
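To make the second issue concrete, here is a minimal sketch (a hypothetical helper, not actual sglang code) of the size check that makes decode fall back to eager mode:

```python
def can_use_cuda_graph(per_rank_batch_size, dp_size, cuda_graph_max_bs):
    """With dp-attention, dp_size ranks each run a batch concurrently,
    so the effective global batch is per_rank_batch_size * dp_size.
    A captured graph is only reusable when that fits under the cap."""
    return per_rank_batch_size * dp_size <= cuda_graph_max_bs

# cuda_graph_max_bs = max(batch_size) = 32, but with dp_size = 4 the
# global batch is 128, so decode silently runs without CUDA graphs:
print(can_use_cuda_graph(32, 4, 32))  # False -> eager decode
print(can_use_cuda_graph(8, 4, 32))   # True  -> graph replay
```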

Modifications

  1. Change the arguments passed to prepare_dp_attn_batch_raw to match the new API.
  2. Adjust the batch size so that the comparison between --tp and --tp --dp --enable-dp-attention is fair.
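The second modification can be sketched as a standalone approximation (not the literal bench_one_batch.py code): each requested global batch size is split evenly across DP ranks.

```python
def split_batch_sizes(batch_sizes, dp_size):
    """Divide each requested global batch size evenly across DP ranks.

    With dp-attention, every DP rank runs its own sub-batch, so the
    per-rank size is the global size divided by dp_size; the global
    size must therefore be divisible by dp_size.
    """
    sub_sizes = []
    for bs in batch_sizes:
        assert bs % dp_size == 0, f"batch size {bs} not divisible by dp_size={dp_size}"
        sub_sizes.append(bs // dp_size)
    return tuple(sub_sizes)

# A global batch of 32 on dp_size=4 means each rank runs 8 sequences,
# while cuda_graph_max_bs stays at the global max so graphs still apply:
print(split_batch_sizes((32, 64), 4))  # (8, 16)
```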

These are the commands I used. The results below do not mean that dp-attention is worse: bench_one_batch does not exercise its main advantage, saving KV cache. But they do point to some future work if we want to improve dp-attention.

python3 -m sglang.bench_one_batch --model Qwen/Qwen3-30B-A3B --batch 32 --input-len 256 --output-len 32 --tp 4
python3 -m sglang.bench_one_batch --model Qwen/Qwen3-30B-A3B --batch 32 --input-len 256 --output-len 32 --tp 4 --dp 4 --enable-dp-attention 

This is the result of tp4 on H100:
[screenshot: tp4 benchmark output]
This is the result of tp4 and dp4 on H100:
[screenshot: tp4 + dp4 benchmark output]

By adding --profile, we can see that the performance drop comes from:

  1. Additional memory-bound kernels added before and after the all-reduce.
    [screenshot: profiler trace of the extra kernels]
  2. The GEMMs in the attention layers are slower since the weights are larger.
    This is qkv_proj with dp-attention:
[screenshot: qkv_proj kernel timing with dp-attention]

This is qkv_proj in the baseline:
[screenshot: qkv_proj kernel timing in the baseline]

Checklist

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @zyksir, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses compatibility and correctness issues in the bench_one_batch.py script when data parallel (DP) attention (enable_dp_attention) is enabled. It updates the script to use the current API for preparing DP attention batches and adjusts batch size calculations to reflect the total effective batch size across DP ranks, improving the accuracy and fairness of benchmark results and enabling CUDA graph usage for decode steps.

Highlights

  • Fix prepare_dp_attn_batch_raw API usage: Updated the arguments passed to prepare_dp_attn_batch_raw in bench_one_batch.py to match recent API changes, resolving an error when running benchmarks with data parallel attention enabled.
  • Correct batch size handling for DP: Modified bench_one_batch.py to account for the total batch size (batch_size * dp_size) when calculating throughput, logging, naming profiling files, and setting cuda_graph_max_bs. This ensures fair comparisons and proper CUDA graph usage during latency tests with data parallel attention.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes adapt the bench_one_batch script to correctly handle data parallel (DP) attention. This includes updating the arguments of prepare_dp_attn_batch_raw, modifying the batch size calculation to account for dp_size, and ensuring that latency tests use the correct batch size and throughput calculations when DP attention is enabled. These modifications aim to provide a more accurate and fair comparison of performance with and without DP attention.

  log_decode_step,
  profile,
  profile_filename_prefix,
+ dp_size,

medium

It's good to see dp_size being passed to latency_test_run_once. This ensures the correct batch size is used in the calculations.

Comment on lines 504 to +517
  server_args.cuda_graph_max_bs = max(bench_args.batch_size)
+ if server_args.enable_dp_attention:
+     sub_batch_size = []
+     for i in range(len(bench_args.batch_size)):
+         assert bench_args.batch_size[i] % server_args.dp_size == 0
+         sub_batch_size.append(bench_args.batch_size[i] // server_args.dp_size)
+     bench_args.batch_size = tuple(sub_batch_size)

medium

This logic correctly adjusts the batch size when DP attention is enabled, ensuring that the comparison between TP and TP+DP is fair.

Comment on lines +279 to +281
+ enable_two_batch_overlap=model_runner.server_args.enable_two_batch_overlap,
+ enable_deepep_moe=model_runner.server_args.enable_deepep_moe,
+ deepep_mode=DeepEPMode[model_runner.server_args.deepep_mode],

medium

These parameters are added to prepare_dp_attn_batch_raw to enable more flexible control over the distributed execution. It's good to see these being included.

  measurement_results = {
      "run_name": run_name,
-     "batch_size": batch_size,
+     "batch_size": batch_size * dp_size,

medium

Multiplying batch_size by dp_size here correctly reflects the actual number of samples being processed in the distributed setting.

  prefill_latency = time.perf_counter() - tic
  tot_latency += prefill_latency
- throughput = input_len * batch_size / prefill_latency
+ throughput = input_len * batch_size * dp_size / prefill_latency

medium

The throughput calculation now correctly accounts for the dp_size, providing a more accurate measure of performance in the distributed attention setting.
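The dp_size factor in the prefill throughput formula can be checked with a tiny standalone sketch (illustrative numbers; `prefill_throughput` is a hypothetical helper, not sglang code):

```python
def prefill_throughput(input_len, batch_size, dp_size, prefill_latency):
    """Tokens/s across all DP ranks: each rank prefills its own batch of
    `batch_size` sequences concurrently, so the total work per prefill
    step is input_len * batch_size * dp_size tokens."""
    return input_len * batch_size * dp_size / prefill_latency

# A per-rank batch of 8 on 4 DP ranks is a global batch of 32:
tp = prefill_throughput(input_len=256, batch_size=8, dp_size=4, prefill_latency=0.5)
print(tp)  # 16384.0 tokens/s (4x what a single rank's batch would report)
```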

  latency = time.perf_counter() - tic
  tot_latency += latency
- throughput = batch_size / latency
+ throughput = batch_size * dp_size / latency

medium

The throughput calculation here correctly accounts for the dp_size during the decode step.

  if profile:
      profiler.stop()
-     profile_filename = f"{profile_filename_prefix}_batch{batch_size}_input{input_len}_output{output_len}.trace.json.gz"
+     profile_filename = f"{profile_filename_prefix}_batch{batch_size * dp_size}_input{input_len}_output{output_len}.trace.json.gz"

medium

The profiling filename now includes the dp_size in the batch size, which is helpful for distinguishing profiling results in DP attention scenarios.

  measurement_results["median_decode_throughput"] = med_decode_throughput

- throughput = (input_len + output_len) * batch_size / tot_latency
+ throughput = (input_len + output_len) * batch_size * dp_size / tot_latency

medium

The throughput calculation now correctly accounts for the dp_size in the total throughput calculation.

  log_decode_step=0,
  profile=False,
  profile_filename_prefix="",  # not used
+ dp_size=1 if not server_args.enable_dp_attention else server_args.dp_size,

medium

Passing dp_size to latency_test_run_once ensures that the warmup phase also uses the correct batch size.

  bench_args.log_decode_step,
  bench_args.profile if tp_rank == 0 else None,
  bench_args.profile_filename_prefix,
+ 1 if not server_args.enable_dp_attention else server_args.dp_size,

medium

Passing dp_size to latency_test_run_once ensures that the benchmark phase also uses the correct batch size.

@yhyang201
Collaborator

This is very useful. Can this be merged? cc @zhyncs
