
adapt bench_one_batch on dp-attention#7169

Open
zyksir wants to merge 1 commit intosgl-project:mainfrom
zyksir:fix_bench_one_batch_on_dp_attention

Conversation


@zyksir zyksir commented Jun 14, 2025

Motivation

Previously, a PR made bench_one_batch support enable_dp_attention in the correctness test. Recently, bench_one_batch developed some issues with dp-attention:

  1. The API of prepare_dp_attn_batch_raw changed, but the call site in bench_one_batch.py was not updated. This causes an error if we run bench_one_batch.py with dp-attention. cc @f
  2. If we run the latency test, decode does not use CUDA graphs. This is because we set cuda_graph_max_bs=max(batch_size), while the actual batch size is dp_size * batch_size, since we run dp_size batches at the same time. This makes the actual batch size larger than cuda_graph_max_bs.
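To make the second issue concrete, here is a minimal sketch (a hypothetical helper, not actual sglang code) of the size check that makes decode fall back to eager mode:

```python
def can_use_cuda_graph(per_rank_batch_size, dp_size, cuda_graph_max_bs):
    """With dp-attention, dp_size ranks each run a batch concurrently,
    so the effective global batch is per_rank_batch_size * dp_size.
    A captured graph is only reusable when that fits under the cap."""
    return per_rank_batch_size * dp_size <= cuda_graph_max_bs

# cuda_graph_max_bs = max(batch_size) = 32, but with dp_size = 4 the
# global batch is 128, so decode silently runs without CUDA graphs:
print(can_use_cuda_graph(32, 4, 32))  # False -> eager decode
print(can_use_cuda_graph(8, 4, 32))   # True  -> graph replay
```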

Modifications

  1. Change the arguments passed to prepare_dp_attn_batch_raw to match the new API.
  2. Adjust the batch size so that the comparison between --tp and --tp --dp --enable-dp-attention is fair.
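The second modification can be sketched as a standalone approximation (not the literal bench_one_batch.py code): each requested global batch size is split evenly across DP ranks.

```python
def split_batch_sizes(batch_sizes, dp_size):
    """Divide each requested global batch size evenly across DP ranks.

    With dp-attention, every DP rank runs its own sub-batch, so the
    per-rank size is the global size divided by dp_size; the global
    size must therefore be divisible by dp_size.
    """
    sub_sizes = []
    for bs in batch_sizes:
        assert bs % dp_size == 0, f"batch size {bs} not divisible by dp_size={dp_size}"
        sub_sizes.append(bs // dp_size)
    return tuple(sub_sizes)

# A global batch of 32 on dp_size=4 means each rank runs 8 sequences,
# while cuda_graph_max_bs stays at the global max so graphs still apply:
print(split_batch_sizes((32, 64), 4))  # (8, 16)
```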

These are the commands I used. The results below do not mean that dp-attention is worse: bench_one_batch does not exercise its main advantage, saving KV cache. But they do point to some future work if we want to improve dp-attention.

python3 -m sglang.bench_one_batch --model Qwen/Qwen3-30B-A3B --batch 32 --input-len 256 --output-len 32 --tp 4
python3 -m sglang.bench_one_batch --model Qwen/Qwen3-30B-A3B --batch 32 --input-len 256 --output-len 32 --tp 4 --dp 4 --enable-dp-attention 

This is the result of tp4 on H100:
[screenshot: tp4 benchmark output]
This is the result of tp4 and dp4 on H100:
[screenshot: tp4 + dp4 benchmark output]

By adding --profile, we can see that the performance drop comes from:

  1. Additional memory-bound kernels added before and after the all-reduce.
    [screenshot: profiler trace of the extra kernels]
  2. The GEMMs in the attention layers are slower since the weights are larger.
    This is qkv_proj with dp-attention:
[screenshot: qkv_proj kernel timing with dp-attention]

This is qkv_proj in the baseline:
[screenshot: qkv_proj kernel timing in the baseline]

Checklist

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @zyksir, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses compatibility and correctness issues in the bench_one_batch.py script when data parallel (DP) attention (enable_dp_attention) is enabled. It updates the script to use the current API for preparing DP attention batches and adjusts batch size calculations to reflect the total effective batch size across DP ranks, improving the accuracy and fairness of benchmark results and enabling CUDA graph usage for decode steps.

Highlights

  • Fix prepare_dp_attn_batch_raw API usage: Updated the arguments passed to prepare_dp_attn_batch_raw in bench_one_batch.py to match recent API changes, resolving an error when running benchmarks with data parallel attention enabled.
  • Correct batch size handling for DP: Modified bench_one_batch.py to account for the total batch size (batch_size * dp_size) when calculating throughput, logging, naming profiling files, and setting cuda_graph_max_bs. This ensures fair comparisons and proper CUDA graph usage during latency tests with data parallel attention.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes adapt the bench_one_batch script to correctly handle data parallel (DP) attention. This includes updating the arguments of prepare_dp_attn_batch_raw, modifying the batch size calculation to account for dp_size, and ensuring that latency tests use the correct batch size and throughput calculations when DP attention is enabled. These modifications aim to provide a more accurate and fair comparison of performance with and without DP attention.

  log_decode_step,
  profile,
  profile_filename_prefix,
+ dp_size,

medium

It's good to see dp_size being passed to latency_test_run_once. This ensures the correct batch size is used in the calculations.

Comment on lines 504 to +517
  server_args.cuda_graph_max_bs = max(bench_args.batch_size)
+ if server_args.enable_dp_attention:
+     sub_batch_size = []
+     for i in range(len(bench_args.batch_size)):
+         assert bench_args.batch_size[i] % server_args.dp_size == 0
+         sub_batch_size.append(bench_args.batch_size[i] // server_args.dp_size)
+     bench_args.batch_size = tuple(sub_batch_size)

medium

This logic correctly adjusts the batch size when DP attention is enabled, ensuring that the comparison between TP and TP+DP is fair.

Comment on lines +279 to +281
+ enable_two_batch_overlap=model_runner.server_args.enable_two_batch_overlap,
+ enable_deepep_moe=model_runner.server_args.enable_deepep_moe,
+ deepep_mode=DeepEPMode[model_runner.server_args.deepep_mode],

medium

These parameters are added to prepare_dp_attn_batch_raw to enable more flexible control over the distributed execution. It's good to see these being included.

  measurement_results = {
      "run_name": run_name,
-     "batch_size": batch_size,
+     "batch_size": batch_size * dp_size,

medium

Multiplying batch_size by dp_size here correctly reflects the actual number of samples being processed in the distributed setting.

  prefill_latency = time.perf_counter() - tic
  tot_latency += prefill_latency
- throughput = input_len * batch_size / prefill_latency
+ throughput = input_len * batch_size * dp_size / prefill_latency

medium

The throughput calculation now correctly accounts for the dp_size, providing a more accurate measure of performance in the distributed attention setting.
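The dp_size factor in the prefill throughput formula can be checked with a tiny standalone sketch (illustrative numbers; `prefill_throughput` is a hypothetical helper, not sglang code):

```python
def prefill_throughput(input_len, batch_size, dp_size, prefill_latency):
    """Tokens/s across all DP ranks: each rank prefills its own batch of
    `batch_size` sequences concurrently, so the total work per prefill
    step is input_len * batch_size * dp_size tokens."""
    return input_len * batch_size * dp_size / prefill_latency

# A per-rank batch of 8 on 4 DP ranks is a global batch of 32:
tp = prefill_throughput(input_len=256, batch_size=8, dp_size=4, prefill_latency=0.5)
print(tp)  # 16384.0 tokens/s (4x what a single rank's batch would report)
```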

  latency = time.perf_counter() - tic
  tot_latency += latency
- throughput = batch_size / latency
+ throughput = batch_size * dp_size / latency

medium

The throughput calculation here correctly accounts for the dp_size during the decode step.

  if profile:
      profiler.stop()
-     profile_filename = f"{profile_filename_prefix}_batch{batch_size}_input{input_len}_output{output_len}.trace.json.gz"
+     profile_filename = f"{profile_filename_prefix}_batch{batch_size * dp_size}_input{input_len}_output{output_len}.trace.json.gz"

medium

The profiling filename now includes the dp_size in the batch size, which is helpful for distinguishing profiling results in DP attention scenarios.

  measurement_results["median_decode_throughput"] = med_decode_throughput

- throughput = (input_len + output_len) * batch_size / tot_latency
+ throughput = (input_len + output_len) * batch_size * dp_size / tot_latency

medium

The throughput calculation now correctly accounts for the dp_size in the total throughput calculation.

  log_decode_step=0,
  profile=False,
  profile_filename_prefix="",  # not used
+ dp_size=1 if not server_args.enable_dp_attention else server_args.dp_size,

medium

Passing dp_size to latency_test_run_once ensures that the warmup phase also uses the correct batch size.

  bench_args.log_decode_step,
  bench_args.profile if tp_rank == 0 else None,
  bench_args.profile_filename_prefix,
+ 1 if not server_args.enable_dp_attention else server_args.dp_size,

medium

Passing dp_size to latency_test_run_once ensures that the benchmark phase also uses the correct batch size.

@yhyang201
Collaborator

This is very useful. Can this be merged? cc @zhyncs
