adapt bench_one_batch on dp-attention #7169
Conversation
Summary of Changes
Hello @zyksir, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses compatibility and correctness issues within the bench_one_batch.py script when data parallel attention (`enable_dp_attention`) is enabled. It updates the script to use the correct API for preparing DP attention batches and adjusts batch size calculations to reflect the total effective batch size across DP ranks, improving the accuracy and fairness of benchmark results and enabling CUDA graph usage for decode steps.
Highlights
- **Fix `prepare_dp_attn_batch_raw` API usage**: Updated the arguments passed to `prepare_dp_attn_batch_raw` in `bench_one_batch.py` to match recent API changes, resolving an error when running benchmarks with data parallel attention enabled.
- **Correct batch size handling for DP**: Modified the `bench_one_batch.py` script to correctly account for the total batch size (`batch_size * dp_size`) when calculating throughput, logging, profiling filenames, and setting `cuda_graph_max_bs`. This ensures fair comparisons and proper CUDA graph usage during latency tests with data parallel attention.
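The scaling described in the second highlight can be sketched as a tiny helper (an illustration only; the real script inlines this arithmetic, and the helper name is hypothetical):

```python
def effective_batch_size(per_rank_batch_size: int, dp_size: int) -> int:
    """Total batch size across all DP ranks.

    With DP attention, each of the dp_size ranks runs its own batch
    concurrently, so the benchmark must report throughput and sizes for
    per_rank_batch_size * dp_size sequences, not per_rank_batch_size.
    """
    return per_rank_batch_size * dp_size

# e.g. 4 DP ranks each decoding a batch of 8 -> 32 sequences in flight
print(effective_batch_size(8, 4))  # -> 32
```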
Code Review
The code changes adapt the bench_one_batch script to correctly handle data parallel (DP) attention. This includes updating the arguments of prepare_dp_attn_batch_raw, modifying the batch size calculation to account for dp_size, and ensuring that latency tests use the correct batch size and throughput calculations when DP attention is enabled. These modifications aim to provide a more accurate and fair comparison of performance with and without DP attention.
```diff
     log_decode_step,
     profile,
     profile_filename_prefix,
+    dp_size,
```
```diff
     server_args.cuda_graph_max_bs = max(bench_args.batch_size)
+    if server_args.enable_dp_attention:
+        sub_batch_size = []
+        for i in range(len(bench_args.batch_size)):
+            assert bench_args.batch_size[i] % server_args.dp_size == 0
+            sub_batch_size.append(bench_args.batch_size[i] // server_args.dp_size)
+        bench_args.batch_size = tuple(sub_batch_size)
```
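The splitting logic in the hunk above can be expressed as a standalone helper (a sketch; the actual script mutates `bench_args.batch_size` in place, and this function name is hypothetical):

```python
def split_batch_sizes(batch_sizes: tuple, dp_size: int) -> tuple:
    """Divide each global batch size evenly across DP ranks.

    Each DP rank runs batch_size // dp_size requests, so every global
    batch size must be a multiple of dp_size.
    """
    for bs in batch_sizes:
        if bs % dp_size != 0:
            raise ValueError(
                f"batch size {bs} is not divisible by dp_size={dp_size}"
            )
    return tuple(bs // dp_size for bs in batch_sizes)

print(split_batch_sizes((8, 16, 32), 4))  # -> (2, 4, 8)
```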
```diff
+    enable_two_batch_overlap=model_runner.server_args.enable_two_batch_overlap,
+    enable_deepep_moe=model_runner.server_args.enable_deepep_moe,
+    deepep_mode=DeepEPMode[model_runner.server_args.deepep_mode],
```
```diff
     measurement_results = {
         "run_name": run_name,
-        "batch_size": batch_size,
+        "batch_size": batch_size * dp_size,
```
```diff
     prefill_latency = time.perf_counter() - tic
     tot_latency += prefill_latency
-    throughput = input_len * batch_size / prefill_latency
+    throughput = input_len * batch_size * dp_size / prefill_latency
```
```diff
     latency = time.perf_counter() - tic
     tot_latency += latency
-    throughput = batch_size / latency
+    throughput = batch_size * dp_size / latency
```
```diff
     if profile:
         profiler.stop()
-        profile_filename = f"{profile_filename_prefix}_batch{batch_size}_input{input_len}_output{output_len}.trace.json.gz"
+        profile_filename = f"{profile_filename_prefix}_batch{batch_size * dp_size}_input{input_len}_output{output_len}.trace.json.gz"
```
```diff
     measurement_results["median_decode_throughput"] = med_decode_throughput
```
```diff
-    throughput = (input_len + output_len) * batch_size / tot_latency
+    throughput = (input_len + output_len) * batch_size * dp_size / tot_latency
```
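Taken together, the throughput corrections in the hunks above amount to multiplying each per-rank token count by `dp_size`. A minimal sketch (hypothetical helper names; the real script computes these inline):

```python
def prefill_throughput(input_len: int, batch_size: int, dp_size: int,
                       latency: float) -> float:
    # tokens prefilled across all DP ranks, per second
    return input_len * batch_size * dp_size / latency

def decode_throughput(batch_size: int, dp_size: int, latency: float) -> float:
    # one token per sequence per decode step, across all DP ranks
    return batch_size * dp_size / latency

def overall_throughput(input_len: int, output_len: int, batch_size: int,
                       dp_size: int, tot_latency: float) -> float:
    # all prefill + decode tokens over the whole run
    return (input_len + output_len) * batch_size * dp_size / tot_latency

print(prefill_throughput(128, 8, 4, 2.0))       # -> 2048.0
print(decode_throughput(8, 4, 0.5))             # -> 64.0
print(overall_throughput(128, 128, 8, 4, 4.0))  # -> 2048.0
```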
```diff
     log_decode_step=0,
     profile=False,
     profile_filename_prefix="",  # not used
+    dp_size=1 if not server_args.enable_dp_attention else server_args.dp_size,
```
```diff
     bench_args.log_decode_step,
     bench_args.profile if tp_rank == 0 else None,
     bench_args.profile_filename_prefix,
+    1 if not server_args.enable_dp_attention else server_args.dp_size,
```
This is very useful. Can this be merged? cc @zhyncs
Motivation
Previously, this PR made bench_one_batch support `enable_dp_attention` in the correctness test. Recently, bench_one_batch developed some issues with dp-attention:

- `prepare_dp_attn_batch_raw` was changed, while the arguments in `bench_one_batch.py` were not. This causes an error if we run `bench_one_batch.py` with dp-attention. cc @f
- `cuda_graph_max_bs = max(batch_size)`, while the actual batch size is `dp_size * batch_size`, since we are running `dp_size` batches at the same time. This causes the actual batch size to be larger than `cuda_graph_max_bs`.

Modifications

- Fix the arguments passed to `prepare_dp_attn_batch_raw`.
- Account for `dp_size` in the batch size and throughput calculations, so that the comparison between `--tp` and `--tp --dp --enable-dp-attention` is fair.

This is the command I used. The following result does not mean that dp-attention is worse, since bench_one_batch does not exercise dp-attention's advantage of saving KV cache. But it does point to some future work if we want to improve dp-attention.
This is the result of tp4 on H100:


This is the result of tp4 and dp4 on H100:
By adding `--profile`, we can see that the drop in performance comes from `qkv_proj`:

This is qkv_proj in dp-attention.

This is qkv_proj in baseline.

Checklist