[diffusion] pipeline: lightweight warmup, denoising stage only, 1-step#14410
tom-jerr wants to merge 7 commits into sgl-project:main
Conversation
Summary of Changes: Hello @tom-jerr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a lightweight, single-step warmup mechanism for the denoising stages of diffusion pipelines. The primary goal is to enhance performance, particularly when warmup is combined with torch.compile.
Code Review
This pull request introduces a lightweight, 1-step warmup for the denoising stage, controlled by a new --enable-warmup flag. This is implemented in DenoisingStage and DmdDenoisingStage. The changes also include a refactoring of torch.compile logic into a reusable helper method and improvements to argument handling for compiled functions. The benchmarks show significant performance gains when warmup is used with torch.compile.
My feedback focuses on a critical bug in causal_denoising.py that could lead to a NameError, and several instances of commented-out code that should be removed for better maintainability.
```diff
 )
 current_start_frame += 1
-remaining_frames = input_frames - 1
+input_frames -= 1
```
This change introduces a bug. The variable remaining_frames is used in the while loop at line 188, but it is no longer defined within this if block. This will cause a NameError when independent_first_frame is true and input_frames >= 1. Please revert this line to its original state to ensure remaining_frames is correctly initialized.
Suggested change:

```diff
-input_frames -= 1
+remaining_frames = input_frames - 1
```
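For illustration, here is a minimal self-contained sketch of the control flow the review describes. The function, the else branch, and the placeholder bodies are assumptions; only the variable names come from the hunk above.

```python
# Hypothetical, simplified reconstruction -- not code from the PR. Only the
# names independent_first_frame, input_frames, current_start_frame and
# remaining_frames come from the reviewed hunk; the rest is placeholder logic.
def denoise_frames(independent_first_frame: bool, input_frames: int) -> None:
    current_start_frame = 0

    if independent_first_frame and input_frames >= 1:
        # ... denoise the first frame on its own (placeholder) ...
        current_start_frame += 1
        remaining_frames = input_frames - 1  # must stay: the loop below reads it
    else:
        remaining_frames = input_frames  # placeholder for the non-independent path

    # The while loop the review refers to (around line 188). If the assignment
    # in the if branch is replaced by `input_frames -= 1`, this raises NameError.
    while remaining_frames > 0:
        # ... denoise the next block of frames (placeholder) ...
        remaining_frames -= 1
        current_start_frame += 1
```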
python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py (commented-out code flagged for removal):

```python
# if self.server_args.enable_torch_compile:
#     self.transformer = torch.compile(
#         self.transformer, mode="max-autotune", fullgraph=True
#     )
```
Force-pushed from 2d07b50 to c831e0f
Force-pushed from c831e0f to 1b2e5a7
Could you also check whether it is necessary to expand the warm-up to all stages? Thanks!
I will verify it.
Conclusion

I added warm-up to the text encoding and decoding stages, but the latency of those stages, and the end-to-end latency, increased significantly. Adding warm-up to other stages may not be a good idea.

Performance Comparison Report

Disable Torch Compile

0. Denoising Stage
```
[12-08 06:17:05] [DenoisingStage] average time per step: 0.7551 seconds
[12-08 06:17:05] [DenoisingStage] finished in 37.7647 seconds
[12-08 06:18:53] [DenoisingStage] average time per step: 0.5881 seconds
[12-08 06:18:53] [DenoisingStage] finished in 32.1070 seconds
```

1. High-level Summary
2. Stage Breakdown
Enable Torch Compile

0. Denoising Stage
```
[12-08 06:21:17] [DenoisingStage] average time per step: 1.6562 seconds
[12-08 06:21:17] [DenoisingStage] finished in 82.8159 seconds
[12-08 06:23:42] [DenoisingStage] average time per step: 0.9021 seconds
[12-08 06:23:42] [DenoisingStage] finished in 79.3291 seconds
```

1. High-level Summary
2. Stage Breakdown
We might consider not using torch compile for now.
In that case, could you run multiple tests and post the perf comparison report?
More detailed information is here, and I tested multiple times.
```python
pass
mode = os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs")
compiled_forward = torch.compile(getattr(module, "forward"), mode=mode)
setattr(module, "forward", compiled_forward)
```
You could add a log record stating which compilation mode was used.
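As a minimal sketch of what that could look like (the helper name `compile_module_forward` and the logger setup are my assumptions, not code from this PR; only the `SGLANG_TORCH_COMPILE_MODE` lookup and the forward-wrapping pattern come from the hunk above):

```python
import logging
import os

import torch

logger = logging.getLogger(__name__)  # assumed logger, not from the PR


def compile_module_forward(module: torch.nn.Module) -> None:
    """Wrap module.forward with torch.compile and log the chosen mode."""
    mode = os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs")
    logger.info(
        "Compiling %s.forward with torch.compile mode=%s",
        type(module).__name__, mode,
    )
    compiled_forward = torch.compile(module.forward, mode=mode)
    setattr(module, "forward", compiled_forward)
```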
I'll do this and fix the bug below.
```diff
 )
 current_start_frame += 1
-remaining_frames = input_frames - 1
+input_frames -= 1
```
Force-pushed from fd2d0f7 to a23eaf2
Update enable torch.compile and fix denoisingdmd bug
Signed-off-by: lzy <tomlzy213@gmail.com>
Force-pushed from a23eaf2 to bd917da
/gemini please carefully review this PR; it aims to do a complete and thorough warmup so that during the actual forward pass there is no compilation or warmup at all.
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
I found that PR #15773 warms up by running multiple generation rounds before the actual generation. Does that method conflict with the denoising-stage lightweight warm-up?
I'll reach out to you on WeChat.
@tom-jerr I've made some updates to this PR. To sum up, now we insert an identical warmup req with …
/tag-and-rerun-ci
I think I understand your point now, and I'll rerun the tests based on these changes. I've discussed this with fsygd, and I'm wondering whether the current implementation can be orthogonal to PR #15773, or if we should discuss this further.
Duplicate of #16213.
Motivation
lightweight warmup, denoising stage only, 1-step #13692
Modifications
New parameter control:
- Added the --enable-warmup parameter in server_args.py.
- The default value is False; users can explicitly enable the warmup function via this parameter.

Warmup logic for the denoising stages:
- A _warmup method has been implemented in the following denoising stage classes and is conditionally invoked before the forward pass based on the parameter setting (a rough sketch of this hook follows below):
  - DenoisingStage (denoising.py)
  - DmdDenoisingStage (denoising_dmd.py)
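As a rough illustration only (not the PR's actual implementation): a minimal sketch of a conditional one-step warmup hook. Only enable_warmup, _warmup, and the stage class name come from this PR; the dummy-input construction, the transformer call signature, and the batch layout are assumptions for illustration.

```python
import torch


class DenoisingStage:
    """Simplified stand-in for the real stage; only the warmup path is sketched."""

    def __init__(self, transformer, server_args):
        self.transformer = transformer
        self.server_args = server_args
        self._warmed_up = False

    @torch.no_grad()
    def _warmup(self, batch):
        # Run a single denoising step on inputs shaped like the real request so
        # that lazy initialization (and torch.compile, if enabled) happens here
        # instead of inside the first user-visible step. The "latents" key and
        # the (latents, timestep) call signature are placeholders.
        dummy_latents = torch.zeros_like(batch["latents"])
        dummy_timestep = torch.zeros(
            dummy_latents.shape[0], device=dummy_latents.device, dtype=torch.long
        )
        self.transformer(dummy_latents, dummy_timestep)
        self._warmed_up = True

    def forward(self, batch):
        if self.server_args.enable_warmup and not self._warmed_up:
            self._warmup(batch)
        # ... actual multi-step denoising loop goes here ...
```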
I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance.
I think that warm-up reduces the average execution time of each subsequent step, but since the denoising stage has too few steps (3 steps), the cost of adding a warm-up outweighs the benefits it brings.
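As a back-of-the-envelope criterion (the symbols are mine, not measurements from this PR): if the denoising stage runs N steps, the warmup itself costs t_warmup, and warming up saves Δt per subsequent step, then the extra warmup step only pays off when

$$ t_{\text{warmup}} < N \cdot \Delta t $$

so with N = 3 the per-step saving has to exceed roughly a third of the warmup cost, which is why warmup alone hurts this model.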
Enable torch compile:
Following #13641, we simply use the max-autotune-no-cudagraphs compile mode.
Benchmarking and Profiling
All tests were performed on an H100. I fixed the code to enable torch.compile (most of the code is from #13641).
For now, I have only tested the baseline and warm-up configurations.
Qwen/Qwen-Image Results
0. Denoising Stage Result
1. High-level Summary
2. Stage Breakdown
FastVideo / FastWan2.1-T2V-1.3B-Diffusers Results
0. Denoising Stage Result
1. High-level Summary
2. Stage Breakdown
Checklist