
[diffusion] pipeline: lightweight warmup, denoising stage only, 1-step #14410

Closed

tom-jerr wants to merge 7 commits into sgl-project:main from tom-jerr:diffusion-warmup-new

Conversation

@tom-jerr (Contributor) commented Dec 4, 2025

Motivation

Lightweight warmup, denoising stage only, 1-step (#13692).

Modifications

New Parameter Control:

  • Added the --enable-warmup parameter in server_args.py (a sketch of the wiring follows below).

  • The default value is False; users can explicitly enable warmup via this parameter.
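As a rough illustration only, here is a minimal sketch of how such a flag is commonly wired into SGLang-style server args. The dataclass-plus-add_cli_args pattern mirrors server_args.py conventions; everything beyond the --enable-warmup name itself is an assumption, not this PR's actual code.

```python
# Sketch only: wiring an --enable-warmup flag, dataclass + argparse style.
# Only the flag name comes from this PR; the surrounding structure is assumed.
import argparse
from dataclasses import dataclass


@dataclass
class ServerArgs:
    enable_warmup: bool = False  # disabled by default, per this PR

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> None:
        parser.add_argument(
            "--enable-warmup",
            action="store_true",
            help="Run a lightweight 1-step warmup of the denoising stage "
            "before serving real requests.",
        )
```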

Implemented Warmup Logic for Denoising Stages:

  • A _warmup method has been implemented in the core denoising stage classes listed below. It is conditionally invoked before the forward pass, based on the parameter setting (see the sketch after this list).

    • DenoisingStage (denoising.py)

    • DmdDenoisingStage (denoising_dmd.py)
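A minimal sketch of the 1-step warmup idea, under assumptions about the stage interface: the tensor shapes, batch fields, and transformer call signature below are placeholders, not the PR's actual code.

```python
import torch


class DenoisingStage:
    # __init__ is elided; assume it sets self.transformer and
    # self._warmed_up = False.

    @torch.no_grad()
    def _warmup(self, batch) -> None:
        # Run a single denoising step on dummy latents so that torch.compile
        # tracing and kernel autotuning happen before the first real request.
        latents = torch.randn_like(batch.latents)
        timestep = torch.zeros(
            latents.shape[0], dtype=torch.long, device=latents.device
        )
        self.transformer(
            hidden_states=latents,
            timestep=timestep,
            encoder_hidden_states=batch.prompt_embeds,
        )
        torch.cuda.synchronize()  # ensure compilation work finishes here

    def forward(self, batch, server_args):
        if server_args.enable_warmup and not self._warmed_up:
            self._warmup(batch)
            self._warmed_up = True
        # ... the actual multi-step denoising loop runs here ...
```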

⚠️ Attention:
I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance.

I think warm-up reduces the average execution time of each subsequent step, but because the denoising stage has so few steps (3), the cost of the warm-up pass outweighs the savings it brings.

Enable torch compile:
Following #13641, I simply use the max-autotune-no-cudagraphs compile mode (a sketch of the helper follows below).
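Based on the diff quoted later in this thread (and the reviewer's suggestion to log the chosen mode), the reusable compile helper roughly does the following. The env-var name and default mode come from the PR's diff; the function body around them is a sketch.

```python
import logging
import os

import torch

logger = logging.getLogger(__name__)


def torch_compile_module(module: torch.nn.Module) -> torch.nn.Module:
    # SGLANG_TORCH_COMPILE_MODE and the max-autotune-no-cudagraphs default
    # match the diff in this PR; everything else here is illustrative.
    mode = os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs")
    logger.info("Compiling %s with torch.compile mode=%s", type(module).__name__, mode)
    compiled_forward = torch.compile(getattr(module, "forward"), mode=mode)
    setattr(module, "forward", compiled_forward)
    return module
```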

Benchmarking and Profiling

All tests were performed on an H100. I adapted the code to enable torch.compile (most of it comes from #13641).

For now I have only compared the baseline against warm-up.

Qwen/Qwen-Image Results

0. Denoising Stage Result

|  | Baseline | Warm Up |
| --- | --- | --- |
| Time per denoising step (torch profile) | 727 ms | 485 ms |
| E2E latency per denoising step | 1.0839 s | 0.7023 s |
(Torch profiler screenshots for Baseline and Warm Up omitted.)

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 64487.07 ms | 44302.05 ms | -20185.03 ms (-31.3%) |
| Throughput | 0.02 req/s | 0.02 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | Warm Up (ms) | Diff (ms) | Diff (%) |
| --- | --- | --- | --- | --- |
| InputValidationStage | 0.24 | 0.06 | -0.18 | -74.8% |
| TextEncodingStage | 1484.17 | 1537.20 | +53.03 | +3.6% |
| ConditioningStage | 0.04 | 0.01 | -0.03 | -70.5% |
| TimestepPreparationStage | 59.99 | 25.27 | -34.72 | -57.9% |
| LatentPreparationStage | 0.58 | 0.26 | -0.32 | -54.7% |
| DenoisingStage | 62248.00 | 41786.81 | -20461.18 | -32.9% |
| DecodingStage | 669.70 | 942.67 | +272.97 | +40.8% |

FastVideo / FastWan2.1-T2V-1.3B-Diffusers Results

As noted above, warm-up reduces the average execution time of each subsequent step, but with only 3 denoising steps the cost of the warm-up pass outweighs the savings.

0. Denoising Stage Result

|  | Baseline | Warm Up |
| --- | --- | --- |
| E2E latency per denoising step | 2.2530 s | 1.3381 s |
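A rough back-of-the-envelope from these numbers: warm-up saves about 2.2530 - 1.3381 ≈ 0.915 s per step, i.e. ≈2.74 s across the 3 denoising steps, yet the DenoisingStage total in the breakdown below grows by ≈0.95 s. That implies the warm-up pass itself costs roughly 2.74 + 0.95 ≈ 3.7 s, more than it saves at this step count.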

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 15062.46 ms | 15955.15 ms | +892.69 ms (+5.9%) |
| Throughput | 0.07 req/s | 0.06 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | Warm Up (ms) | Diff (ms) | Diff (%) |
| --- | --- | --- | --- | --- |
| InputValidationStage | 0.05 | 0.06 | +0.01 | +16.1% |
| TextEncodingStage | 4445.95 | 4705.35 | +259.40 | +5.8% |
| ConditioningStage | 0.02 | 0.02 | +0.00 | +30.2% |
| TimestepPreparationStage | 32.56 | 29.05 | -3.52 | -10.8% |
| LatentPreparationStage | 0.88 | 0.97 | +0.10 | +11.2% |
| DenoisingStage | 6123.29 | 7070.25 | +946.96 | +15.5% |
| DecodingStage | 4452.30 | 4141.41 | -310.89 | -7.0% |

Checklist

@github-actions bot added the diffusion (SGLang Diffusion) label Dec 4, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @tom-jerr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a lightweight, single-step warmup mechanism for the denoising stages of diffusion pipelines. The primary goal is to enhance performance, particularly when leveraging torch.compile, by preparing the model for efficient execution. The changes include a new configuration parameter to enable this feature and the integration of the warmup logic into core denoising components, demonstrating notable latency improvements in various benchmarks.

Highlights

  • New Warmup Feature: Introduced a new --enable-warmup command-line parameter and a corresponding enable_warmup flag in ServerArgs to control a lightweight, 1-step warmup mechanism for diffusion pipelines. This feature is disabled by default.
  • Denoising Stage Warmup: Implemented a _warmup method within the DenoisingStage and DmdDenoisingStage classes. This method performs a single-step forward pass to pre-compile or optimize the model's execution graph, specifically targeting the denoising process.
  • Torch Compile Integration Refinement: Refactored the torch.compile logic into a reusable torch_compile_module method, which also attempts to enable an inductor overlap tweak for potential further optimization. This change streamlines how modules are compiled and ensures consistent application of torch.compile settings.
  • Performance Improvements: Benchmarking shows significant end-to-end latency reductions (up to 49.2% for FastHunyuan-diffusers) when the warmup is enabled in conjunction with torch.compile. However, enabling warmup without torch.compile can lead to performance degradation for some models.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a lightweight, 1-step warmup for the denoising stage, controlled by a new --enable-warmup flag. This is implemented in DenoisingStage and DmdDenoisingStage. The changes also include a refactoring of torch.compile logic into a reusable helper method and improvements to argument handling for compiled functions. The benchmarks show significant performance gains when warmup is used with torch.compile.

My feedback focuses on a critical bug in causal_denoising.py that could lead to a NameError, and several instances of commented-out code that should be removed for better maintainability.

```diff
 )
 current_start_frame += 1
-remaining_frames = input_frames - 1
+input_frames -= 1
```
critical

This change introduces a bug. The variable remaining_frames is used in the while loop at line 188, but it is no longer defined within this if block. This will cause a NameError when independent_first_frame is true and input_frames >= 1. Please revert this line to its original state to ensure remaining_frames is correctly initialized.

Suggested change:

```diff
-input_frames -= 1
+remaining_frames = input_frames - 1
```

Collaborator: It's a bug?

Collaborator: Please fix it.

Comment on lines 340 to 343:

```python
# if self.server_args.enable_torch_compile:
#     self.transformer = torch.compile(
#         self.transformer, mode="max-autotune", fullgraph=True
#     )
```
medium

This block of commented-out code should be removed to keep the codebase clean.

@tom-jerr force-pushed the diffusion-warmup-new branch 2 times, most recently from 2d07b50 to c831e0f on December 4, 2025 13:21
@tom-jerr marked this pull request as draft December 4, 2025 14:02
@tom-jerr force-pushed the diffusion-warmup-new branch from c831e0f to 1b2e5a7 on December 5, 2025 06:46
@tom-jerr marked this pull request as ready for review December 5, 2025 06:46
@mickqian (Collaborator) commented Dec 7, 2025

Could you also check whether it's necessary to expand the warm-up to all stages? Thanks!

@tom-jerr (Contributor, Author) commented Dec 7, 2025

> Could you also check whether it's necessary to expand the warm-up to all stages? Thanks!

I will verify it.

@tom-jerr (Contributor, Author) commented Dec 8, 2025

Conclusion

I added warm-up to the text encoding and decoding stages, but the latency of those stages, as well as end-to-end latency, increased significantly.

Adding warm-up to other stages may not be a good idea.

Performance Comparison Report

Performance Comparison Report (Torch Compile Disabled)

0. Denoising Stage

  • Baseline

```
[12-08 06:17:05] [DenoisingStage] average time per step: 0.7551 seconds
[12-08 06:17:05] [DenoisingStage] finished in 37.7647 seconds
```

  • Warm Up

```
[12-08 06:18:53] [DenoisingStage] average time per step: 0.5881 seconds
[12-08 06:18:53] [DenoisingStage] finished in 32.1070 seconds
```

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 39253.60 ms | 34340.59 ms | -4913.01 ms (-12.5%) |
| Throughput | 0.03 req/s | 0.03 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 0.09 | 0.06 | -0.03 | -34.3% | ⚪️ |
| TextEncodingStage | 1032.05 | 1588.60 | +556.55 | +53.9% | 🔴 |
| ConditioningStage | 0.01 | 0.01 | -0.00 | -25.4% | ⚪️ |
| TimestepPreparationStage | 31.91 | 17.48 | -14.43 | -45.2% | ⚪️ |
| LatentPreparationStage | 0.27 | 0.22 | -0.05 | -18.0% | ⚪️ |
| DenoisingStage | 37764.52 | 32106.88 | -5657.63 | -15.0% | 🟢 |
| DecodingStage | 415.12 | 620.19 | +205.07 | +49.4% | 🔴 |

Performance Comparison Report (Torch Compile Enabled)

0. Denoising Stage

  • Baseline

```
[12-08 06:21:17] [DenoisingStage] average time per step: 1.6562 seconds
[12-08 06:21:17] [DenoisingStage] finished in 82.8159 seconds
```

  • Warm Up

```
[12-08 06:23:42] [DenoisingStage] average time per step: 0.9021 seconds
[12-08 06:23:42] [DenoisingStage] finished in 79.3291 seconds
```

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 84261.16 ms | 81671.65 ms | -2589.51 ms (-3.1%) |
| Throughput | 0.01 req/s | 0.01 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 0.05 | 0.14 | +0.08 | +149.2% | ⚪️ |
| TextEncodingStage | 920.33 | 1733.51 | +813.18 | +88.4% | 🔴 |
| ConditioningStage | 0.01 | 0.01 | +0.00 | +45.9% | ⚪️ |
| TimestepPreparationStage | 14.56 | 26.71 | +12.15 | +83.4% | ⚪️ |
| LatentPreparationStage | 0.20 | 0.30 | +0.10 | +47.9% | ⚪️ |
| DenoisingStage | 82815.79 | 79328.94 | -3486.86 | -4.2% | ⚪️ |
| DecodingStage | 501.43 | 568.73 | +67.30 | +13.4% | ⚪️ |

@mickqian (Collaborator) commented Dec 9, 2025

We might be considering no torch-compile for now.

> I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance; however, enabling --enable-torch-compile

How is that? Could you run multiple tests and dump the perf comparison report?

@tom-jerr (Contributor, Author) commented Dec 11, 2025

> We might be considering no torch-compile for now.
>
> > I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance; however, enabling --enable-torch-compile
>
> How is that? Could you run multiple tests and dump the perf comparison report?

More detailed information is here, and I tested multiple times.

```python
pass
mode = os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs")
compiled_forward = torch.compile(getattr(module, "forward"), mode=mode)
setattr(module, "forward", compiled_forward)
```
Collaborator: You can add log records explaining the compilation mode used.

@tom-jerr (Contributor, Author): I'll do this and fix the bug below.

```diff
 )
 current_start_frame += 1
-remaining_frames = input_frames - 1
+input_frames -= 1
```
Collaborator: It's a bug?

Update enable torch.compile and fix denoisingdmd bug

Signed-off-by: lzy <tomlzy213@gmail.com>
@tom-jerr force-pushed the diffusion-warmup-new branch from a23eaf2 to bd917da on December 28, 2025 15:24
@mickqian (Collaborator) commented:

/gemini please carefully review this PR; it aims to do a complete and thorough warmup so that in the actual forward pass there is no compile or warmup work left at all

@gemini-code-assist (Contributor):

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@tom-jerr (Contributor, Author) commented Dec 31, 2025

I found that PR #15773 warms up by running multiple rounds of generation before actually generating.

Does that method conflict with the denoising stage lightweight warm-up? I'm a bit confused right now: which operations are we actually warming up? @mickqian @yhyang201

@fsygd (Contributor) commented Dec 31, 2025

> I found that PR #15773 warms up by running multiple rounds of generation before actually generating.
>
> Does that method conflict with the denoising stage lightweight warm-up? I'm a bit confused right now: which operations are we actually warming up? @mickqian @yhyang201

I'll reach out to you on WeChat.

@mickqian (Collaborator) commented Dec 31, 2025

@tom-jerr I've made some updates to this PR.

To sum up, we now insert an identical warmup request with --num-inference-steps=1 before the actual request.
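A minimal sketch of the approach described above; `pipeline`, `generate`, and the request field name are illustrative placeholders, not the PR's actual code.

```python
import copy


def generate_with_warmup(pipeline, req):
    # Before serving the real request, run an identical copy with the step
    # count forced to 1, so every stage executes once end to end and all
    # compilation/autotuning happens outside the measured request.
    warmup_req = copy.deepcopy(req)
    warmup_req.num_inference_steps = 1
    pipeline.generate(warmup_req)   # discard the warmup output
    return pipeline.generate(req)   # the actual request
```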

@mickqian (Collaborator) commented:

/tag-and-rerun-ci

@tom-jerr (Contributor, Author) commented:

> @tom-jerr I've made some updates to this PR.
>
> To sum up, we now insert an identical warmup request with --num-inference-steps=1 before the actual request.

I think I understand your point now, and I'll rerun the tests based on these changes.

I've discussed this with fsygd, and I'm wondering whether the current implementation can be orthogonal to PR #15773, or whether we should discuss this further.

@mickqian (Collaborator) commented:

Duplicate of #16213.

@mickqian closed this Dec 31, 2025