
[diffusion] pipeline: lightweight warmup, denoising stage only, 1-step #14410

Closed

tom-jerr wants to merge 7 commits into sgl-project:main from tom-jerr:diffusion-warmup-new

Conversation

@tom-jerr (Contributor) commented Dec 4, 2025

Motivation

Lightweight warmup, denoising stage only, 1-step (#13692).

Modifications

New Parameter Control:

  • Added the --enable-warmup parameter in server_args.py (a sketch of the wiring follows below).

  • The default value is False; users can explicitly enable warmup via this parameter.
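As a rough illustration only, here is a minimal sketch of how such a flag is commonly wired into SGLang-style server args. The dataclass-plus-add_cli_args pattern mirrors server_args.py conventions; everything beyond the --enable-warmup name itself is an assumption, not this PR's actual code.

```python
# Sketch only: wiring an --enable-warmup flag, dataclass + argparse style.
# Only the flag name comes from this PR; the surrounding structure is assumed.
import argparse
from dataclasses import dataclass


@dataclass
class ServerArgs:
    enable_warmup: bool = False  # disabled by default, per this PR

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> None:
        parser.add_argument(
            "--enable-warmup",
            action="store_true",
            help="Run a lightweight 1-step warmup of the denoising stage "
            "before serving real requests.",
        )
```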

Implemented Warmup Logic for Denoising Stages:

  • A _warmup method has been implemented in the core denoising stage classes listed below. It is conditionally invoked before the forward pass, based on the parameter setting (see the sketch after this list).

    • DenoisingStage (denoising.py)

    • DmdDenoisingStage (denoising_dmd.py)
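A minimal sketch of the 1-step warmup idea, under assumptions about the stage interface: the tensor shapes, batch fields, and transformer call signature below are placeholders, not the PR's actual code.

```python
import torch


class DenoisingStage:
    # __init__ is elided; assume it sets self.transformer and
    # self._warmed_up = False.

    @torch.no_grad()
    def _warmup(self, batch) -> None:
        # Run a single denoising step on dummy latents so that torch.compile
        # tracing and kernel autotuning happen before the first real request.
        latents = torch.randn_like(batch.latents)
        timestep = torch.zeros(
            latents.shape[0], dtype=torch.long, device=latents.device
        )
        self.transformer(
            hidden_states=latents,
            timestep=timestep,
            encoder_hidden_states=batch.prompt_embeds,
        )
        torch.cuda.synchronize()  # ensure compilation work finishes here

    def forward(self, batch, server_args):
        if server_args.enable_warmup and not self._warmed_up:
            self._warmup(batch)
            self._warmed_up = True
        # ... the actual multi-step denoising loop runs here ...
```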

⚠️ Attention:
I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance.

I think warm-up reduces the average execution time of each subsequent step, but because the denoising stage has so few steps (3), the cost of the warm-up pass outweighs the savings it brings.

Enable torch compile:
Following #13641, I simply use the max-autotune-no-cudagraphs compile mode (a sketch of the helper follows below).
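Based on the diff quoted later in this thread (and the reviewer's suggestion to log the chosen mode), the reusable compile helper roughly does the following. The env-var name and default mode come from the PR's diff; the function body around them is a sketch.

```python
import logging
import os

import torch

logger = logging.getLogger(__name__)


def torch_compile_module(module: torch.nn.Module) -> torch.nn.Module:
    # SGLANG_TORCH_COMPILE_MODE and the max-autotune-no-cudagraphs default
    # match the diff in this PR; everything else here is illustrative.
    mode = os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs")
    logger.info("Compiling %s with torch.compile mode=%s", type(module).__name__, mode)
    compiled_forward = torch.compile(getattr(module, "forward"), mode=mode)
    setattr(module, "forward", compiled_forward)
    return module
```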

Benchmarking and Profiling

All tests were performed on an H100. I adapted the code to enable torch.compile (most of it comes from #13641).

For now I have only compared the baseline against warm-up.

Qwen/Qwen-Image Results

0. Denoising Stage Result

|  | Baseline | Warm Up |
| --- | --- | --- |
| Time per denoising step (torch profile) | 727 ms | 485 ms |
| E2E latency per denoising step | 1.0839 s | 0.7023 s |
(Torch profiler screenshots for Baseline and Warm Up omitted.)

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 64487.07 ms | 44302.05 ms | -20185.03 ms (-31.3%) |
| Throughput | 0.02 req/s | 0.02 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | Warm Up (ms) | Diff (ms) | Diff (%) |
| --- | --- | --- | --- | --- |
| InputValidationStage | 0.24 | 0.06 | -0.18 | -74.8% |
| TextEncodingStage | 1484.17 | 1537.20 | +53.03 | +3.6% |
| ConditioningStage | 0.04 | 0.01 | -0.03 | -70.5% |
| TimestepPreparationStage | 59.99 | 25.27 | -34.72 | -57.9% |
| LatentPreparationStage | 0.58 | 0.26 | -0.32 | -54.7% |
| DenoisingStage | 62248.00 | 41786.81 | -20461.18 | -32.9% |
| DecodingStage | 669.70 | 942.67 | +272.97 | +40.8% |

FastVideo / FastWan2.1-T2V-1.3B-Diffusers Results

As noted above, warm-up reduces the average execution time of each subsequent step, but with only 3 denoising steps the cost of the warm-up pass outweighs the savings.

0. Denoising Stage Result

|  | Baseline | Warm Up |
| --- | --- | --- |
| E2E latency per denoising step | 2.2530 s | 1.3381 s |
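A rough back-of-the-envelope from these numbers: warm-up saves about 2.2530 - 1.3381 ≈ 0.915 s per step, i.e. ≈2.74 s across the 3 denoising steps, yet the DenoisingStage total in the breakdown below grows by ≈0.95 s. That implies the warm-up pass itself costs roughly 2.74 + 0.95 ≈ 3.7 s, more than it saves at this step count.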

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 15062.46 ms | 15955.15 ms | +892.69 ms (+5.9%) |
| Throughput | 0.07 req/s | 0.06 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | Warm Up (ms) | Diff (ms) | Diff (%) |
| --- | --- | --- | --- | --- |
| InputValidationStage | 0.05 | 0.06 | +0.01 | +16.1% |
| TextEncodingStage | 4445.95 | 4705.35 | +259.40 | +5.8% |
| ConditioningStage | 0.02 | 0.02 | +0.00 | +30.2% |
| TimestepPreparationStage | 32.56 | 29.05 | -3.52 | -10.8% |
| LatentPreparationStage | 0.88 | 0.97 | +0.10 | +11.2% |
| DenoisingStage | 6123.29 | 7070.25 | +946.96 | +15.5% |
| DecodingStage | 4452.30 | 4141.41 | -310.89 | -7.0% |

Checklist

@github-actions bot added the diffusion (SGLang Diffusion) label Dec 4, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @tom-jerr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a lightweight, single-step warmup mechanism for the denoising stages of diffusion pipelines. The primary goal is to enhance performance, particularly when leveraging torch.compile, by preparing the model for efficient execution. The changes include a new configuration parameter to enable this feature and the integration of the warmup logic into core denoising components, demonstrating notable latency improvements in various benchmarks.

Highlights

  • New Warmup Feature: Introduced a new --enable-warmup command-line parameter and a corresponding enable_warmup flag in ServerArgs to control a lightweight, 1-step warmup mechanism for diffusion pipelines. This feature is disabled by default.
  • Denoising Stage Warmup: Implemented a _warmup method within the DenoisingStage and DmdDenoisingStage classes. This method performs a single-step forward pass to pre-compile or optimize the model's execution graph, specifically targeting the denoising process.
  • Torch Compile Integration Refinement: Refactored the torch.compile logic into a reusable torch_compile_module method, which also attempts to enable an inductor overlap tweak for potential further optimization. This change streamlines how modules are compiled and ensures consistent application of torch.compile settings.
  • Performance Improvements: Benchmarking shows significant end-to-end latency reductions (up to 49.2% for FastHunyuan-diffusers) when the warmup is enabled in conjunction with torch.compile. However, enabling warmup without torch.compile can lead to performance degradation for some models.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a lightweight, 1-step warmup for the denoising stage, controlled by a new --enable-warmup flag. This is implemented in DenoisingStage and DmdDenoisingStage. The changes also include a refactoring of torch.compile logic into a reusable helper method and improvements to argument handling for compiled functions. The benchmarks show significant performance gains when warmup is used with torch.compile.

My feedback focuses on a critical bug in causal_denoising.py that could lead to a NameError, and several instances of commented-out code that should be removed for better maintainability.

```diff
 )
 current_start_frame += 1
-remaining_frames = input_frames - 1
+input_frames -= 1
```
critical

This change introduces a bug. The variable remaining_frames is used in the while loop at line 188, but it is no longer defined within this if block. This will cause a NameError when independent_first_frame is true and input_frames >= 1. Please revert this line to its original state to ensure remaining_frames is correctly initialized.

Suggested change:

```diff
-input_frames -= 1
+remaining_frames = input_frames - 1
```

Collaborator: It's a bug?

Collaborator: Please fix it.

Comment on lines 340 to 343:

```python
# if self.server_args.enable_torch_compile:
#     self.transformer = torch.compile(
#         self.transformer, mode="max-autotune", fullgraph=True
#     )
```
medium

This block of commented-out code should be removed to keep the codebase clean.

@tom-jerr force-pushed the diffusion-warmup-new branch 2 times, most recently from 2d07b50 to c831e0f on December 4, 2025 13:21
@tom-jerr marked this pull request as draft December 4, 2025 14:02
@tom-jerr force-pushed the diffusion-warmup-new branch from c831e0f to 1b2e5a7 on December 5, 2025 06:46
@tom-jerr marked this pull request as ready for review December 5, 2025 06:46
@mickqian (Collaborator) commented Dec 7, 2025

Could you also check whether it's necessary to expand the warm-up to all stages? Thanks!

@tom-jerr (Contributor, Author) commented Dec 7, 2025

> Could you also check whether it's necessary to expand the warm-up to all stages? Thanks!

I will verify it.

@tom-jerr (Contributor, Author) commented Dec 8, 2025

Conclusion

I added warm-up to the text encoding and decoding stages, but the latency of those stages, as well as end-to-end latency, increased significantly.

Adding warm-up to other stages may not be a good idea.

Performance Comparison Report

Performance Comparison Report (Torch Compile Disabled)

0. Denoising Stage

  • Baseline

```
[12-08 06:17:05] [DenoisingStage] average time per step: 0.7551 seconds
[12-08 06:17:05] [DenoisingStage] finished in 37.7647 seconds
```

  • Warm Up

```
[12-08 06:18:53] [DenoisingStage] average time per step: 0.5881 seconds
[12-08 06:18:53] [DenoisingStage] finished in 32.1070 seconds
```

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 39253.60 ms | 34340.59 ms | -4913.01 ms (-12.5%) |
| Throughput | 0.03 req/s | 0.03 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 0.09 | 0.06 | -0.03 | -34.3% | ⚪️ |
| TextEncodingStage | 1032.05 | 1588.60 | +556.55 | +53.9% | 🔴 |
| ConditioningStage | 0.01 | 0.01 | -0.00 | -25.4% | ⚪️ |
| TimestepPreparationStage | 31.91 | 17.48 | -14.43 | -45.2% | ⚪️ |
| LatentPreparationStage | 0.27 | 0.22 | -0.05 | -18.0% | ⚪️ |
| DenoisingStage | 37764.52 | 32106.88 | -5657.63 | -15.0% | 🟢 |
| DecodingStage | 415.12 | 620.19 | +205.07 | +49.4% | 🔴 |

Performance Comparison Report (Torch Compile Enabled)

0. Denoising Stage

  • Baseline

```
[12-08 06:21:17] [DenoisingStage] average time per step: 1.6562 seconds
[12-08 06:21:17] [DenoisingStage] finished in 82.8159 seconds
```

  • Warm Up

```
[12-08 06:23:42] [DenoisingStage] average time per step: 0.9021 seconds
[12-08 06:23:42] [DenoisingStage] finished in 79.3291 seconds
```

1. High-level Summary

| Metric | Baseline | New | Diff |
| --- | --- | --- | --- |
| E2E Latency | 84261.16 ms | 81671.65 ms | -2589.51 ms (-3.1%) |
| Throughput | 0.01 req/s | 0.01 req/s | - |

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 0.05 | 0.14 | +0.08 | +149.2% | ⚪️ |
| TextEncodingStage | 920.33 | 1733.51 | +813.18 | +88.4% | 🔴 |
| ConditioningStage | 0.01 | 0.01 | +0.00 | +45.9% | ⚪️ |
| TimestepPreparationStage | 14.56 | 26.71 | +12.15 | +83.4% | ⚪️ |
| LatentPreparationStage | 0.20 | 0.30 | +0.10 | +47.9% | ⚪️ |
| DenoisingStage | 82815.79 | 79328.94 | -3486.86 | -4.2% | ⚪️ |
| DecodingStage | 501.43 | 568.73 | +67.30 | +13.4% | ⚪️ |

@mickqian (Collaborator) commented Dec 9, 2025

We might be considering no torch-compile for now.

> I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance; however, enabling --enable-torch-compile

How is that? Could you run multiple tests and dump the perf comparison report?

@tom-jerr (Contributor, Author) commented Dec 11, 2025

> We might be considering no torch-compile for now.
>
> > I found that for the FastWan2.1-T2V-1.3B-Diffusers video generation model, enabling warm-up without enabling torch compile results in poor performance; however, enabling --enable-torch-compile
>
> How is that? Could you run multiple tests and dump the perf comparison report?

More detailed information is here, and I tested multiple times.

```python
pass
mode = os.environ.get("SGLANG_TORCH_COMPILE_MODE", "max-autotune-no-cudagraphs")
compiled_forward = torch.compile(getattr(module, "forward"), mode=mode)
setattr(module, "forward", compiled_forward)
```
Collaborator: You can add log records explaining the compilation mode used.

@tom-jerr (Contributor, Author): I'll do this and fix the bug below.

```diff
 )
 current_start_frame += 1
-remaining_frames = input_frames - 1
+input_frames -= 1
```
Collaborator: It's a bug?

Update enable torch.compile and fix denoisingdmd bug

Signed-off-by: lzy <tomlzy213@gmail.com>
@tom-jerr force-pushed the diffusion-warmup-new branch from a23eaf2 to bd917da on December 28, 2025 15:24
@mickqian (Collaborator) commented:

/gemini please carefully review this PR; it aims to do a complete and thorough warmup so that in the actual forward pass there is no compile or warmup work left at all

@gemini-code-assist (Contributor):

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@tom-jerr (Contributor, Author) commented Dec 31, 2025

I found that PR #15773 warms up by running multiple rounds of generation before actually generating.

Does that method conflict with the denoising stage lightweight warm-up? I'm a bit confused right now: which operations are we actually warming up? @mickqian @yhyang201

@fsygd (Contributor) commented Dec 31, 2025

> I found that PR #15773 warms up by running multiple rounds of generation before actually generating.
>
> Does that method conflict with the denoising stage lightweight warm-up? I'm a bit confused right now: which operations are we actually warming up? @mickqian @yhyang201

I'll reach out to you on WeChat.

@mickqian (Collaborator) commented Dec 31, 2025

@tom-jerr I've made some updates to this PR.

To sum up, we now insert an identical warmup request with --num-inference-steps=1 before the actual request.
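A minimal sketch of the approach described above; `pipeline`, `generate`, and the request field name are illustrative placeholders, not the PR's actual code.

```python
import copy


def generate_with_warmup(pipeline, req):
    # Before serving the real request, run an identical copy with the step
    # count forced to 1, so every stage executes once end to end and all
    # compilation/autotuning happens outside the measured request.
    warmup_req = copy.deepcopy(req)
    warmup_req.num_inference_steps = 1
    pipeline.generate(warmup_req)   # discard the warmup output
    return pipeline.generate(req)   # the actual request
```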

@mickqian (Collaborator) commented:

/tag-and-rerun-ci

@tom-jerr (Contributor, Author) commented:

> @tom-jerr I've made some updates to this PR.
>
> To sum up, we now insert an identical warmup request with --num-inference-steps=1 before the actual request.

I think I understand your point now, and I'll rerun the tests based on these changes.

I've discussed this with fsygd, and I'm wondering whether the current implementation can be orthogonal to PR #15773, or whether we should discuss this further.

@mickqian (Collaborator) commented:

Duplicate of #16213.

@mickqian closed this Dec 31, 2025