[AMD] Enable all diffusion models and fix encoder loading on MI325 #13760
mickqian merged 27 commits into sgl-project:main
Conversation
Summary of Changes

Hello @zyzshishui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands SGLang Diffusion's compatibility by integrating full support for AMD's ROCm platform. The changes enable efficient execution of diffusion models on AMD GPUs, primarily by adopting AITer as the default attention backend and implementing robust data type handling. It also streamlines the development environment setup with a new ROCm-specific Dockerfile and improves stability by localizing external dependencies and preventing profiler data conflicts. The overall impact is a more versatile and performant SGLang Diffusion for a broader range of hardware.
Code Review
This pull request introduces ROCm support for SGLang Diffusion, a significant step towards broader hardware compatibility. The changes include a new Dockerfile for ROCm, the integration of the AITer attention backend, and various code modifications to ensure compatibility and remove problematic dependencies on ROCm. My review has identified a couple of critical issues—one in the Dockerfile that would cause build failures and another in the AITer backend implementation that could lead to runtime errors. I have also provided suggestions to improve Dockerfile efficiency and documentation clarity. Overall, this is a valuable contribution that enables SGLang Diffusion on AMD hardware.
python/sglang/multimodal_gen/runtime/layers/attention/backends/aiter.py
Force-pushed from 1a9891b to 586b79b
/tag-and-rerun-ci 11/26
Automatic Data Type Casting for AITer: I suggest falling back to SDPA instead of AITer in CLIP and other models, except the DiT part, to avoid image incorrectness.
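The per-module fallback the reviewer suggests could be sketched roughly as below; `select_attention_backend`, the backend stand-ins, and the `module_kind` strings are all illustrative assumptions, not SGLang's actual API:

```python
# Minimal sketch of the suggested fallback: use AITer only for the DiT
# blocks and SDPA everywhere else (e.g. the CLIP text encoder) on ROCm.
# All names here are hypothetical, not SGLang's real interfaces.

def sdpa_backend(q, k, v):
    """Stand-in for torch.nn.functional.scaled_dot_product_attention."""
    raise NotImplementedError

def aiter_backend(q, k, v):
    """Stand-in for the AITer attention kernel."""
    raise NotImplementedError

def select_attention_backend(module_kind: str, on_rocm: bool):
    """Pick the attention implementation per module type."""
    if on_rocm and module_kind != "dit":
        # Non-DiT modules (CLIP and other encoders) fall back to SDPA
        # to avoid the image incorrectness observed with AITer there.
        return sdpa_backend
    return aiter_backend
```

The dispatch keeps AITer on the DiT hot path, where it matters for performance, while routing the encoders through the safer SDPA path.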
You are the GOAT!
Force-pushed from e3be8e3 to a4e74e7
Force-pushed from 81e4bf1 to 5b4d240
/rerun-failed-ci
Co-authored-by: Sabre Shao <sabre.shao@amd.com>
Co-authored-by: Yusheng (Ethan) Su <yushengsu.thu@gmail.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
- Fix GPU OOM in sequential tests on ROCm/AMD with explicit memory cleanup
- Skip Ring Attention tests on AMD/ROCm (unsupported)
- Fix SGLANG_TEST_OUTPUT_SIZE not applied to actual test requests
- Add MIOpen kernel caching for AMD VAE performance
- Add diagnostics for HF cache and system resources
- Add disk cleanup for non-persistent HF cache between tests
- Enable all diffusion tests including LoRA (except FLUX.2 on 1-GPU)
The Docker image contains pre-compiled AITER kernels at /sgl-workspace/aiter/aiter/jit/ which may be incompatible. Clear them before running tests to force fresh JIT compilation.
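A cleanup step like the one described could be sketched as follows; the default path is the one from the Docker image above, but the function name and the `*.so` pattern are assumptions, not part of this PR:

```python
# Sketch: remove pre-compiled AITER JIT artifacts so the kernels are
# rebuilt by fresh JIT compilation for the local GPU. Deleting only
# *.so files (an assumption) leaves build sources in place.
from pathlib import Path

def clear_aiter_jit_cache(jit_dir: str = "/sgl-workspace/aiter/aiter/jit") -> int:
    """Delete compiled shared objects under jit_dir; return count removed."""
    root = Path(jit_dir)
    if not root.is_dir():
        # Nothing to clear outside the Docker image.
        return 0
    removed = 0
    for so_file in root.rglob("*.so"):
        so_file.unlink()
        removed += 1
    return removed
```

Running this before the test suite forces AITer to JIT-compile against the GPU actually present, rather than reusing kernels baked into the image.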
On the height handling:

    if self.height is None:
        self.height_not_provided = True

On the inference-steps override:

    # Allow env var to override num_inference_steps (for faster CI testing on AMD)

Please fix this in a follow-up PR; this can be passed via sampling params.
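The two approaches could look roughly like this; the env var name `SGLANG_CI_NUM_INFERENCE_STEPS` and the `SamplingParams` dataclass are stand-ins for illustration, not SGLang's real names:

```python
# Sketch contrasting the env-var override under review with the
# sampling-params route the reviewer prefers. Both the env var name
# and SamplingParams are hypothetical stand-ins.
import os
from dataclasses import dataclass

@dataclass
class SamplingParams:
    num_inference_steps: int = 50

def resolve_steps(params: SamplingParams) -> int:
    # Pattern under review: an env var silently overrides the request,
    # which hides the effective step count from callers.
    env_steps = os.environ.get("SGLANG_CI_NUM_INFERENCE_STEPS")
    if env_steps is not None:
        return int(env_steps)
    # Preferred pattern: CI passes the reduced step count explicitly
    # via sampling params, e.g. SamplingParams(num_inference_steps=4).
    return params.num_inference_steps
```

Passing the value through sampling params keeps the request self-describing, so test logs and reproductions don't depend on hidden environment state.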
Could you please provide more detail on this? Under which circumstances would this cause an issue? Thanks!
Done! Updated PR #13760 with the new description.
The CI should be running now with the rebased changes. The key fix for the ~100x slowdown in loading is the should_offload() bug fix in component_loader.py.