[MultiModal][Feat] multimodal develop - support Wan2.1#593

Merged
pengchengneo merged 2 commits into sgl-project:main from primatrix:epic/multimodal-support
Jan 19, 2026

Conversation

@pengchengneo
Collaborator

@pengchengneo pengchengneo commented Dec 23, 2025

Basic MultiModal Features Roadmap

  • User Interface Refactor
    • HTTP Requests/Tokenizer/Detokenizer (contract a common abstract class for multi-schema requests)
  • Launch Server
    • Abstract class definition, basic component development
    • WeightLoader util refactor, making it compatible with various multimodal models @andy1126

Wan2.1 Support Work Break Down

  • Diffusion Engine development (under multimodal folder) @SiqiLi-Fighting
    • Support a naive diffusion engine without any optimized features
  • VAE stage development (under multimodal folder) @pathfinder-pf
  • T5 stage development (under autoregressive/text folder) @SII-limingliu
    • Refactor some AR-stage interfaces to fit multimodal
  • GlobalScheduler and communication within stages
  • Unit tests / e2e tests / add to CI
  • Model evaluation
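The stage-based breakdown above (T5 encoder, diffusion engine, VAE) suggests a common stage interface. The sketch below is purely illustrative: `Stage`, `StageBatch`, and `T5Stage` are hypothetical names, not the actual sgl-jax classes introduced by this PR.

```python
# Hedged sketch of a stage abstraction, assuming each device stage
# (T5 encoder, DiT denoiser, VAE) exposes a uniform forward() hook.
# All names here are illustrative, not the real sgl-jax API.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class StageBatch:
    """Hypothetical container handed between stages."""
    request_ids: list[str]
    payload: Any = None


class Stage(ABC):
    """Base class a device stage could implement."""

    @abstractmethod
    def forward(self, batch: StageBatch) -> StageBatch:
        ...


class T5Stage(Stage):
    def forward(self, batch: StageBatch) -> StageBatch:
        # Placeholder: a real stage would run the text encoder here.
        batch.payload = {"text_embeds": f"embeds for {len(batch.request_ids)} prompts"}
        return batch
```

With such an interface, the GlobalScheduler only needs to route `StageBatch` objects between stages rather than knowing each model component's internals.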

Currently, we support two models:
a. Wan-AI/Wan2.1-T2V-1.3B-Diffusers
b. Wan-AI/Wan2.1-T2V-14B-Diffusers

Model Evaluation Results

Test Command

Environment: tpu-v6e-4

uv run python3 -u -m sgl_jax.launch_server --multimodal --model-path=Wan-AI/Wan2.1-T2V-14B-Diffusers --log-requests
uv run python3 -u -m sgl_jax.launch_server --multimodal --model-path=Wan-AI/Wan2.1-T2V-1.3B-Diffusers --log-requests

1.3B/14B Image

curl http://localhost:30000/api/v1/images/generation -H "Content-Type: application/json" -d '{"prompt": "A curious raccoon", "size": "480*832"}'

1.3B Video

curl http://localhost:30000/api/v1/videos/generation -H "Content-Type: application/json" -d '{"prompt": "A curious raccoon", "size": "480*832", "num_frames": 41}'

14B Video (this model still needs optimization to support videos with a large num_frames)

curl http://localhost:30000/api/v1/videos/generation -H "Content-Type: application/json" -d '{"prompt": "A curious raccoon", "size": "480*832", "num_frames": 5}'
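The curl calls above can also be issued from Python. This sketch only builds the request (the endpoint path and JSON fields are taken from the commands above); actually sending it is left commented out since it requires a launched server, and the `build_video_request` helper is our own, not part of sgl-jax.

```python
# Build the same POST request as the curl example, using only the stdlib.
import json
from urllib import request


def build_video_request(prompt, size="480*832", num_frames=41,
                        url="http://localhost:30000/api/v1/videos/generation"):
    body = json.dumps({"prompt": prompt, "size": size,
                       "num_frames": num_frames}).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})


req = build_video_request("A curious raccoon")
# resp = request.urlopen(req)  # requires a running server started as above
```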

Test Result

| Model Name | Generated Image (Preview) | Generated Video |
| --- | --- | --- |
| Wan-AI/Wan2.1-T2V-1.3B-Diffusers | 1.3B Image | ▶️ Click to Watch Video |
| Wan-AI/Wan2.1-T2V-14B-Diffusers | 14B Image | ▶️ Click to Watch Video |

@gemini-code-assist

Summary of Changes

Hello @SiqiLi-Fighting, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the groundwork for a robust multimodal inference framework within sgl-jax, enabling efficient processing of complex models like Wan2.1. The core idea is to create a flexible, high-performance system that can seamlessly integrate various computational stages, such as autoregressive decoding and diffusion denoising, by adopting a thread-based, single-process architecture. This change introduces new server configurations, API endpoints for image and video generation, and a modular stage-based execution pipeline, significantly expanding the platform's capabilities beyond text-only models.

Highlights

  • Multimodal Framework Introduction: Implemented a new, unified, high-performance inference framework for next-generation multimodal models (e.g., Wan2.1, Qwen2.5VL, MiMo-Audio, Qwen-Omni, Ling-Omni).
  • Modular Architecture: Designed with an "Operating System" philosophy, separating the control plane (Global Scheduler) from the computation plane (Device Stages) to support heterogeneous compute patterns like AR decoding and Diffusion denoising.
  • Thread-Based SPMD Concurrency: Utilizes a Single Process, Multiple Data (SPMD) logic with multi-threading to minimize inter-process communication overhead and maximize parallel throughput.
  • Multimodal Server Arguments: Introduced MultimodalServerArgs to configure multimodal-specific parameters, including precision settings for DiT, VAE, and various encoders.
  • New API Endpoints: Added HTTP API routes for image and video generation (/api/v1/images/generation, /api/v1/videos/generation).
  • Stage-Based Execution: Implemented a Stage abstraction, DeviceManager, GlobalScheduler, and specialized schedulers (DiffusionScheduler, VaeScheduler) to manage and execute different model components.
  • Multimodal Tokenization/Detokenization: Introduced dedicated MultimodalTokenizer and MultimodalDetokenizer components to handle complex multimodal input/output schemas.
  • Design Documentation: Included a detailed design document ([RFC]multimodal_architechure.md) outlining the framework's principles and components.
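The thread-based, single-process SPMD design in the highlights can be illustrated with a toy pipeline: stages run as threads inside one process and hand work off through in-memory queues rather than inter-process communication. Everything below (stage names, the sentinel protocol, the string "work") is an illustrative assumption, not the PR's actual scheduler code.

```python
# Toy two-stage pipeline: an "encode" stage feeds a "denoise" stage via
# queues, all within a single process, mirroring the thread-based design.
import queue
import threading

results = []


def run_stage(inbox, outbox, work):
    while True:
        item = inbox.get()
        if item is None:            # sentinel: shut down and propagate
            if outbox is not None:
                outbox.put(None)
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)      # hand off to the next stage
        else:
            results.append(result)  # final stage collects output


q_encode, q_denoise = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=run_stage,
                     args=(q_encode, q_denoise, lambda p: p + " [encoded]")),
    threading.Thread(target=run_stage,
                     args=(q_denoise, None, lambda p: p + " [denoised]")),
]
for t in threads:
    t.start()
q_encode.put("A curious raccoon")
q_encode.put(None)                  # no more requests
for t in threads:
    t.join()
# results now holds "A curious raccoon [encoded] [denoised]"
```

Because the queues live in shared memory, handing a batch between stages costs a pointer exchange instead of serialization, which is the overhead the SPMD thread design avoids.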



@pengchengneo pengchengneo marked this pull request as draft December 23, 2025 10:18
@pengchengneo pengchengneo changed the title from [Feat] multimodal develop - support Wan2.1 to [WIP][Feat] multimodal develop - support Wan2.1 Dec 23, 2025
@pengchengneo pengchengneo linked an issue Dec 30, 2025 that may be closed by this pull request
13 tasks
@pengchengneo pengchengneo force-pushed the epic/multimodal-support branch from a775359 to 14aa623 Compare January 14, 2026 11:54
@pengchengneo pengchengneo marked this pull request as ready for review January 14, 2026 11:54
@pengchengneo pengchengneo force-pushed the epic/multimodal-support branch from 14aa623 to 4d1b1c0 Compare January 14, 2026 11:55
@pengchengneo pengchengneo changed the title from [WIP][Feat] multimodal develop - support Wan2.1 to [MultiModal][Feat] multimodal develop - support Wan2.1 Jan 14, 2026
Co-authored-by: pathfinder-fp <slackexplorer@gmail.com>
@pengchengneo pengchengneo force-pushed the epic/multimodal-support branch from 177674f to a13da64 Compare January 19, 2026 04:21
Collaborator

@sii-xinglong sii-xinglong left a comment


LGTM

@pengchengneo pengchengneo merged commit 456a113 into sgl-project:main Jan 19, 2026
17 checks passed


Development

Successfully merging this pull request may close these issues.

  • [RFC]: Support multimodal model in SGL_JAX
  • [Feature] Multi Modal Models Support

5 participants