feat: add transformers Qwen3Omni thinker chunk stream for livekit webrtc room vision+voice bot #196

Merged
weedge merged 5 commits into main from feat/vision_voice on Sep 28, 2025
Conversation

weedge (Collaborator) commented Sep 26, 2025

Tip

  • Special characters are not handled in the generated audio (a drawback of an omni model that generates audio directly in one unified pass for the TTS task); this could perhaps be addressed at the hidden-state level. Constraining it via the system prompt (instructing it not to emit special characters) does not seem to work. For example:
    * **Pros**: This drink is designed specifically for post-workout use. Its core ingredient is electrolytes; the label explicitly states "electrolytes ≥ 200mg", which effectively replenishes the sodium, potassium, and other minerals lost through heavy sweating during exercise, helping maintain fluid balance and prevent cramps. It also contains vitamin E and vitamin B6, which aid energy metabolism, and it is 0-sugar, 0-calorie, so there is no need to worry about extra calories.
  • The HF transformers PyTorch implementation of qwen3_omni is good for understanding the model structure; for the concurrent streaming design, see the Qwen3Omni technical report, ⭐️ Designs for Streaming and Concurrency. Combined with vLLM, only the thinker implementation is public for now; talker and code2wav have not been released. The vLLM team may start a separate project to support this: heterogeneously decomposing an Omni model (trained with unified multimodal fusion) for inference optimization, supporting multiple LLMs in one pipeline (including MTP aux_hidden_states), with hidden_states exchanged between stages (e.g. via a zmq queue) in a pluggable heterogeneous inference design that outputs multimodal content or hidden_states. Whether code2wav will be covered is unclear; vLLM's existing inference optimizations for Qwen3-MoE can be reused. For example, Qwen2Omni/Qwen3Omni use two LLMs (with hidden-state interaction between thinker and talker). It is also unknown whether the Qwen3Omni team will open-source the Designs for Streaming and Concurrency code; it may be synced once the community implements it.
  • If vLLM is used to support the Omni thinker -> talker -> code2wav pipeline, the execute_model method in each hardware backend's worker.model_runner can be modified, e.g. GPUModelRunner.execute_model: fetch the Qwen3OmniMoeForConditionalGeneration class registered via the HF config.json (the thinker already supports multimodal input understanding) and run inference (forward); the MoE LM's inference is compiled with torch.compile for optimization (warmup is needed at startup), see the code implementation.
  • This PR implements an HF transformers PyTorch Qwen3Omni thinker chunk stream. The granularity is fairly coarse; the main goal is to understand the Qwen3Omni model structure and streaming inference (a rough end-to-end generation sketch follows this list). Compared with Qwen2.5Omni (for the Qwen2.5Omni analysis PR, see: feat: add qwen2.5-omni #143):
    • The audio encoder is AuT, retrained from scratch, using an encoder-only module (possibly the latest unreleased FunASR model structure); worth borrowing from if you train a standalone multimodal speech-understanding / ASR module.
    • The Thinker and Talker LLM decoders both use an MoE structure (a sparse model: the trained weights are large, but the activated parameters at inference time are small relative to a dense model, so compute cost (FLOPs) is low and inference is fast) (hats off to the Google folks);
    • The Talker uses MTP to speed up generation for the Streaming Multi-Codebook Codec (1 Linear + 15 dense Transformer layers) (see the VITA-Audio analysis PR: feat: add VITA-Audio #146);
    • Code2Wav uses a ConvNet structure, which makes streaming inference faster; worth borrowing from if you train a standalone TTS module.
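
As a concrete reference for the pipeline discussed above, here is a rough non-streaming end-to-end sketch in the style of the HF model-card usage. The exact generate kwargs and return types (speaker, thinker_max_new_tokens, the (text_ids, audio) tuple) are assumptions carried over from the Qwen2.5-Omni API, not verified against this PR:

# Hypothetical end-to-end sketch: thinker (text) + talker/code2wav (audio) in one generate call.
import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # helper from the Qwen omni cookbooks

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto", attn_implementation="flash_attention_2"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Introduce yourself briefly."}]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True
).to(model.device)

# speaker and the (text_ids, audio) return shape are assumed from the Qwen2.5-Omni API
text_ids, audio = model.generate(**inputs, speaker="Chelsie", thinker_max_new_tokens=1024)
print(processor.batch_decode(text_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
if audio is not None:
    sf.write("output.wav", audio.reshape(-1).float().cpu().numpy(), samplerate=24000)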

feat:

  • add qwen3_omni transformers/vllm cookbook cases as Modal tests
# 0. download models and assets
modal run src/download_models.py --repo-ids "Qwen/Qwen3-Omni-30B-A3B-Instruct"

# vllm
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task asr
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task speech_translation
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task image_question
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task audio_interaction
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task audio_interaction_scene  # chat
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task video_interaction
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task video_interaction_scene  # video includes audio (chat)
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task video_information_extracting
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task image_audio_interaction
IMAGE_GPU=A100-80GB modal run src/llm/vllm/qwen3_omni_fork.py --task audio_function_call
IMAGE_GPU=B200 modal run src/llm/vllm/qwen3_omni_fork.py --task text_image_video_audio_interaction
IMAGE_GPU=B200 modal run src/llm/vllm/qwen3_omni_fork.py --task batch_requests
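
For reference, a minimal sketch of what the thinker-only inference in qwen3_omni_fork.py might look like with vLLM (names and arguments are assumptions; this PR actually depends on a forked vLLM build that registers the model, and the Qwen-style chat-template tokens below are assumed):

# Hypothetical sketch: thinker-only inference with vLLM (forked build assumed).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    max_model_len=32768,
    limit_mm_per_prompt={"image": 1, "audio": 1, "video": 1},  # cap multimodal items per prompt
)
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

prompt = "<|im_start|>user\nIntroduce yourself briefly.<|im_end|>\n<|im_start|>assistant\n"
# multimodal inputs would go in {"prompt": ..., "multi_modal_data": {"audio": [(waveform, sr)]}}
outputs = llm.generate({"prompt": prompt}, sampling_params=sampling)
print(outputs[0].outputs[0].text)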
  • add transformers qwen3_omni thinker chunk stream cases as Modal tests
# HF transformers
modal run src/llm/transformers/qwen3_omni.py --task tokenizer
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task dump_model
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task asr
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task text2speech
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task speech_translation
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task image_question
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task audio_interaction
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task audio_interaction_scene  # speech chat
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task video_interaction
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task video_interaction_scene  # video includes audio (chat)
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task video_information_extracting
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task image_text_interaction
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task image_audio_interaction
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task audio_function_call
IMAGE_GPU=B200 modal run src/llm/transformers/qwen3_omni.py --task text_image_video_audio_interaction
IMAGE_GPU=B200 modal run src/llm/transformers/qwen3_omni.py --task batch_requests

# thinker chunk stream
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task text2text_stream
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task text2speech_stream
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task image_text_interaction_stream
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task image_audio_interaction_stream
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task audio_interaction_scene_stream  # speech chat
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task video_interaction_scene_stream  # video includes audio (chat)

IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task code2wav
IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task code2wav_stream

ACHATBOT_VERSION=0.0.26.post1 IMAGE_GPU=A100-80GB modal run src/llm/transformers/qwen3_omni.py --task achatbot_generate
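
The thinker chunk stream tasks above are built on a chunked decode loop (thinker_generate_chunk / generate_stream in this PR): generate at most lm_gen_max_tokens_per_step tokens per step, reuse the KV cache across steps, and emit partial text as it appears. A simplified greedy sketch (hypothetical helper; the real code also handles sampling, attention masks, and the talker/code2wav hand-off):

# Simplified greedy chunked-decoding sketch; not the PR's exact implementation.
import torch

@torch.inference_mode()
def thinker_chunk_stream(thinker, processor, inputs, max_new_tokens=1024,
                         tokens_per_step=10, eos_token_ids=(151643, 151645)):
    step_inputs = dict(inputs)       # first step: full multimodal prefill
    past_key_values = None
    chunk_ids, done = [], False
    for _ in range(max_new_tokens):
        out = thinker(**step_inputs, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy for brevity
        done = next_id.item() in eos_token_ids
        if not done:
            chunk_ids.append(next_id)
        step_inputs = {"input_ids": next_id}  # later steps decode one token at a time
        if chunk_ids and (len(chunk_ids) >= tokens_per_step or done):
            yield processor.batch_decode(torch.cat(chunk_ids, dim=-1),
                                         skip_special_tokens=True)[0]
            chunk_ids = []
        if done:
            break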
  • add livekit webrtc room vision+voice bot with transformers (slow)
# 0. download models and assets
modal run src/download_models.py --repo-ids "Qwen/Qwen3-Omni-30B-A3B-Instruct"

--------------

# 1. run webrtc room http bots server

IMAGE_GPU=A100-80GB SERVER_TAG=fastapi_webrtc_bots \
    ACHATBOT_VERSION=0.0.26.post1 \
    modal serve src/fastapi_webrtc_qwen3omni_vision_voice_bot_serve.py

--------------

# 2. run webrtc room http signal bot server

modal volume create config
modal volume put config ./config/bots/livekit_qwen3omni_vision_voice_bot.json /bots/ -f


## run container with gpu
IMAGE_GPU=A100-80GB SERVER_TAG=fastapi_webrtc_bots \
    ACHATBOT_VERSION=0.0.26.post1 \
    CONFIG_FILE=/root/.achatbot/config/bots/livekit_qwen3omni_vision_voice_bot.json \
    modal serve src/fastapi_webrtc_qwen3omni_vision_voice_bot_serve.py

## cold start fastapi webrtc http server
curl -v -XGET "https://weedge--qwen3omni-bot-srv-app-dev.modal.run/health"

## run bot
curl -XPOST "https://weedge--qwen3omni-bot-srv-app-dev.modal.run/bot_join/chat-room/LivekitQwen3OmniVisionVoiceBot"

livekit_qwen3omni_vision_voice_bot.json

{
  "chat_bot_name": "LivekitQwen3OmniVisionVoiceBot",
  "handle_sigint": true,
  "is_background": false,
  "save_audio": true,
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen3OmniVisionVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "omni_llm": "llm_transformers_manual_qwen3omni_vision_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "omni_llm": {
      "processor": "Qwen3OmnVisionVoiceProcessor",
      "tag": "llm_transformers_manual_qwen3omni_vision_voice",
      "args": {
        "no_stream_sleep_time": 0.0,
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 2,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151643, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "code2wav_args": {
          "chunk_size": 50,
          "left_context_size": 25
        },
        "speaker": "Chelsie",
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen3-Omni-30B-A3B-Instruct"
      }
    }
  },
  "config_list": []
}

AI podcast: (embedded audio attachment)


HF Transformers qwen3_omni_moe: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py

(figure: model architecture diagram)

Thinker

Qwen3OmniMoeForConditionalGeneration.thinker 31719.205488 M parameters

Qwen3OmniMoeThinkerForConditionalGeneration(
  (audio_tower): Qwen3OmniMoeAudioEncoder(
    (positional_embedding): SinusoidsPositionEmbedding()
    (layers): ModuleList(
      (0-31): 32 x Qwen3OmniMoeAudioEncoderLayer(
        (self_attn): Qwen3OmniMoeAudioAttention(
          (k_proj): Linear(in_features=1280, out_features=1280, bias=True)
          (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
          (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
          (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=1280, out_features=5120, bias=True)
        (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
    (conv2d1): Conv2d(1, 480, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (conv2d2): Conv2d(480, 480, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (conv2d3): Conv2d(480, 480, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (conv_out): Linear(in_features=7680, out_features=1280, bias=False)
    (proj1): Linear(in_features=1280, out_features=1280, bias=True)
    (act): GELUActivation()
    (proj2): Linear(in_features=1280, out_features=2048, bias=True)
  )
  (visual): Qwen3OmniMoeVisionEncoder(
    (merger_list): ModuleList(
      (0-2): 3 x Qwen3OmniMoeVisionPatchMerger(
        (ln_q): LayerNorm((4608,), eps=1e-06, elementwise_affine=True)
        (mlp): ModuleList(
          (0): Linear(in_features=4608, out_features=4608, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=4608, out_features=2048, bias=True)
        )
      )
    )
    (patch_embed): Qwen3OmniMoeVisionPatchEmbed(
      (proj): Conv3d(3, 1152, kernel_size=(2, 16, 16), stride=(2, 16, 16))
    )
    (pos_embed): Embedding(2304, 1152)
    (rotary_pos_emb): Qwen3OmniMoeVisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-26): 27 x Qwen3OmniMoeVisionBlock(
        (norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
        (attn): Qwen3OmniMoeVisionAttention(
          (qkv): Linear(in_features=1152, out_features=3456, bias=True)
          (proj): Linear(in_features=1152, out_features=1152, bias=True)
        )
        (mlp): Qwen3OmniMoeVisionMLP(
          (linear_fc1): Linear(in_features=1152, out_features=4304, bias=True)
          (linear_fc2): Linear(in_features=4304, out_features=1152, bias=True)
          (act_fn): PytorchGELUTanh()
        )
      )
    )
    (merger): Qwen3OmniMoeVisionPatchMerger(
      (ln_q): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
      (mlp): ModuleList(
        (0): Linear(in_features=4608, out_features=4608, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=4608, out_features=2048, bias=True)
      )
    )
  )
  (model): Qwen3OmniMoeThinkerTextModel(
    (embed_tokens): Embedding(152064, 2048)
    (layers): ModuleList(
      (0-47): 48 x Qwen3OmniMoeThinkerTextDecoderLayer(
        (self_attn): Qwen3OmniMoeThinkerTextAttention(
          (q_proj): Linear(in_features=2048, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2048, bias=False)
          (q_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3OmniMoeThinkerTextSparseMoeBlock(
          (gate): Linear(in_features=2048, out_features=128, bias=False)
          (experts): ModuleList(
            (0-127): 128 x Qwen3OmniMoeThinkerTextMLP(
              (gate_proj): Linear(in_features=2048, out_features=768, bias=False)
              (up_proj): Linear(in_features=2048, out_features=768, bias=False)
              (down_proj): Linear(in_features=768, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): Qwen3OmniMoeTextRMSNorm((2048,), eps=1e-06)
    (rotary_emb): Qwen3OmniMoeThinkerTextRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=152064, bias=False)
)
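
The per-module parameter counts quoted in these dumps can be reproduced with a small helper along these lines (a sketch; presumably what the dump_model task does):

import torch

def dump_module(model: torch.nn.Module, name: str) -> None:
    # Print "<ClassName>.<name> <N> M parameters" followed by the module tree.
    sub = getattr(model, name)
    n_params = sum(p.numel() for p in sub.parameters())
    print(f"{type(model).__name__}.{name} {n_params / 1e6:.6f} M parameters")
    print(sub)

# e.g. dump_module(model, "thinker"); dump_module(model, "talker"); dump_module(model, "code2wav")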

Audio Encoder (AuT)

(figure: AuT audio encoder diagram)

Talker

Qwen3OmniMoeForConditionalGeneration.talker 3324.59648 M parameters

Qwen3OmniMoeTalkerForConditionalGeneration(
  (model): Qwen3OmniMoeTalkerModel(
    (layers): ModuleList(
      (0-19): 20 x Qwen3OmniMoeTalkerDecoderLayer(
        (self_attn): Qwen3OmniMoeThinkerTextAttention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=256, bias=False)
          (v_proj): Linear(in_features=1024, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3OmniMoeTalkerTextSparseMoeBlock(
          (gate): Linear(in_features=1024, out_features=128, bias=False)
          (experts): ModuleList(
            (0-127): 128 x Qwen3OmniMoeTalkerTextMLP(
              (gate_proj): Linear(in_features=1024, out_features=384, bias=False)
              (up_proj): Linear(in_features=1024, out_features=384, bias=False)
              (down_proj): Linear(in_features=384, out_features=1024, bias=False)
              (act_fn): SiLU()
            )
          )
          (shared_expert): Qwen3OmniMoeTalkerTextMLP(
            (gate_proj): Linear(in_features=1024, out_features=768, bias=False)
            (up_proj): Linear(in_features=1024, out_features=768, bias=False)
            (down_proj): Linear(in_features=768, out_features=1024, bias=False)
            (act_fn): SiLU()
          )
          (shared_expert_gate): Linear(in_features=1024, out_features=1, bias=False)
        )
        (input_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((1024,), eps=1e-06)
      )
    )
    (norm): Qwen3OmniMoeTextRMSNorm((1024,), eps=1e-06)
    (rotary_emb): Qwen3OmniMoeTalkerRotaryEmbedding()
    (codec_embedding): Embedding(3072, 1024)
  )
  (text_projection): Qwen3OmniMoeTalkerResizeMLP(
    (linear_fc1): Linear(in_features=2048, out_features=2048, bias=True)
    (linear_fc2): Linear(in_features=2048, out_features=1024, bias=True)
    (act_fn): SiLU()
  )
  (hidden_projection): Qwen3OmniMoeTalkerResizeMLP(
    (linear_fc1): Linear(in_features=2048, out_features=2048, bias=True)
    (linear_fc2): Linear(in_features=2048, out_features=1024, bias=True)
    (act_fn): SiLU()
  )
  (codec_head): Linear(in_features=1024, out_features=3072, bias=False)
  (code_predictor): Qwen3OmniMoeTalkerCodePredictorModelForConditionalGeneration(
    (model): Qwen3OmniMoeTalkerCodePredictorModel(
      (layers): ModuleList(
        (0-4): 5 x Qwen3OmniMoeTalkerCodePredictorDecoderLayer(
          (self_attn): Qwen3OmniMoeTalkerCodePredictorAttention(
            (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
            (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (q_norm): Qwen3OmniMoeRMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3OmniMoeRMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3OmniMoeMLP(
            (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-06)
          (post_attention_layernorm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-06)
        )
      )
      (norm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-06)
      (rotary_emb): Qwen3OmniMoeRotaryEmbedding()
      (codec_embedding): ModuleList(
        (0-14): 15 x Embedding(2048, 1024)
      )
    )
    (lm_head): ModuleList(
      (0-14): 15 x Linear(in_features=1024, out_features=2048, bias=False)
    )
  )
)
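
The shapes above reflect the MTP setup mentioned in the Tip: codec_head (1024 -> 3072) emits the first codebook token per frame, and the code_predictor's 15 embeddings/heads cover the remaining 15 residual codebooks. A toy illustration of just the head shapes (in the real model the 5-layer code predictor runs autoregressively across codebooks; this parallel version is only illustrative):

# Toy illustration of the talker's multi-codebook heads; not the upstream algorithm.
import torch
import torch.nn as nn

hidden = torch.randn(1, 1, 1024)                      # talker hidden state for one frame
codec_head = nn.Linear(1024, 3072, bias=False)        # first codebook (3072-way)
residual_heads = nn.ModuleList(
    nn.Linear(1024, 2048, bias=False) for _ in range(15)  # codebooks 1..15 (2048-way each)
)

code0 = codec_head(hidden).argmax(-1)                          # (1, 1)
residual = [head(hidden).argmax(-1) for head in residual_heads]
codes = torch.cat([code0, *residual], dim=-1)                  # (1, 16) codec tokens per frame
print(codes.shape)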

Streaming Codec Decoder (code2wav)

Qwen3OmniMoeForConditionalGeneration.code2wav 216.016577 M parameters

Qwen3OmniMoeCode2Wav(
  (pre_transformer): Qwen3OmniMoeCode2WavTransformerModel(
    (layers): ModuleList(
      (0-7): 8 x Qwen3OmniMoeCode2WavTransformerLayer(
        (self_attn): Qwen3OmniMoeCode2WavAttention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (q_norm): Identity()
          (k_norm): Identity()
        )
        (mlp): Qwen3OmniMoeCode2WavMlp(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3OmniMoeCode2WavRMSNorm((1024,), eps=1e-05)
        (post_attention_layernorm): Qwen3OmniMoeCode2WavRMSNorm((1024,), eps=1e-05)
        (self_attn_layer_scale): Qwen3OmniMoeCode2WavLayerScale()
        (mlp_layer_scale): Qwen3OmniMoeCode2WavLayerScale()
      )
    )
    (norm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-05)
    (rotary_emb): Qwen3OmniMoeRotaryEmbedding()
  )
  (code_embedding): Embedding(32768, 1024)
  (upsample): ModuleList(
    (0-1): 2 x ModuleList(
      (0): Qwen3OmniMoeCausalTransConvNet(
        (conv): ConvTranspose1d(1024, 1024, kernel_size=(2,), stride=(2,))
      )
      (1): Qwen3OmniMoeConvNeXtBlock(
        (dwconv): Qwen3OmniMoeCausalConvNet(
          (conv): Conv1d(1024, 1024, kernel_size=(7,), stride=(1,), groups=1024)
        )
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (pwconv1): Linear(in_features=1024, out_features=4096, bias=True)
        (act): GELU(approximate='none')
        (pwconv2): Linear(in_features=4096, out_features=1024, bias=True)
      )
    )
  )
  (decoder): ModuleList(
    (0): Qwen3OmniMoeCausalConvNet(
      (conv): Conv1d(1024, 1536, kernel_size=(7,), stride=(1,))
    )
    (1): Qwen3OmniMoeCode2WavDecoderBlock(
      (block): ModuleList(
        (0): SnakeBeta()
        (1): Qwen3OmniMoeCausalTransConvNet(
          (conv): ConvTranspose1d(1536, 768, kernel_size=(16,), stride=(8,))
        )
        (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(768, 768, kernel_size=(7,), stride=(1,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,))
          )
        )
        (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(768, 768, kernel_size=(7,), stride=(1,), dilation=(3,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,))
          )
        )
        (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(768, 768, kernel_size=(7,), stride=(1,), dilation=(9,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,))
          )
        )
      )
    )
    (2): Qwen3OmniMoeCode2WavDecoderBlock(
      (block): ModuleList(
        (0): SnakeBeta()
        (1): Qwen3OmniMoeCausalTransConvNet(
          (conv): ConvTranspose1d(768, 384, kernel_size=(10,), stride=(5,))
        )
        (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(384, 384, kernel_size=(7,), stride=(1,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
          )
        )
        (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(384, 384, kernel_size=(7,), stride=(1,), dilation=(3,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
          )
        )
        (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(384, 384, kernel_size=(7,), stride=(1,), dilation=(9,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
          )
        )
      )
    )
    (3): Qwen3OmniMoeCode2WavDecoderBlock(
      (block): ModuleList(
        (0): SnakeBeta()
        (1): Qwen3OmniMoeCausalTransConvNet(
          (conv): ConvTranspose1d(384, 192, kernel_size=(8,), stride=(4,))
        )
        (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(192, 192, kernel_size=(7,), stride=(1,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
          )
        )
        (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(192, 192, kernel_size=(7,), stride=(1,), dilation=(3,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
          )
        )
        (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(192, 192, kernel_size=(7,), stride=(1,), dilation=(9,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
          )
        )
      )
    )
    (4): Qwen3OmniMoeCode2WavDecoderBlock(
      (block): ModuleList(
        (0): SnakeBeta()
        (1): Qwen3OmniMoeCausalTransConvNet(
          (conv): ConvTranspose1d(192, 96, kernel_size=(6,), stride=(3,))
        )
        (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(96, 96, kernel_size=(7,), stride=(1,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(96, 96, kernel_size=(1,), stride=(1,))
          )
        )
        (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(96, 96, kernel_size=(7,), stride=(1,), dilation=(3,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(96, 96, kernel_size=(1,), stride=(1,))
          )
        )
        (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
          (act1): SnakeBeta()
          (conv1): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(96, 96, kernel_size=(7,), stride=(1,), dilation=(9,))
          )
          (act2): SnakeBeta()
          (conv2): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(96, 96, kernel_size=(1,), stride=(1,))
          )
        )
      )
    )
    (5): SnakeBeta()
    (6): Qwen3OmniMoeCausalConvNet(
      (conv): Conv1d(96, 1, kernel_size=(7,), stride=(1,))
    )
  )
)
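
Since this decoder is a stack of causal ConvNets, it can be driven in chunks, which is what the code2wav_args in the bot config above (chunk_size=50, left_context_size=25) control. A hypothetical sketch, assuming Qwen3OmniMoeCode2Wav.forward maps a (batch, frames) tensor of codec ids to a (batch, 1, samples) waveform:

# Hypothetical chunked vocoding: decode with left context, keep only the new samples.
import torch

@torch.inference_mode()
def code2wav_chunk_stream(code2wav, codes, chunk_size=50, left_context_size=25):
    total = codes.shape[-1]
    for start in range(0, total, chunk_size):
        ctx_start = max(0, start - left_context_size)
        end = min(start + chunk_size, total)
        wav = code2wav(codes[..., ctx_start:end])                 # context + new frames
        samples_per_frame = wav.shape[-1] // (end - ctx_start)    # fixed upsample ratio
        yield wav[..., (start - ctx_start) * samples_per_frame:]  # drop the context audio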

Model Neural Network Architecture

Qwen3OmniMoeForConditionalGeneration 35259.818545 M parameters

Qwen3OmniMoeForConditionalGeneration(
  (thinker): Qwen3OmniMoeThinkerForConditionalGeneration(
    (audio_tower): Qwen3OmniMoeAudioEncoder(
      (positional_embedding): SinusoidsPositionEmbedding()
      (layers): ModuleList(
        (0-31): 32 x Qwen3OmniMoeAudioEncoderLayer(
          (self_attn): Qwen3OmniMoeAudioAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
          (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        )
      )
      (ln_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (conv2d1): Conv2d(1, 480, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (conv2d2): Conv2d(480, 480, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (conv2d3): Conv2d(480, 480, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (conv_out): Linear(in_features=7680, out_features=1280, bias=False)
      (proj1): Linear(in_features=1280, out_features=1280, bias=True)
      (act): GELUActivation()
      (proj2): Linear(in_features=1280, out_features=2048, bias=True)
    )
    (visual): Qwen3OmniMoeVisionEncoder(
      (merger_list): ModuleList(
        (0-2): 3 x Qwen3OmniMoeVisionPatchMerger(
          (ln_q): LayerNorm((4608,), eps=1e-06, elementwise_affine=True)
          (mlp): ModuleList(
            (0): Linear(in_features=4608, out_features=4608, bias=True)
            (1): GELU(approximate='none')
            (2): Linear(in_features=4608, out_features=2048, bias=True)
          )
        )
      )
      (patch_embed): Qwen3OmniMoeVisionPatchEmbed(
        (proj): Conv3d(3, 1152, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
      (pos_embed): Embedding(2304, 1152)
      (rotary_pos_emb): Qwen3OmniMoeVisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-26): 27 x Qwen3OmniMoeVisionBlock(
          (norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          (norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          (attn): Qwen3OmniMoeVisionAttention(
            (qkv): Linear(in_features=1152, out_features=3456, bias=True)
            (proj): Linear(in_features=1152, out_features=1152, bias=True)
          )
          (mlp): Qwen3OmniMoeVisionMLP(
            (linear_fc1): Linear(in_features=1152, out_features=4304, bias=True)
            (linear_fc2): Linear(in_features=4304, out_features=1152, bias=True)
            (act_fn): PytorchGELUTanh()
          )
        )
      )
      (merger): Qwen3OmniMoeVisionPatchMerger(
        (ln_q): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
        (mlp): ModuleList(
          (0): Linear(in_features=4608, out_features=4608, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=4608, out_features=2048, bias=True)
        )
      )
    )
    (model): Qwen3OmniMoeThinkerTextModel(
      (embed_tokens): Embedding(152064, 2048)
      (layers): ModuleList(
        (0-47): 48 x Qwen3OmniMoeThinkerTextDecoderLayer(
          (self_attn): Qwen3OmniMoeThinkerTextAttention(
            (q_proj): Linear(in_features=2048, out_features=4096, bias=False)
            (k_proj): Linear(in_features=2048, out_features=512, bias=False)
            (v_proj): Linear(in_features=2048, out_features=512, bias=False)
            (o_proj): Linear(in_features=4096, out_features=2048, bias=False)
            (q_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3OmniMoeThinkerTextSparseMoeBlock(
            (gate): Linear(in_features=2048, out_features=128, bias=False)
            (experts): ModuleList(
              (0-127): 128 x Qwen3OmniMoeThinkerTextMLP(
                (gate_proj): Linear(in_features=2048, out_features=768, bias=False)
                (up_proj): Linear(in_features=2048, out_features=768, bias=False)
                (down_proj): Linear(in_features=768, out_features=2048, bias=False)
                (act_fn): SiLU()
              )
            )
          )
          (input_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((2048,), eps=1e-06)
          (post_attention_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((2048,), eps=1e-06)
        )
      )
      (norm): Qwen3OmniMoeTextRMSNorm((2048,), eps=1e-06)
      (rotary_emb): Qwen3OmniMoeThinkerTextRotaryEmbedding()
    )
    (lm_head): Linear(in_features=2048, out_features=152064, bias=False)
  )
  (talker): Qwen3OmniMoeTalkerForConditionalGeneration(
    (model): Qwen3OmniMoeTalkerModel(
      (layers): ModuleList(
        (0-19): 20 x Qwen3OmniMoeTalkerDecoderLayer(
          (self_attn): Qwen3OmniMoeThinkerTextAttention(
            (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
            (k_proj): Linear(in_features=1024, out_features=256, bias=False)
            (v_proj): Linear(in_features=1024, out_features=256, bias=False)
            (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (q_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3OmniMoeThinkerTextRMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3OmniMoeTalkerTextSparseMoeBlock(
            (gate): Linear(in_features=1024, out_features=128, bias=False)
            (experts): ModuleList(
              (0-127): 128 x Qwen3OmniMoeTalkerTextMLP(
                (gate_proj): Linear(in_features=1024, out_features=384, bias=False)
                (up_proj): Linear(in_features=1024, out_features=384, bias=False)
                (down_proj): Linear(in_features=384, out_features=1024, bias=False)
                (act_fn): SiLU()
              )
            )
            (shared_expert): Qwen3OmniMoeTalkerTextMLP(
              (gate_proj): Linear(in_features=1024, out_features=768, bias=False)
              (up_proj): Linear(in_features=1024, out_features=768, bias=False)
              (down_proj): Linear(in_features=768, out_features=1024, bias=False)
              (act_fn): SiLU()
            )
            (shared_expert_gate): Linear(in_features=1024, out_features=1, bias=False)
          )
          (input_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((1024,), eps=1e-06)
          (post_attention_layernorm): Qwen3OmniMoeThinkerTextRMSNorm((1024,), eps=1e-06)
        )
      )
      (norm): Qwen3OmniMoeTextRMSNorm((1024,), eps=1e-06)
      (rotary_emb): Qwen3OmniMoeTalkerRotaryEmbedding()
      (codec_embedding): Embedding(3072, 1024)
    )
    (text_projection): Qwen3OmniMoeTalkerResizeMLP(
      (linear_fc1): Linear(in_features=2048, out_features=2048, bias=True)
      (linear_fc2): Linear(in_features=2048, out_features=1024, bias=True)
      (act_fn): SiLU()
    )
    (hidden_projection): Qwen3OmniMoeTalkerResizeMLP(
      (linear_fc1): Linear(in_features=2048, out_features=2048, bias=True)
      (linear_fc2): Linear(in_features=2048, out_features=1024, bias=True)
      (act_fn): SiLU()
    )
    (codec_head): Linear(in_features=1024, out_features=3072, bias=False)
    (code_predictor): Qwen3OmniMoeTalkerCodePredictorModelForConditionalGeneration(
      (model): Qwen3OmniMoeTalkerCodePredictorModel(
        (layers): ModuleList(
          (0-4): 5 x Qwen3OmniMoeTalkerCodePredictorDecoderLayer(
            (self_attn): Qwen3OmniMoeTalkerCodePredictorAttention(
              (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
              (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
              (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
              (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
              (q_norm): Qwen3OmniMoeRMSNorm((128,), eps=1e-06)
              (k_norm): Qwen3OmniMoeRMSNorm((128,), eps=1e-06)
            )
            (mlp): Qwen3OmniMoeMLP(
              (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
              (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
              (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-06)
            (post_attention_layernorm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-06)
          )
        )
        (norm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-06)
        (rotary_emb): Qwen3OmniMoeRotaryEmbedding()
        (codec_embedding): ModuleList(
          (0-14): 15 x Embedding(2048, 1024)
        )
      )
      (lm_head): ModuleList(
        (0-14): 15 x Linear(in_features=1024, out_features=2048, bias=False)
      )
    )
  )
  (code2wav): Qwen3OmniMoeCode2Wav(
    (pre_transformer): Qwen3OmniMoeCode2WavTransformerModel(
      (layers): ModuleList(
        (0-7): 8 x Qwen3OmniMoeCode2WavTransformerLayer(
          (self_attn): Qwen3OmniMoeCode2WavAttention(
            (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (q_norm): Identity()
            (k_norm): Identity()
          )
          (mlp): Qwen3OmniMoeCode2WavMlp(
            (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
            (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen3OmniMoeCode2WavRMSNorm((1024,), eps=1e-05)
          (post_attention_layernorm): Qwen3OmniMoeCode2WavRMSNorm((1024,), eps=1e-05)
          (self_attn_layer_scale): Qwen3OmniMoeCode2WavLayerScale()
          (mlp_layer_scale): Qwen3OmniMoeCode2WavLayerScale()
        )
      )
      (norm): Qwen3OmniMoeRMSNorm((1024,), eps=1e-05)
      (rotary_emb): Qwen3OmniMoeRotaryEmbedding()
    )
    (code_embedding): Embedding(32768, 1024)
    (upsample): ModuleList(
      (0-1): 2 x ModuleList(
        (0): Qwen3OmniMoeCausalTransConvNet(
          (conv): ConvTranspose1d(1024, 1024, kernel_size=(2,), stride=(2,))
        )
        (1): Qwen3OmniMoeConvNeXtBlock(
          (dwconv): Qwen3OmniMoeCausalConvNet(
            (conv): Conv1d(1024, 1024, kernel_size=(7,), stride=(1,), groups=1024)
          )
          (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (pwconv1): Linear(in_features=1024, out_features=4096, bias=True)
          (act): GELU(approximate='none')
          (pwconv2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (decoder): ModuleList(
      (0): Qwen3OmniMoeCausalConvNet(
        (conv): Conv1d(1024, 1536, kernel_size=(7,), stride=(1,))
      )
      (1): Qwen3OmniMoeCode2WavDecoderBlock(
        (block): ModuleList(
          (0): SnakeBeta()
          (1): Qwen3OmniMoeCausalTransConvNet(
            (conv): ConvTranspose1d(1536, 768, kernel_size=(16,), stride=(8,))
          )
          (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(768, 768, kernel_size=(7,), stride=(1,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,))
            )
          )
          (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(768, 768, kernel_size=(7,), stride=(1,), dilation=(3,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,))
            )
          )
          (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(768, 768, kernel_size=(7,), stride=(1,), dilation=(9,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(768, 768, kernel_size=(1,), stride=(1,))
            )
          )
        )
      )
      (2): Qwen3OmniMoeCode2WavDecoderBlock(
        (block): ModuleList(
          (0): SnakeBeta()
          (1): Qwen3OmniMoeCausalTransConvNet(
            (conv): ConvTranspose1d(768, 384, kernel_size=(10,), stride=(5,))
          )
          (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(384, 384, kernel_size=(7,), stride=(1,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
            )
          )
          (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(384, 384, kernel_size=(7,), stride=(1,), dilation=(3,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
            )
          )
          (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(384, 384, kernel_size=(7,), stride=(1,), dilation=(9,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(384, 384, kernel_size=(1,), stride=(1,))
            )
          )
        )
      )
      (3): Qwen3OmniMoeCode2WavDecoderBlock(
        (block): ModuleList(
          (0): SnakeBeta()
          (1): Qwen3OmniMoeCausalTransConvNet(
            (conv): ConvTranspose1d(384, 192, kernel_size=(8,), stride=(4,))
          )
          (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(192, 192, kernel_size=(7,), stride=(1,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
            )
          )
          (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(192, 192, kernel_size=(7,), stride=(1,), dilation=(3,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
            )
          )
          (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(192, 192, kernel_size=(7,), stride=(1,), dilation=(9,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
            )
          )
        )
      )
      (4): Qwen3OmniMoeCode2WavDecoderBlock(
        (block): ModuleList(
          (0): SnakeBeta()
          (1): Qwen3OmniMoeCausalTransConvNet(
            (conv): ConvTranspose1d(192, 96, kernel_size=(6,), stride=(3,))
          )
          (2): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(96, 96, kernel_size=(7,), stride=(1,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(96, 96, kernel_size=(1,), stride=(1,))
            )
          )
          (3): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(96, 96, kernel_size=(7,), stride=(1,), dilation=(3,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(96, 96, kernel_size=(1,), stride=(1,))
            )
          )
          (4): Qwen3OmniMoeCode2WavDecoderResidualUnit(
            (act1): SnakeBeta()
            (conv1): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(96, 96, kernel_size=(7,), stride=(1,), dilation=(9,))
            )
            (act2): SnakeBeta()
            (conv2): Qwen3OmniMoeCausalConvNet(
              (conv): Conv1d(96, 96, kernel_size=(1,), stride=(1,))
            )
          )
        )
      )
      (5): SnakeBeta()
      (6): Qwen3OmniMoeCausalConvNet(
        (conv): Conv1d(96, 1, kernel_size=(7,), stride=(1,))
      )
    )
  )
)

⭐️ Designs for Streaming and Concurrency

(figures: streaming and concurrency design diagrams from the Qwen3Omni technical report)

gemini-code-assist (Contributor) commented:
Summary of Changes

Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Qwen3Omni model's capabilities by introducing streaming generation for real-time interactions and integrating it with vLLM for optimized inference. It expands the existing multimodal functionalities with new test cases, ensuring robust support for diverse applications involving text, audio, image, and video inputs and outputs.

Highlights

  • Streaming Generation for Qwen3Omni: Implemented chunked streaming for the Qwen3Omni model within the transformers framework, allowing for real-time text and audio output. This includes new methods like thinker_generate_chunk and generate_stream.
  • vLLM Integration for Qwen3Omni: Added support for running Qwen3Omni models using a forked vLLM, optimizing inference performance for various multimodal tasks.
  • Comprehensive Modal Test Cases: Introduced a wide array of Modal test cases covering various multimodal interactions (audio, image, video, text) for both standard and new streaming functionalities, ensuring robust support.
  • EOS Token ID Correction: Updated the End-of-Sequence (EOS) token IDs in the Qwen2.5 Omni model configuration from 151644 to 151643 for improved tokenization and generation control.

weedge added the streaming, Omni (Omni Modality), transformers, and qwen3 (base bone model: qwen3) labels on Sep 26, 2025
weedge (Collaborator, Author) commented Sep 26, 2025

vllm & MLLMs

  1. https://www.youtube.com/watch?v=N89ES-xUjlE (Roblox's Journey to Supporting Multimodality on vLLM | Ray Summit 2024)
  2. https://www.youtube.com/watch?v=glXOTyBfjS8 (vLLM Office Hours: Multimodal Models in vLLM with Roblox; same content as item 1)
  3. https://www.youtube.com/watch?v=-WqJ8PbnFg8&t=1195s (Multimodal LLMs With vLLM v1)

gemini-code-assist bot left a comment
Code Review

This pull request introduces Modal applications for running the Qwen3-Omni model, including a new chunked streaming implementation using transformers and an alternative setup using a forked version of vllm.

The transformers implementation in qwen3_omni.py adds valuable streaming capabilities. However, I've identified a critical performance issue in the thinker_generate_chunk method due to inefficient use of the KV cache, which should be addressed. Additionally, there's some dead code and extensive use of print statements that could be improved.

The vllm implementation in qwen3_omni_fork.py has several issues that need attention. It depends on a personal fork of vllm, which is a maintenance risk. More importantly, it has a critical bug in its entrypoint and appears to have incomplete support for audio generation, which affects many of the provided example tasks.

The minor change to qwen2_5_omni.py is a correct bug fix. My review focuses on improving the new Modal applications for robustness and performance.

weedge merged commit f30901f into main on Sep 28, 2025
weedge changed the title from "feat: add transformers Qwen3Omni thinker chunk stream" to "feat: add transformers Qwen3Omni thinker chunk stream for livekit webrtc room vision+voice bot" on Sep 28, 2025