feat: add transformers Qwen3Omni thinker chunk stream for livekit webrtc room vision+voice bot #196
Signed-off-by: weedge <weege007@gmail.com>
Summary of Changes (from Gemini Code Assist): This pull request enhances the Qwen3Omni model's capabilities by introducing streaming generation for real-time interaction and integrating it with vLLM for optimized inference. It expands the existing multimodal functionality with new test cases, ensuring robust support for applications involving text, audio, image, and video inputs and outputs.
vllm & MLLMs
Code Review
This pull request introduces Modal applications for running the Qwen3-Omni model, including a new chunked streaming implementation using transformers and an alternative setup using a forked version of vllm.
The transformers implementation in qwen3_omni.py adds valuable streaming capabilities. However, I've identified a critical performance issue in the thinker_generate_chunk method due to inefficient use of the KV cache, which should be addressed. Additionally, there's some dead code and extensive use of print statements that could be improved.
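As a sketch of the KV-cache point: in a chunked streaming loop, only the first step should feed the whole prompt; every later step should feed just the newly generated token together with the cached key/values, rather than re-encoding the full prefix. The toy model below is a hypothetical stand-in (not the PR's actual `thinker_generate_chunk`), but the control flow mirrors an HF-style cached generation loop.

```python
# Sketch: chunked streaming generation that reuses the KV cache across
# chunks instead of re-running the full prefix each step.
# FakeModel is a hypothetical stand-in; with transformers you would pass
# `past_key_values` back into forward() in the same way.

class FakeModel:
    """Toy stand-in: its 'cache' is just the list of tokens seen so far."""

    def forward(self, new_tokens, past_kv=None):
        cache = list(past_kv or [])
        cache.extend(new_tokens)          # KV cache grows by new tokens only
        next_token = len(cache)           # deterministic dummy "argmax"
        return next_token, cache


def generate_chunks(model, prompt, max_new_tokens, tokens_per_step):
    past_kv = None
    new_input = list(prompt)              # first step: feed the whole prompt
    produced = 0
    while produced < max_new_tokens:
        chunk = []
        for _ in range(min(tokens_per_step, max_new_tokens - produced)):
            tok, past_kv = model.forward(new_input, past_kv)
            new_input = [tok]             # later steps: ONE token + cache
            chunk.append(tok)
            produced += 1
        yield chunk                       # stream the chunk downstream


model = FakeModel()
out = [t for c in generate_chunks(model, [1, 2, 3], 5, 2) for t in c]
print(out)
```

The point of the pattern is that each decode step is O(1) in new work; re-feeding the full prefix every step (the issue flagged above) makes each step O(prefix length).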
The vllm implementation in qwen3_omni_fork.py has several issues that need attention. It depends on a personal fork of vllm, which is a maintenance risk. More importantly, it has a critical bug in its entrypoint and appears to have incomplete support for audio generation, which affects many of the provided example tasks.
The minor change to qwen2_5_omni.py is a correct bug fix. My review focuses on improving the new Modal applications for robustness and performance.
Signed-off-by: weedge <weege007@gmail.com>
Example answer from the vision+voice bot demo (translated from Chinese):

* **Pros**: This drink is designed specifically for post-workout use. Its core ingredient is electrolytes; the label explicitly states "electrolytes ≥ 200 mg", which effectively replenishes the sodium, potassium, and other minerals lost through heavy sweating during exercise, helping maintain fluid balance and prevent cramps. It also contains vitamin E and vitamin B6, which aid energy metabolism. It is also 0 sugar / 0 calories, so there is no extra energy intake to worry about.

On the vLLM side, the `execute_model` method (for example `GPUModelRunner.execute_model`) obtains the `Qwen3OmniMoeForConditionalGeneration` class registered in the HF config.json (the thinker, which already supports multimodal input understanding) and runs inference (forward). Inference for the MoE LM is compiled and optimized with torch.compile, which requires warmup at startup; see the code implementation.
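The need for startup warmup with torch.compile-style lazy compilation can be illustrated with a toy, pure-Python stand-in (the compile-tracking logic below is hypothetical, not vLLM's actual code): the first call for each new input shape pays a one-time compile cost, so the server runs a few dummy warmup batches before accepting traffic.

```python
# Toy model of shape-specialized lazy compilation: the first call per input
# shape "compiles"; warmup hits the expected shapes so serving calls don't.
compiled_shapes = set()
compile_events = []

def run_model(batch):
    shape = len(batch)
    if shape not in compiled_shapes:      # first time seeing this shape
        compile_events.append(shape)      # record the (slow) "compile"
        compiled_shapes.add(shape)
    return [x * 2 for x in batch]         # the actual (cheap) forward pass

# Warmup: hit the shapes expected in production (cf. "warmup_steps": 2).
for warm_batch in ([0], [0, 0]):
    run_model(warm_batch)

# Serving: these calls now skip compilation entirely.
assert run_model([1]) == [2]
assert run_model([3, 4]) == [6, 8]
print(compile_events)
```

With real torch.compile the same idea holds: recompilation is triggered by new input shapes (or graph breaks), so warming up with representative batch shapes moves the latency out of the request path.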
livekit_qwen3omni_vision_voice_bot.json
```json
{
  "chat_bot_name": "LivekitQwen3OmniVisionVoiceBot",
  "handle_sigint": true,
  "is_background": false,
  "save_audio": true,
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": {
      "bot_name": "LivekitQwen3OmniVisionVoiceBot",
      "is_common_session": false
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "omni_llm": "llm_transformers_manual_qwen3omni_vision_voice"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "omni_llm": {
      "processor": "Qwen3OmnVisionVoiceProcessor",
      "tag": "llm_transformers_manual_qwen3omni_vision_voice",
      "args": {
        "no_stream_sleep_time": 0.0,
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 2,
        "chat_history_size": 0,
        "thinker_eos_token_ids": [151643, 151645],
        "thinker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 1024,
          "lm_gen_max_tokens_per_step": 10,
          "lm_gen_repetition_penalty": 1.1
        },
        "talker_args": {
          "lm_gen_temperature": 0.95,
          "lm_gen_top_k": 20,
          "lm_gen_top_p": 0.9,
          "lm_gen_min_new_tokens": 1,
          "lm_gen_max_new_tokens": 2048,
          "lm_gen_repetition_penalty": 1.1
        },
        "code2wav_args": {
          "chunk_size": 50,
          "left_context_size": 25
        },
        "speaker": "Chelsie",
        "lm_model_name_or_path": "/root/.achatbot/models/Qwen/Qwen3-Omni-30B-A3B-Instruct"
      }
    }
  },
  "config_list": []
}
```

AI podcast:
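As a usage sketch, the `thinker_args` above map naturally onto HF `generate()`-style kwargs once the `lm_gen_` prefix is stripped. The renaming below is an illustration of that convention, not achatbot's actual config loader; the config excerpt is abbreviated.

```python
# Load a trimmed-down version of the bot config and turn thinker_args
# into generate()-style kwargs by stripping the "lm_gen_" prefix.
import json

config_text = """
{ "services": {"omni_llm": "llm_transformers_manual_qwen3omni_vision_voice"},
  "config": { "omni_llm": { "args": { "thinker_args": {
      "lm_gen_temperature": 0.95, "lm_gen_top_k": 20, "lm_gen_top_p": 0.9,
      "lm_gen_max_new_tokens": 1024, "lm_gen_max_tokens_per_step": 10 } } } } }
"""

cfg = json.loads(config_text)
thinker = cfg["config"]["omni_llm"]["args"]["thinker_args"]
# Strip the "lm_gen_" prefix to get plain sampling-parameter names.
gen_kwargs = {k.removeprefix("lm_gen_"): v for k, v in thinker.items()}
print(gen_kwargs["temperature"], gen_kwargs["max_new_tokens"])
```

Note `lm_gen_max_tokens_per_step` is the chunked-streaming knob: it bounds how many tokens the thinker emits per streaming step before yielding a chunk downstream.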
HF Transformers qwen3_omni_moe : https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py
Model Neural Network Architecture (parameter counts):

| Component | Module | Parameters |
| --- | --- | --- |
| Thinker (includes the Audio Encoder, AuT) | `Qwen3OmniMoeForConditionalGeneration.thinker` | 31719.205488 M |
| Talker | `Qwen3OmniMoeForConditionalGeneration.talker` | 3324.59648 M |
| Streaming Codec Decoder (code2wav) | `Qwen3OmniMoeForConditionalGeneration.code2wav` | 216.016577 M |
| Full model | `Qwen3OmniMoeForConditionalGeneration` | 35259.818545 M |
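As a quick sanity check, the three per-component counts sum exactly to the full-model count:

```python
# Parameter counts in millions, copied from the component list above.
components = {
    "thinker": 31719.205488,
    "talker": 3324.59648,
    "code2wav": 216.016577,
}
total_m = sum(components.values())
print(round(total_m, 6))  # should match the full-model count, 35259.818545 M
```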
⭐️ Designs for Streaming and Concurrency
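One concrete piece of the streaming design is the code2wav sliding window: with `chunk_size: 50` and `left_context_size: 25` (from the config above), each decode call consumes its chunk plus up to 25 preceding codec frames for continuity at chunk boundaries, but emits audio only for the new chunk. A minimal index-window sketch (the function name and return shape are illustrative, not the library's API):

```python
# Compute (decode input range, newly emitted range) pairs for streaming
# codec decoding with a left context window. Ranges are half-open.
def chunk_windows(num_codes, chunk_size=50, left_context=25):
    windows = []
    for start in range(0, num_codes, chunk_size):
        end = min(start + chunk_size, num_codes)
        ctx_start = max(0, start - left_context)
        # Decode [ctx_start, end) but only emit audio for [start, end).
        windows.append(((ctx_start, end), (start, end)))
    return windows

for w in chunk_windows(120):
    print(w)
```

Overlapping the decoder input while emitting only the new region is a standard way to avoid audible seams between chunks without re-decoding the whole sequence.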
Reference