Signed-off-by: weedge <weege007@gmail.com>
Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces VITA-Audio to the project, including necessary dependencies and a modal app for running audio-related tasks. It adds a new submodule for VITA-Audio and sets up the environment for tasks like speech-to-speech (s2s), automatic speech recognition (asr), and text-to-speech (tts). The modal app configuration includes GPU setup, model loading, and task execution functions.
Highlights
- Submodule Addition: The `.gitmodules` file is updated to include VITA-Audio as a submodule, pointing to the weedge/VITA-Audio repository.
- Modal App Implementation: A new modal app, `vita_voice.py`, is created to handle VITA-Audio tasks. It includes image configuration with CUDA, dependency installation, and task-specific functions.
- Dependency Management: The modal image installs necessary dependencies such as `git`, `ffmpeg`, `torch`, `torchaudio`, and `flash-attn` to support audio processing and model execution.
- Task Execution Functions: The `vita_voice.py` script defines functions for tokenizing, model dumping, and benchmarking various audio tasks (STS, ASR, TTS) with streaming and non-streaming options.
- Inference Class: The `S2SInference` class is implemented to handle sequence-to-sequence inference, including loading models, tokenizers, and running different audio processing pipelines.
Changelog
- .gitmodules
- Added VITA-Audio as a submodule.
- deploy/modal/src/llm/transformers/vita_voice.py
- Created a new modal app for VITA-Audio tasks.
- Configured the modal image with CUDA and necessary dependencies.
- Implemented functions for tokenizing, model dumping, and benchmarking audio tasks.
- Defined the `S2SInference` class for handling sequence-to-sequence inference.
- deps/VITAAudio
- Added a file that represents the VITA-Audio submodule commit.
Text tokens and audio tokens are generated in an interleaved fashion.

Inference with MTP proceeds as follows:

Note: in Qwen2DecoderLayer, the later the layer, the richer the semantic information carried by the generated hidden states. (For a network that is large in width, depth, and parameter count, MTP works better.) Code:

- Qwen2MTP_AR_LLM_Boost training/inference forward step: https://huggingface.co/VITA-MLLM/VITA-Audio-Boost/blob/main/modeling_qwen2.py#L781
  Qwen2MTP_AR_LLM_Boost: 10317.912576 M parameters (qwen2 + 10 MTP modules: Linear (FC) projs + embed_norms + mtp_hidden_norms)
- Qwen2MTP_AR_LLM_Balance training/inference forward step: https://huggingface.co/VITA-MLLM/VITA-Audio-Balance/blob/main/modeling_qwen2.py#L781
  Qwen2MTP_AR_LLM_Balance: 10317.912576 M parameters (qwen2 + 10 MTP modules: Linear (FC) projs + embed_norms + mtp_hidden_norms)
- Qwen2MTP_AR_LLM_Vanilla training/inference forward step: https://huggingface.co/VITA-MLLM/VITA-Audio-Plus-Vanilla/blob/main/modeling_qwen2.py#L834
  Qwen2MTP_AR_LLM_Vanilla: 7979.042111 M parameters (sensevoice small encoder (CTC header unused) + mlp (ResamplerProjector) + qwen2), no MTP
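The MTP chain above can be sketched in a few lines: each MTP module fuses the previous hidden state with the embedding of the just-predicted token, projects and normalizes it, then emits logits for one extra token. This is a minimal NumPy sketch under those assumptions; the class and variable names are illustrative, not the actual VITA-Audio code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size, n_mtp = 16, 50, 10

class MTPHead:
    """Hypothetical MTP module: FC proj over [hidden; token_embed] + norm + LM head."""

    def __init__(self):
        self.W_proj = rng.normal(size=(2 * hidden_dim, hidden_dim)) * 0.1
        self.W_out = rng.normal(size=(hidden_dim, vocab_size)) * 0.1

    def __call__(self, hidden, token_embed):
        fused = np.concatenate([hidden, token_embed]) @ self.W_proj
        hidden = fused / (np.linalg.norm(fused) + 1e-6)  # stand-in for the norm layers
        return hidden, hidden @ self.W_out

embed = rng.normal(size=(vocab_size, hidden_dim))
heads = [MTPHead() for _ in range(n_mtp)]

hidden = rng.normal(size=hidden_dim)  # last-layer LLM hidden state (stub)
token = 42                            # token predicted by the LLM head (stub)
extra_tokens = []
for head in heads:
    hidden, logits = head(hidden, embed[token])
    token = int(np.argmax(logits))    # greedy pick for the sketch
    extra_tokens.append(token)

print(len(extra_tokens))  # 10 extra tokens from a single LLM forward pass
```

With 10 chained heads, one LLM forward pass yields 11 tokens in total, which is where the speedup for audio-token generation comes from.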
glm4voice tokenizer + glm4voice decoder: THUDM/glm-4-voice-tokenizer, THUDM/glm-4-voice-decoder
…evoice_glm4voice(sensevoice WavFrontend, decoder) Signed-off-by: weedge <weege007@gmail.com>
…ita_audio_asr llm_transformers_manual_vita_tts llm_transformers_manual_vita_text_voice llm_transformers_manual_vita_voice Signed-off-by: weedge <weege007@gmail.com>
training script args
deepspeed args:

```json
{
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 6e-05,
      "betas": [0.9, 0.95],
      "eps": 1e-08,
      "weight_decay": 0.0
    }
  },
  "scheduler": {
    "type": "WarmupCosineLR",
    "params": {
      "total_num_steps": 8.000000e+03,
      "warmup_min_ratio": 0,
      "warmup_num_steps": 240,
      "cos_min_ratio": 0.1
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5.000000e+08,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5.000000e+08,
    "contiguous_gradients": true,
    "round_robin_gradients": true,
    "sub_group_size": 1.000000e+12
  },
  "gradient_accumulation_steps": 16,
  "gradient_clipping": 1.0,
  "steps_per_print": inf,
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false,
  "dump_state": false
}
```

To further optimize the memory efficiency of the DeepSpeed ZeRO optimizer, consider the following:
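First, a quick sanity check on the batch-size arithmetic in the dump above (assuming a single-GPU run, which is not stated in the dump):

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    # DeepSpeed requires:
    # train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
    return micro_batch_per_gpu * grad_accum_steps * world_size

# Values from the config above, assuming world_size = 1:
assert effective_batch_size(1, 16, 1) == 16  # matches "train_batch_size": 16
print(effective_batch_size(1, 16, 1))
```

Raising `gradient_accumulation_steps` (or `world_size`) scales the effective batch without increasing per-GPU activation memory.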
General tips:

- Increase the effective batch size (via `train_batch_size` or `gradient_accumulation_steps`) to improve GPU utilization.
- Use half precision (FP16) or mixed precision (BF16) for model parameters and computation to reduce memory usage and speed up compute.
- Ensure sufficient cluster network bandwidth (InfiniBand or high-performance Ethernet recommended).
- If GPU memory is insufficient, enable optimizer and parameter offload (`offload_optimizer` and `offload_param`).
- For deep models, enable activation checkpointing (`"activation_checkpointing": true`) to significantly reduce memory usage.
- Choose the ZeRO stage that fits the model size and hardware resources.

In DeepSpeed, ZeRO (Zero Redundancy Optimizer) reduces memory usage through three stages:

- ZeRO Stage 1: partitions the optimizer states. Suitable for small-to-medium models with moderate memory needs.
- ZeRO Stage 2: partitions the optimizer states and gradients; each GPU still holds a full copy of the model parameters.
- ZeRO Stage 3: partitions the optimizer states, gradients, and model parameters.

In short: from ZeRO-1 to ZeRO-3, higher stages need less GPU memory but train progressively more slowly. Additionally, setting `offload_param=cpu` cuts memory requirements dramatically, but also slows training dramatically. If you have enough GPU memory, use ZeRO-1 and make sure `offload_param=none`.

Details: https://weedge.github.io/post/llm/trainingparallelstrategy/
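A back-of-the-envelope sketch of the memory trade-off across the three stages, assuming bf16 training with Adam (2 B/param bf16 weights, 2 B/param bf16 grads, 12 B/param fp32 master weights + momentum + variance). The sharding rules here are simplified, not DeepSpeed's exact accounting:

```python
def zero_memory_per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Rough per-GPU memory (GB) for bf16 + Adam training.

    Simplified sharding assumption:
    stage >= 1 shards optimizer states, stage >= 2 also gradients,
    stage >= 3 also parameters. Activations and buffers are ignored.
    """
    params = 2 * n_params / (n_gpus if stage >= 3 else 1)
    grads = 2 * n_params / (n_gpus if stage >= 2 else 1)
    optim = 12 * n_params / (n_gpus if stage >= 1 else 1)
    return (params + grads + optim) / 1e9

# e.g. a ~10.3B-parameter Qwen2 MTP model on a hypothetical 8-GPU node:
for stage in (1, 2, 3):
    print(f"ZeRO-{stage}: {zero_memory_per_gpu_gb(10.3e9, 8, stage):.1f} GB/GPU")
```

The estimate shows why the config above (stage 2, no offload) fits comfortably on large-memory GPUs while stage 3 would be needed on smaller ones.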



VITA-Audio consists of four main components: an audio encoder, an audio decoder, a large language model backbone, and a set of Multiple Cross-modal Token Prediction (MCTP) modules. The Boost/Balance variants are similar to GLM-4-Voice: Towards Intelligent and Human-like End-To-End Spoken Chatbot, but add MTP, and use CosyVoice: A Scalable Multilingual Zero-Shot Text-To-Speech Synthesizer Based on Supervised Semantic Tokens as the audio encoder and decoder. The audio signal is first encoded by the audio encoder into a sequence of discrete audio tokens, which are then fed into the LLM for processing. During each forward pass, the LLM generates text and audio tokens in an interleaved fashion. The hidden states from the LLM's last layer, together with the embedding of the predicted token, serve as input to the MCTP modules. The historical input tokens and the tokens predicted by the LLM and the MCTP modules are concatenated to form the input for the next LLM forward pass. Finally, the audio tokens produced by the LLM and the MCTP modules are aggregated and passed to the audio decoder to generate the final audio output.
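The alternating decode loop described above can be sketched as a pure-Python stub; `generate_step` stands in for one LLM forward pass plus the MCTP modules, and the round/chunk counts are illustrative, not VITA-Audio's actual schedule:

```python
def generate_step(history, modality):
    # Stub for one LLM forward pass + MCTP heads: a real implementation
    # would return one text token or a chunk of discrete audio tokens.
    return [f"{modality}-{len(history)}"]

def interleaved_decode(prompt, n_rounds=3, audio_chunk=2):
    history = list(prompt)
    text_tokens, audio_tokens = [], []
    for _ in range(n_rounds):
        # Alternate: a text token, then a chunk of audio tokens; everything
        # generated is appended to the history for the next forward pass.
        t = generate_step(history, "text")
        history += t
        text_tokens += t
        for _ in range(audio_chunk):
            a = generate_step(history, "audio")
            history += a
            audio_tokens += a
    # audio_tokens would be aggregated and sent to the audio decoder.
    return text_tokens, audio_tokens

text, audio = interleaved_decode(["<bos>"])
print(len(text), len(audio))  # 3 6
```

The key point the sketch captures is that both streams share one autoregressive history, so text and audio stay aligned during streaming playback.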
Training: #146 (comment)
Inference pipeline analysis: #146 (comment)
feat:
```shell
# ASR streaming test (eager attention)
LLM_MODEL_NAME_OR_PATH=./models/VITA-MLLM/VITA-Audio-Plus-Vanilla \
SENSE_VOICE_MODEL_PATH=./models/FunAudioLLM/SenseVoiceSmall \
AUDIO_TOKENIZER_TYPE=sensevoice_glm4voice \
LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 \
python -m unittest test.modules.speech.asr.test_vita_asr.TestVITAASR.test_transcribe_stream

# ASR streaming test (flash_attention_2)
LLM_MODEL_NAME_OR_PATH=./models/VITA-MLLM/VITA-Audio-Plus-Vanilla \
SENSE_VOICE_MODEL_PATH=./models/FunAudioLLM/SenseVoiceSmall \
AUDIO_TOKENIZER_TYPE=sensevoice_glm4voice \
LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 LLM_ATTN_IMP=flash_attention_2 \
python -m unittest test.modules.speech.asr.test_vita_asr.TestVITAASR.test_transcribe_stream

# TTS test
LLM_MODEL_NAME_OR_PATH=./models/VITA-MLLM/VITA-Audio-Plus-Vanilla \
AUDIO_TOKENIZER_TYPE=sensevoice_glm4voice \
FLOW_PATH=./models/THUDM/glm-4-voice-decoder \
LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 LLM_ATTN_IMP=flash_attention_2 \
python -m unittest test.modules.speech.tts.test_vita.TestVITATTS.test_synthesize

# gRPC speaker client
TTS_TAG=tts_vita IS_SAVE=1 \
LLM_MODEL_NAME_OR_PATH=./models/VITA-MLLM/VITA-Audio-Plus-Vanilla \
AUDIO_TOKENIZER_TYPE=sensevoice_glm4voice \
FLOW_PATH=./models/THUDM/glm-4-voice-decoder \
LLM_DEVICE=cuda LLM_TORCH_DTYPE=bfloat16 LLM_ATTN_IMP=flash_attention_2 \
python -m src.cmd.grpc.speaker.client

# run fastapi_webrtc_vita_voice_bot_serve
ACHATBOT_VERSION=0.0.11 IMAGE_CONCURRENT_CN=1 IMAGE_GPU=L40s \
modal serve src/fastapi_webrtc_vita_voice_bot_serve.py
```

speech -> text + speech | use Plus-Vanilla qwen2_mtp_sensevoice (no MTP)
speech -> text + speech | use Balance/Boost qwen2_mtp LLM with mtp
```json
{
  "chat_bot_name": "LivekitVitaVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": { "bot_name": "LivekitVitaVoiceBot", "is_common_session": false }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "llm_transformers_manual_vita_voice"
  },
  "config": {
    "vad": { "tag": "silero_vad_analyzer", "args": { "stop_secs": 0.7 } },
    "voice_llm": {
      "tag": "llm_transformers_manual_vita_voice",
      "args": {
        "no_stream_sleep_time": 0.5,
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "audio_tokenizer_type": "glm4voice",
        "audio_tokenizer_model_path": "/root/.achatbot/models/THUDM/glm-4-voice-tokenizer",
        "sense_voice_model_path": null,
        "flow_path": "/root/.achatbot/models/THUDM/glm-4-voice-decoder",
        "audio_tokenizer_rank": 0,
        "chunk_size_list": [8, 16, 25, 50, 100, 150, 200],
        "lm_model_name_or_path": "/root/.achatbot/models/VITA-MLLM/VITA-Audio-Balance"
      }
    }
  },
  "config_list": []
}
```

asr + text -> text + speech | use qwen2_mtp_sensevoice LLM (no MTP)
asr + text -> text+speech | use Balance/Boost qwen2_mtp LLM
```json
{
  "chat_bot_name": "LivekitAsrVITAVoiceBot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "livekit_room",
    "args": { "bot_name": "LivekitAsrVITAVoiceBot", "is_common_session": false }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "asr": "sense_voice_asr",
    "voice_llm": "llm_transformers_manual_vita_text_voice"
  },
  "config": {
    "vad": { "tag": "silero_vad_analyzer", "args": { "stop_secs": 0.7 } },
    "asr": {
      "args": {
        "language": "zn",
        "model_name_or_path": "/root/.achatbot/models/FunAudioLLM/SenseVoiceSmall"
      },
      "tag": "sense_voice_asr"
    },
    "voice_llm": {
      "tag": "llm_transformers_manual_vita_voice",
      "args": {
        "no_stream_sleep_time": 0.5,
        "lm_device": "cuda",
        "lm_torch_dtype": "bfloat16",
        "lm_attn_impl": "flash_attention_2",
        "warmup_steps": 1,
        "chat_history_size": 0,
        "audio_tokenizer_type": "glm4voice",
        "audio_tokenizer_model_path": null,
        "sense_voice_model_path": null,
        "flow_path": "/root/.achatbot/models/THUDM/glm-4-voice-decoder",
        "audio_tokenizer_rank": 0,
        "chunk_size_list": [8, 16, 25, 50, 100, 150, 200],
        "lm_model_name_or_path": "/root/.achatbot/models/VITA-MLLM/VITA-Audio-Balance"
      }
    }
  },
  "config_list": []
}
```

paper ai podcast: https://podcast-997.pages.dev/podcast/81f84087d1884716b2d6acec1273ccb1
colab: https://github.com/weedge/doraemon-nb/blob/main/achatbot_vita_audio.ipynb
Tip
Goal: (the training and inference source code is public, and the intermediate training stages appear to be open-sourced as well; one can continue training with additional datasets, or train an MTP model with a different audio tokenizer (encoder) and LLM (for large-parameter models such as DeepSeek-V3, the MTP model can be trained with LoRA), but the overall framework is much the same)

LLMs fall into two categories: dense models and sparse models (MoE).

Audio tokenizers fall into:

Engineering:
reference
mtp: