This repository provides an accelerated implementation of Fun-ASR using vLLM. By leveraging vLLM's efficient attention mechanisms and memory management, it roughly doubles inference throughput over the HuggingFace PyTorch baseline at batch size 16 (see the benchmarks below) while maintaining accuracy.
To get started, clone the repository and install the required dependencies:
```bash
git clone https://github.com/yuekaizhang/Fun-ASR-vllm.git
cd Fun-ASR-vllm
apt-get install -y ffmpeg
uv pip install -r requirements.txt
```

Features:

- Support vLLM-accelerated inference
- Support batch > 1 inference (see the batch sketch after the Python example below)
- Support FunAudioLLM/Fun-ASR-MLT-Nano-2512 and FunAudioLLM/Fun-ASR-Nano-2512
- Integration with NVIDIA Triton Inference Server
You can run inference directly using the Python API:
```python
from model import FunASRNano
from vllm import LLM, SamplingParams


def main():
    model_dir = "FunAudioLLM/Fun-ASR-Nano-2512"

    # Load the base model
    m, kwargs = FunASRNano.from_pretrained(model=model_dir, device="cuda:0")
    m.eval()

    # Initialize vLLM
    vllm = LLM(
        model="yuekai/Fun-ASR-Nano-2512-vllm",
        enable_prompt_embeds=True,
        gpu_memory_utilization=0.4,
    )
    sampling_params = SamplingParams(
        top_p=0.001,
        max_tokens=500,
    )

    # Attach vLLM to the model
    m.vllm = vllm
    m.vllm_sampling_params = sampling_params

    # Run inference
    wav_path = f"{kwargs['model_path']}/example/zh.mp3"
    res = m.inference(data_in=[wav_path], **kwargs)
    print(res)
    text = res[0][0]["text"]
    print(text)


if __name__ == "__main__":
    main()
```
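Batch inference (batch > 1) follows the same pattern: `data_in` already takes a list, so passing several paths decodes them in a single batched call. A minimal sketch, assuming the setup above and that results come back as one entry per input in `res[0]`, mirroring the single-file indexing; the file names are illustrative:

```python
# Batch decoding sketch: pass multiple audio files in one call.
# File names are illustrative; reuses `m` and `kwargs` from the example above.
wav_paths = ["audio1.wav", "audio2.wav", "audio3.wav"]
res = m.inference(data_in=wav_paths, **kwargs)
for item in res[0]:
    print(item["text"])
```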
For multilingual FunAudioLLM/Fun-ASR-MLT-Nano-2512:

```python
from model import FunASRNano
from vllm import LLM, SamplingParams


def main():
    model_dir = "FunAudioLLM/Fun-ASR-MLT-Nano-2512"

    # Load the base model
    m, kwargs = FunASRNano.from_pretrained(model=model_dir, device="cuda:0")
    m.eval()

    # Initialize vLLM
    vllm = LLM(
        model="yuekai/Fun-ASR-MLT-Nano-2512-vllm",
        enable_prompt_embeds=True,
        gpu_memory_utilization=0.4,
    )
    sampling_params = SamplingParams(
        top_p=0.001,
        max_tokens=500,
    )

    # Attach vLLM to the model
    m.vllm = vllm
    m.vllm_sampling_params = sampling_params

    # Run inference. `language` takes the Chinese name of the target language.
    # Fun-ASR-Nano-2512 supports: 中文 (Chinese), 英文 (English), 日文 (Japanese).
    # Fun-ASR-MLT-Nano-2512 supports: 中文 (Chinese), 英文 (English),
    # 粤语 (Cantonese), 日文 (Japanese), 韩文 (Korean), 越南语 (Vietnamese),
    # 印尼语 (Indonesian), 泰语 (Thai), 马来语 (Malay), 菲律宾语 (Filipino),
    # 阿拉伯语 (Arabic), 印地语 (Hindi), 保加利亚语 (Bulgarian), 克罗地亚语 (Croatian),
    # 捷克语 (Czech), 丹麦语 (Danish), 荷兰语 (Dutch), 爱沙尼亚语 (Estonian),
    # 芬兰语 (Finnish), 希腊语 (Greek), 匈牙利语 (Hungarian), 爱尔兰语 (Irish),
    # 拉脱维亚语 (Latvian), 立陶宛语 (Lithuanian), 马耳他语 (Maltese), 波兰语 (Polish),
    # 葡萄牙语 (Portuguese), 罗马尼亚语 (Romanian), 斯洛伐克语 (Slovak),
    # 斯洛文尼亚语 (Slovenian), 瑞典语 (Swedish).
    wav_path = f"{kwargs['model_path']}/example/en.mp3"
    res = m.inference(data_in=[wav_path], language="英文", **kwargs)
    print(res)
    text = res[0][0]["text"]
    print(text)


if __name__ == "__main__":
    main()
```

To evaluate performance on a dataset (e.g., SpeechIO):
dataset_name="yuekai/speechio"
subset_name="SPEECHIO_ASR_ZH00007"
split_name="test"
uv run python \
infer.py \
--model_dir FunAudioLLM/Fun-ASR-Nano-2512 \
--huggingface_dataset $dataset_name \
--subset_name $subset_name \
--split_name $split_name \
--batch_size 16 \
--log_dir ./logs_vllm_$dataset_name_$subset_name \
--vllm_model_dir yuekai/Fun-ASR-Nano-2512-vllmTo evaluate multilingual performance of FunAudioLLM/Fun-ASR-MLT-Nano-2512:
dataset_name="google/fleurs"
subset_name="en_us"
split_name="test"
uv run python \
infer.py \
--model_dir FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
--huggingface_dataset $dataset_name \
--subset_name $subset_name \
--split_name $split_name \
--batch_size 16 \
--language "英文" \
--log_dir ./logs_mlt_${batch_size}_${dataset_name}_${subset_name} \
--vllm_model_dir yuekai/Fun-ASR-MLT-Nano-2512-vllmWe compared the performance of the standard HuggingFace PyTorch implementation against our vLLM-accelerated version.
Benchmark Details:
- Dataset: SPEECHIO_ASR_ZH00007 (approx. 1 hour of audio)
- Hardware: Single NVIDIA H20 GPU
| Mode | Decoding Time | RTF | RTFx | CER | Note |
|---|---|---|---|---|---|
| Huggingface PyTorch | 211.40 Secs | 0.0587 | 17.03 | 7.02% | batch_size=1 |
| Huggingface PyTorch | 41.6 Secs | 0.0116 | 86.54 | 8.53% | batch_size=16 |
| vLLM (Qwen3-0.6B) | 132.78 Secs | 0.0369 | 27.11 | 6.99% | batch_size=1 |
| vLLM (Qwen3-0.6B) | 19.9 Secs | 0.0055 | 180.90 | 7.03% | batch_size=16 |
Note: RTF (Real-Time Factor) is decoding time divided by audio duration; lower is better. RTFx is its reciprocal, the speedup over real time; higher is better.
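As a sanity check on the table, RTF and RTFx can be recomputed from the reported decoding times, assuming the dataset is exactly one hour of audio (per the benchmark details above):

```python
# Recompute RTF/RTFx for two rows of the table above.
audio_seconds = 3600.0  # approx. 1 hour of audio (SPEECHIO_ASR_ZH00007)

for name, decode_seconds in [
    ("Huggingface PyTorch, batch_size=1", 211.40),
    ("vLLM (Qwen3-0.6B), batch_size=16", 19.9),
]:
    rtf = decode_seconds / audio_seconds  # lower is better
    rtfx = 1.0 / rtf                      # higher is better
    print(f"{name}: RTF={rtf:.4f}, RTFx={rtfx:.2f}")
# Prints RTF=0.0587, RTFx=17.03 and RTF=0.0055, RTFx=180.90, matching the table.
```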
For production deployment with high concurrency, we provide integration with NVIDIA Triton Inference Server.
```bash
cd triton_server

# Using Docker Compose (recommended)
docker compose up

# Or pull and run the pre-built image
docker pull soar97/triton-fun-asr:25.06
```
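Once the server is running, audio can be sent through the standard Triton client libraries. The sketch below is illustrative only: the model name `fun_asr` and the tensor names `WAV`, `WAV_LENS`, and `TRANSCRIPTS` are hypothetical placeholders, since the actual interface is defined by the repository's Triton model configuration (see `triton_server/README.md`):

```python
# Hypothetical Triton HTTP client sketch. The model name ("fun_asr") and
# tensor names ("WAV", "WAV_LENS", "TRANSCRIPTS") are placeholders, not
# the repository's actual interface; check triton_server/README.md.
import numpy as np
import soundfile as sf
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load audio and shape it as a [1, num_samples] float32 batch.
audio, _sr = sf.read("zh.mp3", dtype="float32")
wav = audio[np.newaxis, :].astype(np.float32)
wav_len = np.array([[wav.shape[1]]], dtype=np.int32)

inputs = [
    httpclient.InferInput("WAV", list(wav.shape), "FP32"),
    httpclient.InferInput("WAV_LENS", list(wav_len.shape), "INT32"),
]
inputs[0].set_data_from_numpy(wav)
inputs[1].set_data_from_numpy(wav_len)

result = client.infer("fun_asr", inputs)
print(result.as_numpy("TRANSCRIPTS")[0])
```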
Benchmark Details:
- Dataset: SPEECHIO_ASR_ZH00007 (approx. 1 hour of audio)
- Hardware: Single NVIDIA H20 GPU
| Concurrency | CER | Processing Time | P50 Latency | RTF |
|---|---|---|---|---|
| 8 | 7.04% | 44.56s | 450.99ms | 0.0126 |
| 16 | 7.00% | 27.96s | 533.36ms | 0.0079 |
| 32 | 7.07% | 24.51s | 952.93ms | 0.0069 |
For detailed setup instructions, see `triton_server/README.md`.