Conversation

@Mahmoud-ghareeb commented May 20, 2025

Hello everyone,

This PR supports batching multiple audio files together for inference, based on the existing batching mechanism used within a single audio file.

Use case: This enables more efficient GPU utilization and higher throughput when performing inference on multiple small audio files, which would otherwise be processed sequentially and underutilize hardware resources.

Supported input types: audio path (str), np.ndarray, and BinaryIO.

Example:

from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("tiny")
batched_model = BatchedInferencePipeline(model=model)

# Transcribe three files in a single batched call.
result, info = batched_model.transcribe_batch_multiple_audios(
    [physcisworks_path, physcisworks_path, physcisworks_path],
    batch_size=3,
)

segments = [{"text": segment.text} for segment in result]

@MahmoudAshraf97 (Collaborator)

Hello Mahmoud, the idea itself sounds good and the current implementation works, but I have a few concerns. The new function is essentially a duplicated transcribe function, which makes maintenance even harder since we already have two transcribe functions. What I would suggest, to achieve the same functionality without any code changes, is to load all the files, concatenate them, and pass the result to the transcribe function with clip_timestamps set and vad_filter=False. For example:

import numpy as np

from faster_whisper import decode_audio

offset = 0
clip_timestamps = []
audio = np.array([])
for audio_path in audio_paths:
    clip = decode_audio(audio_path)  # 16 kHz mono float32
    # Remember where this file starts and ends (in seconds) within the
    # concatenated audio.
    clip_timestamps.append(
        {"start": offset / 16000, "end": (offset + len(clip)) / 16000}
    )
    audio = np.concatenate((audio, clip))
    offset += len(clip)

I'm also open to discussion if you have a better idea

@Mahmoud-ghareeb (Author)

Hi Mahmoud,

Thanks for your detailed response

Yes, I thought about that, but it was easier for me to duplicate the function because I had a task I wanted to finish as soon as possible, and I published the branch for anyone else to use.

But I will redo it to be cleaner, and I would be pleased to continue the discussion with you about it.

@Mahmoud-ghareeb (Author)

I also thought of doing a hybrid approach: multiple audios, multiple batches.

@MahmoudAshraf97 (Collaborator)

It can be generalized to allow for audios that are not necessarily shorter than 30s: segment each audio using VAD, concatenate all segments across the batch dimension, continue as if it were a single audio, and split the results after transcription. This is beneficial for very large batch sizes, where the last batch would otherwise not be fully occupied.
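A minimal sketch of that idea, reusing decode_audio and batch_size from the earlier snippet; segment_with_vad and transcribe_chunks are hypothetical helpers standing in for the VAD and model calls, not the library's API:

# Pool VAD chunks from every file into one flat list, keyed by source file.
all_chunks = []
for file_index, audio_path in enumerate(audio_paths):
    audio = decode_audio(audio_path)
    for chunk in segment_with_vad(audio):  # hypothetical VAD helper
        all_chunks.append((file_index, chunk))

# Batch across files, so the last batch is filled with chunks from other
# files instead of running half-empty.
texts_per_file = [[] for _ in audio_paths]
for start in range(0, len(all_chunks), batch_size):
    batch = all_chunks[start : start + batch_size]
    texts = transcribe_chunks([chunk for _, chunk in batch])  # hypothetical
    for (file_index, _), text in zip(batch, texts):
        texts_per_file[file_index].append(text)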

@Mahmoud-ghareeb (Author)

Yeah, that's what I meant by the hybrid approach.

I am working on it now

@Nixoals commented Jun 25, 2025

Hi, it would be awesome to support this feature.

@egrinstein

Hi everyone, thank you for the discussion.
@MahmoudAshraf97, wouldn't your concatenation approach's complexity grow with the sum of the durations of all files? Because the transformer context would grow and grow, right?

Also, why is using clip_timestamps necessary?

Thank you and best regards

@MahmoudAshraf97 (Collaborator)

> Hi everyone, thank you for the discussion. @MahmoudAshraf97, wouldn't your concatenation approach's complexity grow with the sum of the durations of all files? Because the transformer context would grow and grow, right?
>
> Also, why is using clip_timestamps necessary?
>
> Thank you and best regards

Not at all. This approach is valid only for files shorter than 30s, and by using clip timestamps we can separate the different files from each other and also process them in parallel, independently of each other.
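To illustrate how the clip boundaries keep the files separable, here is a sketch that routes transcribed segments back to their source files. It reuses audio, clip_timestamps, and batched_model from the snippets above, and assumes the batched pipeline accepts the seconds-based list-of-dicts form shown there:

# Transcribe the concatenated audio; VAD is off because the clip
# boundaries already delimit the regions belonging to each file.
segments, info = batched_model.transcribe(
    audio, clip_timestamps=clip_timestamps, vad_filter=False
)

# Segment timestamps are absolute within the concatenated audio, so a
# segment belongs to whichever clip window contains its start time.
per_file_texts = [[] for _ in clip_timestamps]
for segment in segments:
    for i, clip in enumerate(clip_timestamps):
        if clip["start"] <= segment.start < clip["end"]:
            per_file_texts[i].append(segment.text)
            break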

@j-silv commented Sep 5, 2025

Hi! It looks like this PR has stalled. I’d like to pick it up and continue implementing batching support for multiple audios. Would it be okay if I open a new PR referencing this one? I would continue where @Mahmoud-ghareeb left off.

For my use case, I have a bunch of audio files which are < 30 s. I want to batch transcribe them on the GPU, but I still want to run VAD on each individual audio file in parallel.

I know there are already Whisper implementations that support multi-audio batching, but I'd much rather use this repo for its speed, its built-in VAD, and because the source code is much simpler to understand and modify (compared to HF's many layers of abstraction).
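A sketch of that workflow, decoding and VAD-ing each short file concurrently on the CPU before the batched GPU pass; the get_speech_timestamps helper and its exact signature are assumptions about faster-whisper's Silero VAD module:

from concurrent.futures import ThreadPoolExecutor

from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps  # signature assumed

def vad_one_file(audio_path):
    # Decode to 16 kHz mono float32, then find speech regions
    # (returned as sample indices).
    audio = decode_audio(audio_path)
    return audio, get_speech_timestamps(audio)

# Run VAD on each file in parallel worker threads.
with ThreadPoolExecutor() as pool:
    vad_results = list(pool.map(vad_one_file, audio_paths))

# Keep only the speech portions for the batched GPU transcription step.
speech_chunks = [
    audio[ts["start"] : ts["end"]]
    for audio, timestamps in vad_results
    for ts in timestamps
]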

@MahmoudAshraf97 (Collaborator)

Yes, feel free to open a PR that continues this one.

j-silv added a commit to j-silv/faster-whisper that referenced this pull request Sep 8, 2025

This work continues where SYSTRAN#1302 left off. The goal is to
transcribe multiple audio files truly in parallel and increase
GPU throughput.

For more information please refer to the pull request
@Mahmoud-ghareeb (Author) commented Dec 2, 2025

Extended BatchedInferencePipeline.transcribe() to accept a list of multiple audio inputs, enabling batch transcription of multiple audio files in a single call with GPU-parallel inference. It works with any audio duration, even > 30 sec.

Example

model = WhisperModel("tiny")
batched_model = BatchedInferencePipeline(model=model)

# Single audio (unchanged)
segments, info = batched_model.transcribe(audio)

# Multiple audios (new)
segments, info = batched_model.transcribe([audio1, audio2, audio3])

Deprecation

Added a deprecation warning to transcribe_batch_multiple_audios(); users should migrate to transcribe() with a list.
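For reference, a deprecation shim of the kind described might look like this (illustrative, not the PR's exact code):

import warnings

def transcribe_batch_multiple_audios(self, audios, **kwargs):
    # Warn callers of the old entry point, then delegate to the new one.
    warnings.warn(
        "transcribe_batch_multiple_audios() is deprecated; "
        "pass a list of audios to transcribe() instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return self.transcribe(audios, **kwargs)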

Testing

  • Added test_transcribe_multiple_audios
  • Added test_transcribe_multiple_audios_with_word_timestamps

@MahmoudAshraf97 I also merged the latest changes, so if everything is OK, please merge it.
