Batching Support For Multiple Audios #1302
base: master
Conversation
…fferenciate between batching inside audio and multiple audios
|
Hello Mahmoud, the idea itself sounds good and the current implementation works, but I have a few concerns: the new function is basically a duplicated one.

```python
offset = 0
clip_timestamps = []
audio = np.array([])
for audio_path in audio_pathes:
    clip = decode_audio(audio_path)
    clip_timestamps.append(
        {"start": offset / 16000, "end": (offset + len(clip)) / 16000}
    )
    audio = np.concatenate((audio, clip))
    offset += len(clip)
```

I'm also open to discussion if you have a better idea |
|
Hi Mahmoud, thanks for your detailed response. Yes, I thought about it, but it was easier for me to duplicate because I had a task that I wanted to finish as soon as possible, and I published the branch for anyone else to use. But I will redo it to be cleaner, and I will be pleased to continue the discussion with you about it |
|
I also thought of doing a hybrid approach => multiple audios, multiple batches |
|
It can be generalized to allow audios that are not necessarily shorter than 30 s: segment each audio using VAD, concatenate all segments across the batch dimension, continue as if it were a single audio, and split the results back per file after transcription. This is beneficial for very large batch sizes, where the last batch would otherwise not be fully occupied. |
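A rough sketch of that generalization (editor's illustration, not code from this PR): decode each file, run VAD on it, collect all speech chunks across files into one batch, and remember which file each chunk came from so the results can be split again after transcription. Here `decode_audio` and `get_speech_timestamps` are assumed to behave like faster_whisper's helpers; exact signatures may differ by version.

```python
import numpy as np
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps  # assumed helper location

SAMPLE_RATE = 16000

def build_batch(audio_paths):
    """Segment every file with VAD and gather all speech chunks into one
    batch, remembering which file each chunk came from."""
    chunks, owners = [], []
    for file_idx, path in enumerate(audio_paths):
        audio = decode_audio(path, sampling_rate=SAMPLE_RATE)
        # Each timestamp dict is assumed to hold "start"/"end" sample offsets.
        for ts in get_speech_timestamps(audio):
            chunks.append(audio[ts["start"]:ts["end"]])
            owners.append(file_idx)
    # All chunks are transcribed as a single batch; `owners` is then used to
    # route each transcribed chunk back to its source file.
    return chunks, owners
```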
|
Yeah, that's what I meant by the hybrid approach. I am working on it now |
|
Hi, it would be awesome to support this feature. |
|
Hi everyone, thank you for the discussion. Also, why is using clip_timestamps necessary? Thank you and best regards |
Not at all; this approach is valid only for files shorter than 30 s. By using clip timestamps we can separate the different files from each other and also process them in parallel, independently of each other
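To make the clip-timestamp idea concrete, here is an editor's sketch (not part of the PR) of routing transcribed segments back to their source files; it only assumes that each returned segment exposes start/end times in seconds on the concatenated timeline, matching the clip timestamps built in the snippet earlier in this thread.

```python
def split_segments_by_file(segments, clip_timestamps):
    """Assign each transcribed segment to the source file whose clip
    (start/end in seconds on the concatenated timeline) contains it."""
    per_file = [[] for _ in clip_timestamps]
    for seg in segments:
        for idx, clip in enumerate(clip_timestamps):
            if clip["start"] <= seg.start < clip["end"]:
                per_file[idx].append(seg)
                break
    return per_file
```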
|
Hi! It looks like this PR has stalled. I’d like to pick it up and continue implementing batching support for multiple audios. Would it be okay if I open a new PR referencing this one? I would continue where @Mahmoud-ghareeb left off. For my use case, I have a bunch of audio files which are < 30 s. I want to batch transcribe them on the GPU, but I still want to run VAD on each individual audio file in parallel. I know that there are already whisper models that exist for multiple audio batching, but I'd much rather use this repo for its speed, the fact that VAD is available, and because the source code is much simpler to understand and modify (as compared to HF's many layers of abstraction). |
|
Yes feel free to open a PR that continues this one |
This work continues where SYSTRAN#1302 left off. The goal is to transcribe multiple audio files truly in parallel and increase GPU throughput. For more information, please refer to the pull request.
…nal transcribe function.
|
Extended BatchedInferencePipeline.transcribe() to accept a list of multiple audio inputs, enabling batch transcription of multiple audio files in a single call with GPU-parallel inference.

Example
model = WhisperModel("tiny")
Single audio (unchanged)
Multiple audios (new)

Deprecation
Added a deprecation warning to transcribe_batch_multiple_audios(); users should migrate to transcribe() with a list.

Testing
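A sketch of the usage described in the Example section above (editor's illustration; the return shape of the multi-audio call is an assumption, as the PR text here only names the API):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("tiny")
pipeline = BatchedInferencePipeline(model)

# Single audio (unchanged): returns (segments, info) as before.
segments, info = pipeline.transcribe("audio_1.wav", batch_size=16)

# Multiple audios (new): pass a list; one (segments, info) pair per input
# file is assumed here, not confirmed by the PR text.
results = pipeline.transcribe(["audio_1.wav", "audio_2.wav"], batch_size=16)
for segments, info in results:
    print(info.duration, " ".join(s.text for s in segments))
```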
@MahmoudAshraf97 I also merged the latest changes, so if everything is ok, please merge it. |
Hello everyone,
This PR supports batching multiple audio files together for inference, based on the existing batching mechanism used within a single audio file.
Use case: This enables more efficient GPU utilization and higher throughput when performing inference on multiple small audio files, which would otherwise be processed sequentially and underutilize hardware resources.
Supports [audio_path, np.array, BinaryIO]
Example:
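The example code itself was not captured on this page; a plausible sketch, assuming the transcribe_batch_multiple_audios() method named later in this thread takes a list of inputs and returns one result per input (both assumptions):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

pipeline = BatchedInferencePipeline(WhisperModel("tiny"))
# Inputs may be file paths, np.ndarray audio, or open binary file objects.
audios = ["short_1.wav", "short_2.wav", open("short_3.wav", "rb")]
results = pipeline.transcribe_batch_multiple_audios(audios)  # assumed signature
```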