
Commit 91e866b

Add modal integration for STT model
1 parent 2dbca5e commit 91e866b

File tree

5 files changed (+125, -294 lines)

README.md

Lines changed: 28 additions & 293 deletions
@@ -1,316 +1,51 @@
-[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
+# Large V3 Faster Whisper Modal Deployment On Modal.com

-# Faster Whisper transcription with CTranslate2
+A FastAPI-based server that uses [Faster Whisper](https://github.com/guillaumekln/faster-whisper) for speech-to-text transcription, deployed on [modal.com](https://modal.com). This guide walks you through cloning, setting up, and deploying the server.

-**faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.
+---

-This implementation is up to 4 times faster than [openai/whisper](https://github.com/openai/whisper) for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
+## Prerequisites

-## Benchmark
+- **Python 3.x**
+- **[Modal Account](https://modal.com)** for deployment

-### Whisper
+---

-For reference, here's the time and memory usage required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:
+## Installation Guide

-* [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
-* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
-* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
+### 1. Clone the Repository

-### Large-v2 model on GPU
-
-| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
-| --- | --- | --- | --- | --- | --- |
-| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
-| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
-| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |
-
-*Executed with CUDA 11.7.1 on an NVIDIA Tesla V100S.*
-
-### Small model on CPU
-
-| Implementation | Precision | Beam size | Time | Max. memory |
-| --- | --- | --- | --- | --- |
-| openai/whisper | fp32 | 5 | 10m31s | 3101MB |
-| whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
-| whisper.cpp | fp16 | 5 | 12m39s | 873MB |
-| faster-whisper | fp32 | 5 | 2m44s | 1675MB |
-| faster-whisper | int8 | 5 | 2m04s | 995MB |
-
-*Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.*
-
-### Distil-whisper
-
-| Implementation | Precision | Beam size | Time | Gigaspeech WER |
-| --- | --- | --- | --- | --- |
-| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
-| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
-| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
-| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
-
-*Executed with CUDA 11.4 on an NVIDIA 3090.*
-
-<details>
-<summary>Testing details (click to expand)</summary>
-
-For `distil-whisper/distil-large-v2`, the WER is tested with the code sample from [this link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). For `faster-distil-whisper`, the WER is tested with the following settings:
-
-```python
-from faster_whisper import WhisperModel
-
-model_size = "distil-large-v2"
-# model_size = "distil-medium.en"
-# Run on GPU with FP16
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
-```
-</details>
-
-## Requirements
-
-* Python 3.8 or greater
-
-### GPU
-
-GPU execution requires the following NVIDIA libraries to be installed:
-
-* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
-* [cuDNN 8 for CUDA 12](https://developer.nvidia.com/cudnn)
-
-**Note**: The latest versions of `ctranslate2` support CUDA 12 only. For CUDA 11, the current workaround is downgrading to version `3.24.0` of `ctranslate2` (this can be done with `pip install --force-reinstall ctranslate2==3.24.0` or by pinning the version in a `requirements.txt`).
-
-There are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.
-
-<details>
-<summary>Other installation methods (click to expand)</summary>
-
-**Note:** For all the methods below, keep in mind the above note regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of the libraries that correspond to the CUDA 12 libraries listed in the instructions below.
-
-#### Use Docker
-
-The libraries (cuBLAS, cuDNN) are installed in these official NVIDIA CUDA Docker images: `nvidia/cuda:12.0.0-runtime-ubuntu20.04` or `nvidia/cuda:12.0.0-runtime-ubuntu22.04`.
-
-#### Install with `pip` (Linux only)
-
-On Linux these libraries can be installed with `pip`. Note that `LD_LIBRARY_PATH` must be set before launching Python.
+Clone the repository to your local machine and move into it:

```bash
-pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
-
-export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
-```
-
-**Note**: Version 9+ of `nvidia-cudnn-cu12` appears to cause issues due to its reliance on cuDNN 9 (Faster-Whisper does not currently support cuDNN 9). Ensure your version of the Python package is for cuDNN 8.
-
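For example, one way to keep the `pip`-installed package on cuDNN 8 (a sketch; the exact constraint may need adjusting for your environment):

```bash
pip install nvidia-cublas-cu12 "nvidia-cudnn-cu12==8.*"
```
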
-#### Download the libraries from Purfview's repository (Windows & Linux)
-
-Purfview's [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides the required NVIDIA libraries for Windows & Linux in a [single archive](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Decompress the archive and place the libraries in a directory included in the `PATH`.
-
-</details>
-
-## Installation
-
-The module can be installed from [PyPI](https://pypi.org/project/faster-whisper/):
-
-```bash
-pip install faster-whisper
+git clone https://github.com/SYSTRAN/faster-whisper.git
+cd faster-whisper
```

-<details>
-<summary>Other installation methods (click to expand)</summary>
-
-### Install the master branch
-
-```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
-```

-### Install a specific commit
+### 2. Install the Modal SDK
+Install the Modal SDK for deploying applications to the Modal cloud:

```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
-```
-
-</details>
-
-## Usage
-
-### Faster-whisper
-
-```python
-from faster_whisper import WhisperModel
-
-model_size = "large-v3"
-
-# Run on GPU with FP16
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-
-# or run on GPU with INT8
-# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
-# or run on CPU with INT8
-# model = WhisperModel(model_size, device="cpu", compute_type="int8")
-
-segments, info = model.transcribe("audio.mp3", beam_size=5)
-
-print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
-
-for segment in segments:
-    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
-```
-
-**Warning:** `segments` is a *generator*, so the transcription only starts when you iterate over it. The transcription can be run to completion by gathering the segments in a list or a `for` loop:
-
-```python
-segments, _ = model.transcribe("audio.mp3")
-segments = list(segments)  # The transcription will actually run here.
-```
-
-### Multi-segment language detection
-
-To directly use the model for improved language detection, the following code snippet can be used:
-
-```python
-from faster_whisper import WhisperModel
-
-model = WhisperModel("medium", device="cuda", compute_type="float16")
-language_info = model.detect_language_multi_segment("audio.mp3")
-```
-
-### Batched faster-whisper
-
-The following code snippet illustrates how to run inference with the batched version on an example audio file. Please also refer to the test scripts of batched faster-whisper.
-
-```python
-from faster_whisper import WhisperModel, BatchedInferencePipeline
-
-model = WhisperModel("medium", device="cuda", compute_type="float16")
-batched_model = BatchedInferencePipeline(model=model)
-segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
-
-for segment in segments:
-    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
-```
-
-### Faster Distil-Whisper
-
-The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3) checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet demonstrates how to run inference with distil-large-v3 on a specified audio file:
-
-```python
-from faster_whisper import WhisperModel
-
-model_size = "distil-large-v3"
-
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)
-
-for segment in segments:
-    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
-```
-
-For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).
-
-### Word-level timestamps
-
-```python
-segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
-
-for segment in segments:
-    for word in segment.words:
-        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
+pip install modal
```

-### VAD filter
-
-The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad) model to filter out parts of the audio without speech:
-
-```python
-segments, _ = model.transcribe("audio.mp3", vad_filter=True)
-```
-
-The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
-
-```python
-segments, _ = model.transcribe(
-    "audio.mp3",
-    vad_filter=True,
-    vad_parameters=dict(min_silence_duration_ms=500),
-)
-```
-
-### Logging
-
-The library's logging level can be configured like this:
-
-```python
-import logging
-
-logging.basicConfig()
-logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
-```
-
-### Going further
-
-See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
-## Community integrations
-
-Here is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
-
-* [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) is an OpenAI-compatible server using `faster-whisper`. It's easily deployable with Docker, works with OpenAI SDKs/CLI, and supports streaming and live transcription.
-* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment.
-* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command-line client based on faster-whisper and compatible with the original client from openai/whisper.
-* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool based on faster-whisper and NVIDIA NeMo.
-* [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) provides standalone CLI executables of faster-whisper for Windows, Linux & macOS.
-* [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end-to-end multi-speaker speech-to-text solution implemented using AzureML pipelines.
-* [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.
-* [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper. It can export word-level transcripts, which can then be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor).
-* [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper, developed at the BANDAS-Center at the University of Graz, for transcription and diarization on Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.
-* [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models, with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
-* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper that uses faster-whisper as the backend to transcribe audio in real time.
-* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
-## Model conversion
-
-When loading a model from its size such as `WhisperModel("large-v3")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
-
-We also provide a script to convert any Whisper models compatible with the Transformers library. These could be the original OpenAI models or user fine-tuned models.
-
-For example, the command below converts the [original "large-v3" Whisper model](https://huggingface.co/openai/whisper-large-v3) and saves the weights in FP16:
+### 3. Set Up Modal
+Set up Modal authentication. This will open a browser window for you to authorize access to your Modal account:
```bash
-pip install transformers[torch]>=4.23
-
-ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 \
-    --copy_files tokenizer.json preprocessor_config.json --quantization float16
+python3 -m modal setup
```
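If you are authenticating from a headless machine (for example, in CI), the Modal CLI can also accept a pre-created token directly; a sketch, assuming you have generated credentials in the Modal dashboard:

```bash
modal token set --token-id <your-token-id> --token-secret <your-token-secret>
```
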

-* The option `--model` accepts a model name on the Hub or a path to a model directory.
-* If the option `--copy_files tokenizer.json` is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.
-
-Models can also be converted from code. See the [conversion API](https://opennmt.net/CTranslate2/python/ctranslate2.converters.TransformersConverter.html).
-
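As a rough illustration of the conversion API linked above (a sketch mirroring the CLI flags from the deleted example, not part of this commit):

```python
import ctranslate2

# Convert a Transformers-compatible Whisper checkpoint to CTranslate2 format.
converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-large-v3",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("whisper-large-v3-ct2", quantization="float16")
```
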
-### Load a converted model
-
-1. Directly load the model from a local directory:
-```python
-model = faster_whisper.WhisperModel("whisper-large-v3-ct2")
+### 4. Deploy the App on Modal
+Deploy the app to Modal. The app link is printed in the terminal and also shown on the Modal dashboard:
+```bash
+modal deploy app.py
```
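The README does not reproduce `app.py`; a minimal sketch of what such a server might look like (the app and function names are inferred from the endpoint URL in step 5, while the GPU type, image contents, and request handling are assumptions):

```python
# Hypothetical sketch of app.py -- not the file committed here.
import modal

image = modal.Image.debian_slim().pip_install(
    "faster-whisper", "fastapi", "python-multipart"
)

app = modal.App("faster-whisper-server")


@app.function(image=image, gpu="any")
@modal.asgi_app()
def fastapi_wrapper():
    import shutil
    import tempfile

    from fastapi import FastAPI, File, UploadFile
    from faster_whisper import WhisperModel

    web_app = FastAPI()
    # Loaded once per container, then reused across requests.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    @web_app.post("/transcribe")
    async def transcribe(file: UploadFile = File(...)):
        # faster-whisper reads from a path, so spool the upload to disk first.
        with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as tmp:
            shutil.copyfileobj(file.file, tmp)
            audio_path = tmp.name
        segments, info = model.transcribe(audio_path, beam_size=5)
        return {
            "language": info.language,
            "text": "".join(segment.text for segment in segments),
        }

    return web_app
```
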

-2. [Upload your model to the Hugging Face Hub](https://huggingface.co/docs/transformers/model_sharing#upload-with-the-web-interface) and load it from its name:
-```python
-model = faster_whisper.WhisperModel("username/whisper-large-v3-ct2")
-```
-
-## Comparing performance against other implementations
-
-If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:
+### 5. Test the Deployed App
+After the code is deployed, retrieve the app link from the Modal.com dashboard and send it a test request, for example:

-* Verify that the same transcription options are used, especially the same beam size. For example, in openai/whisper, `model.transcribe` uses a default beam size of 1, but here we use a default beam size of 5.
-* When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable `OMP_NUM_THREADS`, which can be set when running your script:
-
```bash
315-
OMP_NUM_THREADS=4 python3 my_script.py
316-
```
48+
```bash
49+
curl --location 'https://your-name--faster-whisper-server-fastapi-wrapper.modal.run/transcribe' \
50+
--form 'file=@"/home/user/Desktop/locean-et-lhumanite-destins-lies-lamya-essemlali-tedxorleans-128-ytshorts.savetube.me.mp3"'
51+
```
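Equivalently, a small Python client (a sketch; substitute your own deployment URL and audio file):

```python
import requests

# Replace with the URL printed by `modal deploy` for your account.
url = "https://your-name--faster-whisper-server-fastapi-wrapper.modal.run/transcribe"

with open("audio.mp3", "rb") as f:
    response = requests.post(url, files={"file": ("audio.mp3", f, "audio/mpeg")})

response.raise_for_status()
print(response.json())
```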

0 commit comments