# Faster Whisper Large-v3 Deployment on Modal.com

A FastAPI-based server that uses [Faster Whisper](https://github.com/guillaumekln/faster-whisper) for speech-to-text transcription, deployed on [modal.com](https://modal.com). This guide walks you through cloning, setting up, and deploying the server.

---

## Prerequisites

- **Python 3.x**
- **[Modal Account](https://modal.com)** for deployment

---

## Installation Guide

### 1. Clone the Repository

Clone the `faster-whisper-modal` repository to your local machine:

```bash
# NOTE: replace the URL below with your faster-whisper-modal repository if it
# differs; `git clone` creates a directory named after the repository.
git clone https://github.com/SYSTRAN/faster-whisper.git
cd faster-whisper-modal
```

### 2. Install the Modal SDK

Install the Modal SDK for deploying applications to the Modal cloud:

```bash
pip install modal
```

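Optionally, confirm the CLI is available before continuing (a quick sanity check, not a required step):

```bash
# should print the Modal CLI's help text if the SDK installed correctly
python3 -m modal --help
```
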
### 3. Set Up Modal

Set up Modal authentication. This will open a browser window for you to authorize access to your Modal account:

```bash
python3 -m modal setup
```

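If you are deploying from a headless server or CI where a browser cannot open, the Modal CLI also lets you supply a pre-created API token directly. The snippet below is a sketch of that alternative; the token values are placeholders, and you should check `python3 -m modal token --help` if the flags differ in your installed version:

```bash
# alternative to the browser flow: paste a token created in the Modal dashboard
python3 -m modal token set --token-id <your-token-id> --token-secret <your-token-secret>
```
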
### 4. Deploy the App on Modal

Deploy the app on Modal and copy the app URL from the terminal output or the Modal dashboard:

```bash
modal deploy app.py
```

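The repository's `app.py` is what gets deployed in this step. For orientation, here is a minimal sketch of what such a file can look like: a Modal app that wraps a FastAPI `/transcribe` endpoint around faster-whisper. The app name, function name, base image, GPU choice, and response fields are assumptions inferred from the curl example in step 5, not the repository's actual code.

```python
# app.py -- illustrative sketch only; adapt names and settings to your project.
import modal

app = modal.App("faster-whisper-server")

# CUDA runtime base image so ctranslate2 can find cuBLAS/cuDNN on the GPU worker.
image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04", add_python="3.11"
    )
    .apt_install("ffmpeg")
    .pip_install("faster-whisper", "fastapi", "python-multipart")
)


@app.function(image=image, gpu="any", timeout=600)
@modal.asgi_app()
def fastapi_wrapper():
    import tempfile

    from fastapi import FastAPI, File, UploadFile
    from faster_whisper import WhisperModel

    web_app = FastAPI()
    # Loaded once per container start, then reused across requests.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    @web_app.post("/transcribe")
    async def transcribe(file: UploadFile = File(...)):
        # Persist the upload to a temp file so faster-whisper can decode it.
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp.write(await file.read())
        segments, info = model.transcribe(tmp.name, beam_size=5)
        return {
            "language": info.language,
            "text": " ".join(segment.text.strip() for segment in segments),
        }

    return web_app
```

With a layout like this, `modal deploy app.py` builds the image, provisions a container on demand, and prints the public URL of the endpoint. Note that the first request after a cold start can take noticeably longer while the container boots and the model weights load.
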
### 5. Test the Deployed App

After the code is deployed, retrieve the app link from the Modal dashboard. A transcription request against it will look similar to this (replace the URL and file path with your own):

```bash
curl --location 'https://your-name--faster-whisper-server-fastapi-wrapper.modal.run/transcribe' \
--form 'file=@"/home/user/Desktop/locean-et-lhumanite-destins-lies-lamya-essemlali-tedxorleans-128-ytshorts.savetube.me.mp3"'
```
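
If you prefer calling the endpoint from Python instead of curl, a small client could look like the following sketch. The URL is the same placeholder as above, and the `language`/`text` response keys are assumptions about what your `app.py` returns, so adjust them to match your deployment.

```python
# client.py -- hypothetical example; adjust the URL and response keys to your deployment.
import requests

URL = "https://your-name--faster-whisper-server-fastapi-wrapper.modal.run/transcribe"

with open("audio.mp3", "rb") as f:
    # Send the audio as multipart/form-data under the "file" field, matching the curl call above.
    response = requests.post(URL, files={"file": ("audio.mp3", f, "audio/mpeg")}, timeout=600)

response.raise_for_status()
payload = response.json()
print(payload.get("language"), payload.get("text"))
```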