Automatically transcribes any file containing audio using OpenAI Whisper, searching folders recursively.
Supported media files:
.mp3, .wav, .m4a, .flac, .aac, .ogg, .wma, .mp4, .mkv, .webm, .opus, .mov, .avi
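The recursive search over these extensions could be sketched as follows. This is a minimal illustration, not the project's actual code: the names `MEDIA_EXTENSIONS` and `find_media_files` are hypothetical, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Extensions taken from the supported list above.
MEDIA_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg", ".wma",
    ".mp4", ".mkv", ".webm", ".opus", ".mov", ".avi",
}


def find_media_files(root):
    """Return the supported media files found at or below `root`.

    `root` may be a single file or a directory. Matching on the
    lower-cased suffix is an assumption made for this sketch.
    """
    root = Path(root)
    if root.is_file():
        return [root] if root.suffix.lower() in MEDIA_EXTENSIONS else []
    # rglob("*") walks the directory tree recursively.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )
```

Calling `find_media_files("./media")` would return every matching file under `./media`, including files in nested subfolders.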
Files will be saved in the same directory as the media file, with the same base name.
Supported output file types:
.lrc, .vtt, .srt, .txt, .json
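The naming rule (same directory, same base name, new extension) can be illustrated with a short sketch. The helper `target_path` and the constant `TARGET_TYPES` are hypothetical names used only for illustration. The sketch ignores options that may change the final name, such as `--targetsuffix` or `--targetexists rename`.

```python
from pathlib import Path

# Output types taken from the supported list above.
TARGET_TYPES = {"lrc", "vtt", "srt", "txt", "json"}


def target_path(media_file, target_type="lrc"):
    """Return the output path next to `media_file` with the same base name."""
    if target_type not in TARGET_TYPES:
        raise ValueError(f"unsupported target type: {target_type}")
    # with_suffix swaps only the final extension and keeps the directory.
    return Path(media_file).with_suffix("." + target_type)
```

For example, `target_path("/mnt/nas/music/band/song.mp3", "srt")` gives `/mnt/nas/music/band/song.srt`.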
python3 src/main.py --help
usage: python3 main.py [options]
Transcribe media files to LRC using Whisper
options:
-h, --help show this help message and exit
-m [PATH], --media [PATH]
Path to a file or directory where media files will be searched recursively
-n [MODEL], --modelname [MODEL]
available whisper models: (Default: tiny)
tiny: Smallest, fastest model with lower accuracy.
tiny.en: English-only tiny, slightly better for English tasks.
base: Balanced size, speed, and accuracy.
base.en: English-only base, improved English performance.
small: More accurate than base, but larger and slower.
small.en: English-only small, enhanced for English tasks.
medium: High accuracy, resource-intensive.
medium.en: English-only medium, optimized for English.
large: Original large model, high accuracy, heavy and slow.
large-v1: First large variant, improved accuracy and stability.
large-v2: Upgraded large-v1, better reasoning and alignment.
large-v3: Most advanced, best performance overall.
large-v3-turbo: Optimized large-v3, faster with similar accuracy.
turbo: Fastest variant, high accuracy, resource-efficient.
-v, --verbose activate verbose mode
-im, --inmemory load model entirely into RAM
-d [DEVICE], --device [DEVICE]
available devices: cpu or cuda
-st [TYPE], --sourcetype [TYPE]
available types: mp3, wav, m4a, flac, aac, ogg, wma, mp4, mkv, webm, opus, mov, avi. (Default: all)
-sl [LANGUAGE], --sourcelanguage [LANGUAGE]
-tl [LANGUAGE], --targetlanguage [LANGUAGE]
ISO 639-1 available languages:
af: afrikaans|am: amharic|ar: arabic|as: assamese|az: azerbaijani|ba: bashkir|be: belarusian|bg: bulgarian|bn: bengali|bo: tibetan|br: breton|bs: bosnian|ca: catalan|cs: czech|cy: welsh|da: danish|de: german|el: greek|en: english|es: spanish|et: estonian|eu: basque|fa: persian|fi: finnish|fo: faroese|fr: french|gl: galician|gu: gujarati|ha: hausa|haw: hawaiian|he: hebrew|hi: hindi|hr: croatian|ht: haitian creole|hu: hungarian|hy: armenian|id: indonesian|is: icelandic|it: italian|ja: japanese|jw: javanese|ka: georgian|kk: kazakh|km: khmer|kn: kannada|ko: korean|la: latin|lb: luxembourgish|ln: lingala|lo: lao|lt: lithuanian|lv: latvian|mg: malagasy|mi: maori|mk: macedonian|ml: malayalam|mn: mongolian|mr: marathi|ms: malay|mt: maltese|my: myanmar|ne: nepali|nl: dutch|nn: nynorsk|no: norwegian|oc: occitan|pa: punjabi|pl: polish|ps: pashto|pt: portuguese|ro: romanian|ru: russian|sa: sanskrit|sd: sindhi|si: sinhala|sk: slovak|sl: slovenian|sn: shona|so: somali|sq: albanian|sr: serbian|su: sundanese|sv: swedish|sw: swahili|ta: tamil|te: telugu|tg: tajik|th: thai|tk: turkmen|tl: tagalog|tr: turkish|tt: tatar|uk: ukrainian|ur: urdu|uz: uzbek|vi: vietnamese|yi: yiddish|yo: yoruba|yue: cantonese|zh: chinese. (Default: auto)
-tt [TYPE], --targettype [TYPE]
available types: lrc, txt, srt, json, vtt. (Default: lrc)
-te [ACTION], --targetexists [ACTION]
available actions: overwrite, skip, rename. (Default: skip)
-ts, --targetsuffix add suffix to target file name. (Default: false)
-ea, --exportall export original and translated text together as target files.
(Default: false)
-t TRACK, --track TRACK
extract audio track (1=first, 2=second, 3=third, etc). (Default: 1)
--temperature [TEMP] Temperature for transcription sampling (0.0 to 1.0).
Lower values increase determinism, higher values increase variability. (Default: 0.0)
--beam-size [SIZE] Number of hypotheses considered during decoding (1 to 20).
Higher values increase accuracy but slow down processing. (Default: 5)
--best-of [N] Number of transcription samples to compare (1 to 10).
Higher values improve accuracy but increase processing time. (Default: 5)
--prompt [TEXT] Initial text to guide transcription (e.g., context or
keywords). (Default: None)
Example usage:
python3 src/main.py --media ./media/sample.mp3 --modelname tiny --device cuda --verbose --sourcetype mp3 --sourcelanguage en --targetlanguage en

python3 -m venv .venv
source .venv/bin/activate # Linux/macOS
# .\.venv\Scripts\activate # Windows
pip3 install -r requirements.txt

# base model without in-memory loading
python3 src/main.py -v -m ./media/sample.mp3 -n base.en -tt lrc -te overwrite
# larger model with in-memory loading
python3 src/main.py -v -m ./media/sample.mp3 -n large -im -sl en -tt lrc -te overwrite -ts
# transcribe a specific audio track
python3 src/main.py -v -m ./media/sample3trk.mp4 -n medium.en -sl en -tt lrc -te overwrite -t 2
# transcribe a specific audio track with different settings
python3 src/main.py -v -m ./media/sample3trk.mp4 -n base.en -sl en -tt lrc -te overwrite -t 2 --temperature 0.2 --beam-size 7 --best-of 5 --prompt "transcribe the voice"

docker run -d --name takigrapher \
-v "./data/whisper/cache:/root/.cache/whisper" \
-v "/mnt/nas/music/:/app/media/music/" \
luizhp/takigrapher:latest

docker run -d --gpus all \
--name takigrapher \
-v "./data/whisper/cache:/root/.cache/whisper" \
-v "/mnt/nas/music/:/app/media/music/" \
luizhp/takigrapher:latest

- mount the media folder
  -v "/mnt/nas/music/:/app/media/music/"
- keep models in /root/.cache/whisper
  -v "./data/whisper/cache:/root/.cache/whisper"
The examples below show how to transcribe files or folders containing audio using different command-line options and Docker commands:
docker exec -it takigrapher bash
takigrapher -v -m ./media/sample.mp3 -n tiny.en -tt srt -te overwrite -sl en -ts

docker exec -it takigrapher takigrapher -v -m ./media/music/bandname/ -n medium -tt lrc -te overwrite -ts

docker exec -it takigrapher takigrapher -v -m ./media/music/bandname/song.mp3 -n medium -tt lrc -te overwrite -ts

docker exec -it takigrapher takigrapher -m ./media/tv/mytvshow/ -n medium.en -sl en -tt srt -te rename

See the docker-compose-cpu.yaml file for a CPU setup.
See the docker-compose-gpu.yaml file for a GPU setup.
This project is licensed under the GNU GPLv3 License - see the LICENSE file for details.