This project is a Flask-based web application that allows users to upload a video file, extract its audio, transcribe the audio using OpenAI's Whisper model, and generate a WebVTT file with speaker placeholders. Additionally, the app performs speaker diarization via pyannote.audio and provides an interface for previewing the video with the generated subtitles, as well as editing the speaker names before downloading the final files.
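At its core the app is one Flask route that accepts the upload and hands it to the processing pipeline. A minimal sketch, assuming a `video` form field and a placeholder `process_video` helper (both hypothetical; the real `app.py` differs):

```python
import tempfile

from flask import Flask, render_template, request

app = Flask(__name__)

def process_video(video_path: str):
    """Placeholder for the ffmpeg/Whisper/pyannote pipeline sketched below."""
    return "out.vtt", ["SPEAKER_01", "SPEAKER_02"]

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        video = request.files["video"]               # assumed form field name
        video_path = tempfile.mktemp(suffix=".mp4")  # kept for later preview
        video.save(video_path)
        vtt_path, speakers = process_video(video_path)
        return render_template("edit.html", vtt=vtt_path, speakers=speakers)
    return render_template("index.html")

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000/
```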
## Features

- Video Upload: Upload a video file through the web interface.
- Audio Extraction: Uses `ffmpeg` to extract audio from the video.
- Speech Transcription: Transcribes the audio using OpenAI's Whisper (the `medium.en` model by default); a sketch of the extraction-plus-transcription flow follows this list.
- Speaker Diarization: Uses `pyannote.audio` to identify different speakers and assign placeholder labels (e.g., `[SPEAKER_01]`, `[SPEAKER_02]`).
- Editable Subtitles: Provides a preview of the video with subtitles and a form to update speaker names.
- Downloadable Files: Allows users to download the final WebVTT file and a plain text transcript.
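To make the extract-and-transcribe path concrete, here is a minimal sketch with a throwaway `extract_audio` helper and a hypothetical input file name (the actual `app.py` wires this into the Flask routes and adds diarization):

```python
import subprocess
import tempfile

import whisper  # pip install openai-whisper

def extract_audio(video_path: str) -> str:
    """Use ffmpeg to pull a 16 kHz mono WAV out of the uploaded video."""
    audio_path = tempfile.mktemp(suffix=".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    return audio_path

model = whisper.load_model("medium.en")        # the default model noted above
audio = extract_audio("uploaded_video.mp4")    # hypothetical upload path
result = model.transcribe(audio)               # {"text": ..., "segments": [...]}
for seg in result["segments"]:
    print(f"{seg['start']:.1f}s -> {seg['end']:.1f}s: {seg['text'].strip()}")
```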
## Requirements

- Python 3.7+
- Flask – Web framework for Python.
- ffmpeg – Must be installed and available in the system PATH.
- OpenAI Whisper – For speech transcription. Install via `pip install openai-whisper`.
- PyTorch – Required by Whisper (install a GPU-enabled version if available).
- pyannote.audio – For speaker diarization. Install via `pip install pyannote.audio`.
- Hugging Face Access Token – Required to load the `pyannote/speaker-diarization` model. Replace the placeholder in `app.py` with your token.
- UIkit – Frontend CSS framework loaded via CDN in the HTML templates.
## Installation

1. Clone the Repository:

   ```bash
   git clone <repository_url>
   cd <repository_directory>
   ```

2. Create a Virtual Environment and Activate It:

   ```bash
   python -m venv venv
   source venv/bin/activate   # On Windows use: venv\Scripts\activate
   ```

3. Install the Required Packages:

   ```bash
   pip install Flask openai-whisper pyannote.audio
   ```

   Ensure that you have the appropriate version of PyTorch installed; a quick check follows.
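   To confirm which PyTorch build is active (and whether Whisper can use a GPU), run:

   ```python
   import torch

   print(torch.__version__)          # installed PyTorch version
   print(torch.cuda.is_available())  # True only with a working CUDA build
   ```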
4. Install ffmpeg:

   - On Ubuntu: `sudo apt install ffmpeg`
   - On Windows: Download from ffmpeg.org and add it to your system PATH.
5. Configure Your Hugging Face Token:

   - Open `app.py` and replace `"YOUR_HF_TOKEN"` with your actual Hugging Face access token. Loading the model with the token looks roughly like the sketch below.
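   A minimal sketch of how `pyannote.audio` 2.x loads this model with a token (the `use_auth_token` argument name may differ in newer releases):

   ```python
   from pyannote.audio import Pipeline

   pipeline = Pipeline.from_pretrained(
       "pyannote/speaker-diarization",
       use_auth_token="YOUR_HF_TOKEN",  # your Hugging Face access token
   )

   diarization = pipeline("audio.wav")  # hypothetical audio file
   for turn, _, speaker in diarization.itertracks(yield_label=True):
       print(f"{turn.start:.1f}s -> {turn.end:.1f}s: {speaker}")  # e.g. SPEAKER_01
   ```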
## Usage

1. Run the Application:

   ```bash
   python app.py
   ```
2. Access the Web Interface:

   - Open your browser and navigate to http://127.0.0.1:5000/.
3. Upload and Process a Video:

   - Use the provided form (with your updated `index.html`) to upload a video.
   - The application will extract the audio, perform speaker diarization and transcription, and generate a WebVTT file with speaker placeholders (one plausible merge step is sketched below).
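   A sketch of that merge, assuming Whisper segments and pyannote turns shaped as in the earlier sketches (not necessarily the exact logic in `app.py`):

   ```python
   def fmt(t: float) -> str:
       """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
       h, rem = divmod(t, 3600)
       m, s = divmod(rem, 60)
       return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

   def speaker_for(seg: dict, turns: list) -> str:
       """Pick the diarization label that overlaps this segment the most."""
       best, label = 0.0, "Unknown"
       for start, end, spk in turns:  # (start, end, label) tuples
           overlap = min(seg["end"], end) - max(seg["start"], start)
           if overlap > best:
               best, label = overlap, spk
       return label

   def write_vtt(segments: list, turns: list, path: str) -> None:
       """Write Whisper segments as WebVTT cues tagged with speaker placeholders."""
       with open(path, "w", encoding="utf-8") as f:
           f.write("WEBVTT\n\n")
           for seg in segments:  # each segment carries "start", "end", "text"
               f.write(f"{fmt(seg['start'])} --> {fmt(seg['end'])}\n")
               f.write(f"[{speaker_for(seg, turns)}]: {seg['text'].strip()}\n\n")
   ```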
4. Edit Speaker Names:

   - After processing, the app displays a preview of the video (with the generated subtitles attached) on the left and a form on the right.
   - Update the speaker placeholders (e.g., `[SPEAKER_01]`) with the actual speaker names; applying them can be as simple as the substitution sketched below.
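   A minimal sketch of that substitution, assuming a `names` mapping built from the form (hypothetical helper, not necessarily what `app.py` does):

   ```python
   def rename_speakers(vtt_text: str, names: dict) -> str:
       """Replace placeholder labels like [SPEAKER_01] with user-supplied names."""
       for placeholder, name in names.items():
           vtt_text = vtt_text.replace(f"[{placeholder}]", f"[{name}]")
       return vtt_text

   # e.g. rename_speakers(text, {"SPEAKER_01": "Alice", "SPEAKER_02": "Bob"})
   ```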
5. Download the Final Files:

   - After updating, download the final WebVTT file and the transcript from the final page.
## Project Structure

```
myapp/
├── app.py              # Main Flask application code.
├── templates/
│   ├── index.html      # Upload form (updated version provided by the user).
│   ├── edit.html       # Speaker name editing and video preview page.
│   └── final.html      # Final download links for WebVTT and transcript.
├── README.md           # This file.
└── .gitignore          # Git ignore file.
```
## Notes

- Temporary Files: Uploaded video files are stored in the system's temporary directory for preview purposes. Consider implementing a cleanup strategy for production.
- Error Handling: If speaker diarization fails, the app will default to using "Unknown" as the speaker label (see the sketch after this list).
- Scalability: This project is intended as a proof-of-concept. For production use, consider background task queues (e.g., Celery) and persistent storage solutions.
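A hedged sketch of that fallback, reusing the hypothetical `pipeline` and turn-list shape from the earlier sketches:

```python
import logging

def diarize_or_fallback(pipeline, audio_path: str) -> list:
    """Return (start, end, label) turns, or an empty list if diarization fails."""
    try:
        diarization = pipeline(audio_path)
        return [(turn.start, turn.end, spk)
                for turn, _, spk in diarization.itertracks(yield_label=True)]
    except Exception as exc:
        logging.warning("Diarization failed, using 'Unknown': %s", exc)
        return []  # speaker_for() then labels every segment "Unknown"
```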
## License

This project is licensed under the MIT License.