
🎯 Transcriber v2.0 - Modular Transcription Pipeline

A comprehensive, modular transcription pipeline featuring AI-enhanced accuracy, flexible output formats, and professional-grade subtitle generation.

✨ Features

  • 🔊 Advanced Audio Processing - Noise reduction and optimization for better transcription
  • 🧠 Multiple Whisper Models - From fast (tiny) to most accurate (large)
  • 🤖 AI Context Correction - GPT-4 powered grammar and homophone fixes
  • 🌐 Multi-language Translation - Translate transcripts while preserving timing
  • 🎭 Speaker Diarization - Automatic speaker identification
  • 📄 Multiple Output Formats - FCPXML, ITT, Markdown, JSON
  • 🔧 Modular Architecture - Use individual modules or the unified pipeline
  • 💡 User-Friendly Interface - Clear explanations and smart defaults

🚀 Quick Start

1. Installation

# Clone or download the project
cd transcriber-v2.0

# Run the installation script
./install.sh

# Or install manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Set Up API Keys (Optional)

# Copy the template and add your API keys
cp .env.template .env
# Edit .env and add your API keys:
# - OpenAI API key for AI features (context correction, translation)
# - Hugging Face token for advanced speaker diarization
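The pipeline reads these keys through python-dotenv. For illustration, here is a stdlib-only equivalent of load_dotenv(); the parsing is deliberately simplified (no quoting or export rules):

```python
import os

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv():
    parse KEY=value lines and export them to the process
    environment, skipping blanks and # comments."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: a key already set in the shell wins
            os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("OPENAI_API_KEY")  # None if not configured
```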

OpenAI API Key (Optional)

  • Purpose: Enables AI context correction and multi-language translation
  • Get your key: OpenAI API Keys
  • Add to .env: OPENAI_API_KEY=sk-your_key_here
  • Without it: Basic transcription still works, but no AI enhancements

Hugging Face Token (Optional)

  • Purpose: Enables advanced AI-powered speaker diarization
  • Get your token: Hugging Face Tokens
  • Add to .env: HUGGINGFACE_TOKEN=hf_your_token_here
  • Without it: Falls back to simple timing-based speaker detection
  • Note: Required to access the pyannote/speaker-diarization-3.1 model
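The timing-based fallback can be sketched as follows: when the silence between consecutive segments exceeds a threshold, assume the speaker changed. This is an illustrative heuristic, not the exact logic in diarize_transcript.py; the one-second threshold and two-speaker assumption are invented for the example:

```python
def assign_speakers_by_gaps(segments, gap_threshold=1.0):
    """Alternate speaker labels whenever the silence between
    segments exceeds gap_threshold seconds (hypothetical heuristic)."""
    speaker = 0
    labeled = []
    prev_end = None
    for seg in segments:
        if prev_end is not None and seg["start"] - prev_end > gap_threshold:
            speaker = 1 - speaker  # assume a two-speaker exchange
        labeled.append({**seg, "speaker": f"SPEAKER_{speaker:02d}"})
        prev_end = seg["end"]
    return labeled

segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello there."},
    {"start": 4.2, "end": 6.0, "text": "Hi, how are you?"},    # 1.7 s gap
    {"start": 6.1, "end": 8.0, "text": "Doing well, thanks."}, # 0.1 s gap
]
labels = [s["speaker"] for s in assign_speakers_by_gaps(segments)]
print(labels)  # ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_01']
```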

3. Run the Pipeline

# Start the interactive pipeline
python run_transcription_pipeline_v2.py

# Follow the guided interface to:
# 1. Choose output format (FCPXML, ITT, Markdown, JSON)
# 2. Select input files
# 3. Configure processing options
# 4. Generate results

📋 Requirements

System Requirements

  • Python 3.8+ (Python 3.9+ recommended)
  • Disk Space: ~3-4GB for full installation with models
  • RAM: 4GB minimum, 8GB+ recommended for large models
  • FFmpeg (for audio preprocessing - highly recommended)

Core Dependencies

The following packages are automatically installed via pip install -r requirements.txt:

Audio & ML Processing

  • torch (~500MB+) - PyTorch for ML model support
  • torchaudio - Audio processing for PyTorch
  • openai-whisper - Speech transcription models
  • pyannote.audio - AI speaker diarization (requires HF token)
  • whisperx - Enhanced Whisper with word-level alignment
  • librosa - Advanced audio analysis (optional)
  • numpy, scipy - Mathematical operations
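For reference, openai-whisper's transcribe() returns a dict with a top-level text field and a segments list carrying per-segment timing; the modules in this pipeline pass that segment list between steps. A sketch of the shape, with no model loaded:

```python
# Shape of an openai-whisper result (abridged): a "text" field plus
# a "segments" list with per-segment start/end times in seconds.
result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.8, "text": " Hello world."},
        {"id": 1, "start": 1.8, "end": 3.5, "text": " This is a test."},
    ],
}

def flatten_text(result):
    """Rejoin segment texts, mirroring the top-level 'text' field."""
    return "".join(seg["text"] for seg in result["segments"])

print(flatten_text(result))  # prints " Hello world. This is a test."
```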

API & Integration

  • openai - OpenAI API client for AI features
  • python-dotenv - Environment variable management
  • huggingface_hub - Hugging Face model access

Output Processing

  • lxml - XML processing (FCPXML, ITT generation)
  • tqdm - Progress bars
  • ffprobe-python - Video metadata extraction

External Dependencies

  • FFmpeg - Audio/video processing (install separately)
    • macOS: brew install ffmpeg
    • Ubuntu: sudo apt-get install ffmpeg
    • Windows: Download from ffmpeg.org

API Keys (Optional)

  • OpenAI API Key (optional - for AI context correction and translation)
  • Hugging Face Token (optional - for advanced speaker diarization)

πŸ— Architecture

The v2.0 system is built on modular principles:

  • Modularity: Each script does one thing well
  • User Choice: Maximum flexibility at each decision point
  • Transparency: Clear cost estimates and feature explanations
  • Quality: Professional-grade outputs for video production workflows

🔧 Individual Modules

Each module can be used independently:

# 1. Basic transcription
python scripts/transcribe.py --input-dir input --output-dir output --model base --preprocessing

# 2. Speaker diarization
python scripts/diarize_transcript.py --input-dir transcripts --output-dir diarized

# 3. AI context correction
python scripts/context_correct_transcript.py --input-dir transcripts --output-dir corrected --api-key $OPENAI_API_KEY

# 4. Translation
python scripts/translate_transcript.py --transcript-dir transcripts --output-dir translated --target-language Spanish --api-key $OPENAI_API_KEY

# 5. Generate outputs
python scripts/generate_fcpxml.py --input-dir transcripts --output-dir fcpxml
python scripts/generate_itt.py --input-dir transcripts --output-dir itt
python scripts/generate_markdown.py --input-dir transcripts --output-dir markdown --include-timecodes --include-speakers
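As a taste of what the output generators do internally, subtitle formats such as ITT (TTML) express time as HH:MM:SS.mmm clock values. A helper along these lines could back generate_itt.py; the function name and rounding behavior are assumptions for illustration:

```python
def itt_timestamp(seconds):
    """Format a float second offset as the HH:MM:SS.mmm clock-time
    style used in TTML/ITT subtitle files."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

print(itt_timestamp(3725.5))  # 01:02:05.500
```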

Modular Architecture

  1. Transcription (Whisper)

    • Script: transcribe.py
    • Function: Convert audio to text with timestamps.
    • Options: Select model size (tiny, base, small, medium, large).
  2. Diarization

    • Script: diarize_transcript.py
    • Function: Add speaker labels to transcript using AI models.
    • Options: Advanced AI-based (requires HF token) or simple timing-based.
    • Models: pyannote.audio for professional speaker identification.
  3. Context Correction (AI)

    • Script: context_correct_transcript.py
    • Function: Automatic homophone/grammar correction.
    • Options: Enable/disable context correction.
  4. Translation (OpenAI)

    • Script: translate_transcript.py
    • Function: Translate text to selected language.
    • Options: Enable/disable translation, select target language.
  5. Subtitle Generation (FCPXML, ITT)

    • Scripts: generate_fcpxml.py, generate_itt.py
    • Function: Convert transcript to subtitle files.
    • Options: Include timecodes, speaker names.
  6. Markdown Generation

    • Script: generate_markdown.py
    • Function: Output readable transcripts.
    • Options: Include timecodes, speaker names.
  7. JSON Export

    • Script: export_json.py
    • Function: Output raw transcription data.
    • Options: Enable/disable preprocessing.
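Chained together, the modules above form the unified pipeline: each step consumes and returns the same segment list. A toy end-to-end sketch with placeholder step functions (the real scripts call Whisper and GPT-4; nothing here does):

```python
def transcribe(path):
    # Placeholder: the real step runs Whisper on the media file.
    return [{"start": 0.0, "end": 2.0, "text": "their going home"}]

def context_correct(segments):
    # Placeholder: the real step asks GPT-4 to fix homophones/grammar.
    fixes = {"their going": "they're going"}
    out = []
    for seg in segments:
        text = seg["text"]
        for wrong, right in fixes.items():
            text = text.replace(wrong, right)
        out.append({**seg, "text": text})
    return out

def generate_markdown(segments):
    # Placeholder renderer: one timestamped line per segment.
    return "\n".join(f"[{s['start']:.1f}s] {s['text']}" for s in segments)

# Chain the steps, as run_transcription_pipeline_v2.py would:
segments = context_correct(transcribe("input/sample_video.mp4"))
print(generate_markdown(segments))  # [0.0s] they're going home
```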

Detailed Testing Matrix

Source Files

Primary Test Video: sample_video.mp4 (example test file)

  • Size: Variable depending on your test file
  • Should contain clear speech suitable for transcription, translation, and diarization testing
  • Use the same test file for ALL test scenarios to ensure consistent comparison across routes

Test Scenarios

1. FCPXML Route Tests

fcpxml/
├── basic_english/                   # Raw transcription → FCPXML
├── corrected_english/               # Transcription → Context Correct → FCPXML
├── translated_spanish/              # Transcription → Translate(Spanish) → FCPXML
├── corrected_translated_mandarin/   # Transcription → Context Correct → Translate(Mandarin) → FCPXML
└── model_comparison/                # Same file with different Whisper models

2. ITT Route Tests

itt/
├── basic_english/                  # Raw transcription → ITT
├── corrected_english/              # Transcription → Context Correct → ITT
├── translated_french/              # Transcription → Translate(French) → ITT
└── multilanguage_comparison/       # Same content in multiple target languages

3. Markdown Route Tests

markdown/
├── basic_transcript/               # Raw transcription → Markdown (no speakers, no timecodes)
├── with_timecodes/                 # Raw transcription → Markdown (timecodes only)
├── with_speakers/                  # Transcription → Diarize → Markdown (speakers, no timecodes)
├── full_featured/                  # Transcription → Diarize → Markdown (speakers + timecodes)
├── corrected_content/              # Transcription → Context Correct → Markdown
├── translated_content/             # Transcription → Translate → Markdown
└── complete_pipeline/              # Transcription → Diarize → Context Correct → Translate → Markdown

4. JSON Route Tests

json/
├── raw_whisper_output/             # Pure Whisper transcription
├── diarized_content/               # Transcription → Diarize → JSON
├── context_corrected/              # Transcription → Context Correct → JSON
├── translated_versions/            # Transcription → Translate → JSON
└── full_processing/                # All processing steps → JSON

Implementation Roadmap

Phase 1: Extract Core Modules (from existing v1.0 codebase)

  1. Extract transcribe.py from existing transcription logic
  2. Extract context_correct_transcript.py from context correction functions
  3. Copy and adapt translate_transcript.py (already modular)
  4. Extract generate_fcpxml.py from FCPXML generation logic
  5. Extract generate_itt.py from ITT generation logic
  6. Create new generate_markdown.py with flexible options
  7. Create diarize_transcript.py from existing diarization logic

Phase 2: Build New Unified Interface

  1. Create run_transcription_pipeline_v2.py that orchestrates modules
  2. Implement decision tree with all user choices
  3. Preserve existing features: cost estimation, video metadata, JSON reuse
  4. Add new features: flexible markdown options, module selection

Phase 3: Autonomous Testing

  1. Create test runner script
  2. Execute all test scenarios automatically
  3. Generate outputs in organized directory structure
  4. Create validation reports

πŸ“ Project Structure

transcriber-v2.0/
├── README.md                        # Project documentation
├── LICENSE                          # MIT License
├── requirements.txt                 # Python dependencies
├── setup.py                         # Package setup
├── install.sh                       # Installation script
├── .env.template                    # Environment template
├── .gitignore                       # Git ignore rules
├── run_transcription_pipeline_v2.py # Unified pipeline interface
├── scripts/                         # Modular scripts
│   ├── transcribe.py                # Whisper transcription
│   ├── diarize_transcript.py        # Speaker identification
│   ├── context_correct_transcript.py # AI grammar correction
│   ├── translate_transcript.py      # Multi-language translation
│   ├── generate_fcpxml.py           # Final Cut Pro XML
│   ├── generate_itt.py              # ITT subtitles
│   ├── generate_markdown.py         # Readable transcripts
│   └── run_all_tests.py             # Automated testing
├── input/                           # Input media files
├── output/                          # Generated results
├── tests/                           # Unit tests
├── docs/                            # Documentation
└── sample_inputs/                   # Sample files

🎛 Processing Options

Audio Preprocessing (Recommended)

  • Noise reduction using bandpass filtering
  • 16kHz resampling for optimal Whisper performance
  • Automatic fallback if FFmpeg unavailable
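The preprocessing steps above translate to an FFmpeg invocation roughly like the one built below; the band-pass cutoffs are illustrative values, not necessarily the ones the pipeline uses:

```python
def build_ffmpeg_cmd(src, dst):
    """Build (but do not run) an FFmpeg command that band-pass
    filters speech and resamples to 16 kHz mono for Whisper."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "highpass=f=80,lowpass=f=8000",  # crude noise reduction
        "-ar", "16000", "-ac", "1",             # 16 kHz, mono
        dst,
    ]

# Pass the list to subprocess.run() when FFmpeg is available:
print(" ".join(build_ffmpeg_cmd("clip.mp4", "clip_16k.wav")))
```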

AI Enhancements (Optional)

  • Context Correction: Fixes grammar and homophones using GPT-4
  • Translation: Supports 8+ languages with natural translation
  • Cost transparency: clear cost estimates shown before processing
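A cost estimate of the kind shown before processing can be approximated from word count; the tokens-per-word ratio and the per-1K-token rate below are placeholder assumptions, so substitute current OpenAI pricing for your model:

```python
def estimate_correction_cost(word_count, usd_per_1k_tokens=0.01):
    """Rough pre-flight cost estimate for AI context correction.
    Assumes ~1.3 tokens per word and a hypothetical per-1K-token
    rate; both numbers are illustrative, not real pricing."""
    tokens = word_count * 1.3
    return round(tokens / 1000 * usd_per_1k_tokens, 4)

# A 20-minute talk at ~130 wpm is roughly 2,600 words:
print(f"~${estimate_correction_cost(2600):.4f}")  # ~$0.0338
```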

Output Formats

  • FCPXML: Professional subtitles for Final Cut Pro
  • ITT: Standard subtitles for video players
  • Markdown: Human-readable transcripts with speakers/timecodes
  • JSON: Raw data for custom processing
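As an illustration of the Markdown route, here is a minimal renderer over the pipeline's segment shape; the exact formatting in generate_markdown.py may differ, but the options mirror its --include-timecodes / --include-speakers flags:

```python
def to_markdown(segments, include_timecodes=True, include_speakers=True):
    """Render segments as a readable Markdown transcript,
    optionally prefixing timecodes and speaker labels."""
    lines = []
    for seg in segments:
        parts = []
        if include_timecodes:
            m, s = divmod(int(seg["start"]), 60)
            parts.append(f"**[{m:02d}:{s:02d}]**")
        if include_speakers and "speaker" in seg:
            parts.append(f"**{seg['speaker']}:**")
        parts.append(seg["text"].strip())
        lines.append(" ".join(parts))
    return "\n\n".join(lines)

segments = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00", "text": " Welcome back."},
    {"start": 65.0, "end": 67.0, "speaker": "SPEAKER_01", "text": " Thanks for having me."},
]
print(to_markdown(segments))
```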

🧔 Testing

# Run all automated tests
python scripts/run_all_tests.py

# Run unit tests
python -m pytest tests/

# Test individual modules
python scripts/transcribe.py --help

βš™οΈ Troubleshooting

Common Installation Issues

PyTorch Installation Problems

# If torch installation fails, try installing separately:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# For GPU support (NVIDIA):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

FFmpeg Not Found

  • macOS: brew install ffmpeg
  • Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
  • Windows: Download from ffmpeg.org and add to PATH

Speaker Diarization Issues

# If pyannote.audio fails to load models:
# 1. Ensure you have a valid Hugging Face token
# 2. Accept the model license at: https://huggingface.co/pyannote/speaker-diarization-3.1
# 3. Check token permissions include model access

Memory Issues

  • Large Whisper models: Use smaller models (tiny/base) for limited RAM
  • Speaker diarization: Disable if encountering OOM errors
  • Long audio files: Process in shorter segments

Network/Download Issues

# If model downloads fail:
# 1. Check internet connection
# 2. Verify Hugging Face token is valid
# 3. Try downloading models manually:
python -c "import whisper; whisper.load_model('base')"

📖 Documentation

  • Interactive Pipeline: Run python run_transcription_pipeline_v2.py for guided setup
  • Module Help: Each script has --help for detailed options
  • Cost Estimation: AI features show cost estimates before processing
  • Error Handling: Graceful fallbacks and clear error messages

🤝 Contributing

The modular architecture makes it easy to:

  • Add new output formats
  • Integrate different AI models
  • Customize processing steps
  • Add new languages

📄 License

MIT License - see LICENSE file for details.


Transcriber v2.0 - Professional transcription made simple and modular.