
🎯 Transcriber v2.0 - Modular Transcription Pipeline

A comprehensive, modular transcription pipeline featuring AI-enhanced accuracy, flexible output formats, and professional-grade subtitle generation.

✨ Features

  • 🔊 Advanced Audio Processing - Noise reduction and optimization for better transcription
  • 🧠 Multiple Whisper Models - From fast (tiny) to most accurate (large)
  • 🤖 AI Context Correction - GPT-4 powered grammar and homophone fixes
  • 🌐 Multi-language Translation - Translate transcripts while preserving timing
  • 🎭 Speaker Diarization - Automatic speaker identification
  • 📄 Multiple Output Formats - FCPXML, ITT, Markdown, JSON
  • 🔧 Modular Architecture - Use individual modules or the unified pipeline
  • 💡 User-Friendly Interface - Clear explanations and smart defaults

🚀 Quick Start

1. Installation

# Clone or download the project
cd transcriber-v2.0

# Run the installation script
./install.sh

# Or install manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Set Up API Keys (Optional)

# Copy the template and add your API keys
cp .env.template .env
# Edit .env and add your API keys:
# - OpenAI API key for AI features (context correction, translation)
# - Hugging Face token for advanced speaker diarization
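The pipeline reads these keys through python-dotenv. For illustration, here is a stdlib-only equivalent of load_dotenv(); the parsing is deliberately simplified (no quoting or export rules):

```python
import os

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv():
    parse KEY=value lines and export them to the process
    environment, skipping blanks and # comments."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: a key already set in the shell wins
            os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("OPENAI_API_KEY")  # None if not configured
```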

OpenAI API Key (Optional)

  • Purpose: Enables AI context correction and multi-language translation
  • Get your key: OpenAI API Keys
  • Add to .env: OPENAI_API_KEY=sk-your_key_here
  • Without it: Basic transcription still works, but no AI enhancements

Hugging Face Token (Optional)

  • Purpose: Enables advanced AI-powered speaker diarization
  • Get your token: Hugging Face Tokens
  • Add to .env: HUGGINGFACE_TOKEN=hf_your_token_here
  • Without it: Falls back to simple timing-based speaker detection
  • Note: Required to access the pyannote/speaker-diarization-3.1 model
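The timing-based fallback can be sketched as follows: when the silence between consecutive segments exceeds a threshold, assume the speaker changed. This is an illustrative heuristic, not the exact logic in diarize_transcript.py; the one-second threshold and two-speaker assumption are invented for the example:

```python
def assign_speakers_by_gaps(segments, gap_threshold=1.0):
    """Alternate speaker labels whenever the silence between
    segments exceeds gap_threshold seconds (hypothetical heuristic)."""
    speaker = 0
    labeled = []
    prev_end = None
    for seg in segments:
        if prev_end is not None and seg["start"] - prev_end > gap_threshold:
            speaker = 1 - speaker  # assume a two-speaker exchange
        labeled.append({**seg, "speaker": f"SPEAKER_{speaker:02d}"})
        prev_end = seg["end"]
    return labeled

segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello there."},
    {"start": 4.2, "end": 6.0, "text": "Hi, how are you?"},    # 1.7 s gap
    {"start": 6.1, "end": 8.0, "text": "Doing well, thanks."}, # 0.1 s gap
]
labels = [s["speaker"] for s in assign_speakers_by_gaps(segments)]
print(labels)  # ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_01']
```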

3. Run the Pipeline

# Start the interactive pipeline
python run_transcription_pipeline_v2.py

# Follow the guided interface to:
# 1. Choose output format (FCPXML, ITT, Markdown, JSON)
# 2. Select input files
# 3. Configure processing options
# 4. Generate results

📋 Requirements

System Requirements

  • Python 3.8+ (Python 3.9+ recommended)
  • Disk Space: ~3-4GB for full installation with models
  • RAM: 4GB minimum, 8GB+ recommended for large models
  • FFmpeg (for audio preprocessing - highly recommended)

Core Dependencies

The following packages are automatically installed via pip install -r requirements.txt:

Audio & ML Processing

  • torch (~500MB+) - PyTorch for ML model support
  • torchaudio - Audio processing for PyTorch
  • openai-whisper - Speech transcription models
  • pyannote.audio - AI speaker diarization (requires HF token)
  • whisperx - Enhanced Whisper with word-level alignment
  • librosa - Advanced audio analysis (optional)
  • numpy, scipy - Mathematical operations
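For reference, openai-whisper's transcribe() returns a dict with a top-level text field and a segments list carrying per-segment timing; the modules in this pipeline pass that segment list between steps. A sketch of the shape, with no model loaded:

```python
# Shape of an openai-whisper result (abridged): a "text" field plus
# a "segments" list with per-segment start/end times in seconds.
result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.8, "text": " Hello world."},
        {"id": 1, "start": 1.8, "end": 3.5, "text": " This is a test."},
    ],
}

def flatten_text(result):
    """Rejoin segment texts, mirroring the top-level 'text' field."""
    return "".join(seg["text"] for seg in result["segments"])

print(flatten_text(result))  # prints " Hello world. This is a test."
```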

API & Integration

  • openai - OpenAI API client for AI features
  • python-dotenv - Environment variable management
  • huggingface_hub - Hugging Face model access

Output Processing

  • lxml - XML processing (FCPXML, ITT generation)
  • tqdm - Progress bars
  • ffprobe-python - Video metadata extraction

External Dependencies

  • FFmpeg - Audio/video processing (install separately)
    • macOS: brew install ffmpeg
    • Ubuntu: sudo apt-get install ffmpeg
    • Windows: Download from ffmpeg.org

API Keys (Optional)

  • OpenAI API Key (optional - for AI context correction and translation)
  • Hugging Face Token (optional - for advanced speaker diarization)

πŸ— Architecture

The v2.0 system is built on modular principles:

  • Modularity: Each script does one thing well
  • User Choice: Maximum flexibility at each decision point
  • Transparency: Clear cost estimates and feature explanations
  • Quality: Professional-grade outputs for video production workflows

🔧 Individual Modules

Each module can be used independently:

# 1. Basic transcription
python scripts/transcribe.py --input-dir input --output-dir output --model base --preprocessing

# 2. Speaker diarization
python scripts/diarize_transcript.py --input-dir transcripts --output-dir diarized

# 3. AI context correction
python scripts/context_correct_transcript.py --input-dir transcripts --output-dir corrected --api-key $OPENAI_API_KEY

# 4. Translation
python scripts/translate_transcript.py --transcript-dir transcripts --output-dir translated --target-language Spanish --api-key $OPENAI_API_KEY

# 5. Generate outputs
python scripts/generate_fcpxml.py --input-dir transcripts --output-dir fcpxml
python scripts/generate_itt.py --input-dir transcripts --output-dir itt
python scripts/generate_markdown.py --input-dir transcripts --output-dir markdown --include-timecodes --include-speakers
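As a taste of what the output generators do internally, subtitle formats such as ITT (TTML) express time as HH:MM:SS.mmm clock values. A helper along these lines could back generate_itt.py; the function name and rounding behavior are assumptions for illustration:

```python
def itt_timestamp(seconds):
    """Format a float second offset as the HH:MM:SS.mmm clock-time
    style used in TTML/ITT subtitle files."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

print(itt_timestamp(3725.5))  # 01:02:05.500
```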

Modular Architecture

  1. Transcription (Whisper)

    • Script: transcribe.py
    • Function: Convert audio to text with timestamps.
    • Options: Select model size (tiny, base, small, medium, large).
  2. Diarization

    • Script: diarize_transcript.py
    • Function: Add speaker labels to transcript using AI models.
    • Options: Advanced AI-based (requires HF token) or simple timing-based.
    • Models: pyannote.audio for professional speaker identification.
  3. Context Correction (AI)

    • Script: context_correct_transcript.py
    • Function: Automatic homophone/grammar correction.
    • Options: Enable/disable context correction.
  4. Translation (OpenAI)

    • Script: translate_transcript.py
    • Function: Translate text to selected language.
    • Options: Enable/disable translation, select target language.
  5. Subtitle Generation (FCPXML, ITT)

    • Scripts: generate_fcpxml.py, generate_itt.py
    • Function: Convert transcript to subtitle files.
    • Options: Include timecodes, speaker names.
  6. Markdown Generation

    • Script: generate_markdown.py
    • Function: Output readable transcripts.
    • Options: Include timecodes, speaker names.
  7. JSON Export

    • Script: export_json.py
    • Function: Output raw transcription data.
    • Options: Enable/disable preprocessing.
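Chained together, the modules above form the unified pipeline: each step consumes and returns the same segment list. A toy end-to-end sketch with placeholder step functions (the real scripts call Whisper and GPT-4; nothing here does):

```python
def transcribe(path):
    # Placeholder: the real step runs Whisper on the media file.
    return [{"start": 0.0, "end": 2.0, "text": "their going home"}]

def context_correct(segments):
    # Placeholder: the real step asks GPT-4 to fix homophones/grammar.
    fixes = {"their going": "they're going"}
    out = []
    for seg in segments:
        text = seg["text"]
        for wrong, right in fixes.items():
            text = text.replace(wrong, right)
        out.append({**seg, "text": text})
    return out

def generate_markdown(segments):
    # Placeholder renderer: one timestamped line per segment.
    return "\n".join(f"[{s['start']:.1f}s] {s['text']}" for s in segments)

# Chain the steps, as run_transcription_pipeline_v2.py would:
segments = context_correct(transcribe("input/sample_video.mp4"))
print(generate_markdown(segments))  # [0.0s] they're going home
```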

Detailed Testing Matrix

Source Files

Primary Test Video: sample_video.mp4 (example test file)

  • Size: Variable depending on your test file
  • Should contain clear speech suitable for transcription, translation, and diarization testing
  • Use the same test file for ALL test scenarios to ensure consistent comparison across routes

Test Scenarios

1. FCPXML Route Tests

fcpxml/
├── basic_english/                   # Raw transcription → FCPXML
├── corrected_english/               # Transcription → Context Correct → FCPXML
├── translated_spanish/              # Transcription → Translate(Spanish) → FCPXML
├── corrected_translated_mandarin/   # Transcription → Context Correct → Translate(Mandarin) → FCPXML
└── model_comparison/                # Same file with different Whisper models

2. ITT Route Tests

itt/
├── basic_english/                  # Raw transcription → ITT
├── corrected_english/              # Transcription → Context Correct → ITT
├── translated_french/              # Transcription → Translate(French) → ITT
└── multilanguage_comparison/       # Same content in multiple target languages

3. Markdown Route Tests

markdown/
├── basic_transcript/               # Raw transcription → Markdown (no speakers, no timecodes)
├── with_timecodes/                 # Raw transcription → Markdown (timecodes only)
├── with_speakers/                  # Transcription → Diarize → Markdown (speakers, no timecodes)
├── full_featured/                  # Transcription → Diarize → Markdown (speakers + timecodes)
├── corrected_content/              # Transcription → Context Correct → Markdown
├── translated_content/             # Transcription → Translate → Markdown
└── complete_pipeline/              # Transcription → Diarize → Context Correct → Translate → Markdown

4. JSON Route Tests

json/
├── raw_whisper_output/             # Pure Whisper transcription
├── diarized_content/               # Transcription → Diarize → JSON
├── context_corrected/              # Transcription → Context Correct → JSON
├── translated_versions/            # Transcription → Translate → JSON
└── full_processing/                # All processing steps → JSON

Implementation Roadmap

Phase 1: Extract Core Modules (from existing v1.0 codebase)

  1. Extract transcribe.py from existing transcription logic
  2. Extract context_correct_transcript.py from context correction functions
  3. Copy and adapt translate_transcript.py (already modular)
  4. Extract generate_fcpxml.py from FCPXML generation logic
  5. Extract generate_itt.py from ITT generation logic
  6. Create new generate_markdown.py with flexible options
  7. Create diarize_transcript.py from existing diarization logic

Phase 2: Build New Unified Interface

  1. Create run_transcription_pipeline_v2.py that orchestrates modules
  2. Implement decision tree with all user choices
  3. Preserve existing features: cost estimation, video metadata, JSON reuse
  4. Add new features: flexible markdown options, module selection

Phase 3: Autonomous Testing

  1. Create test runner script
  2. Execute all test scenarios automatically
  3. Generate outputs in organized directory structure
  4. Create validation reports

πŸ“ Project Structure

transcriber-v2.0/
├── README.md                        # Project documentation
├── LICENSE                          # MIT License
├── requirements.txt                 # Python dependencies
├── setup.py                         # Package setup
├── install.sh                       # Installation script
├── .env.template                    # Environment template
├── .gitignore                       # Git ignore rules
├── run_transcription_pipeline_v2.py # Unified pipeline interface
├── scripts/                         # Modular scripts
│   ├── transcribe.py                # Whisper transcription
│   ├── diarize_transcript.py        # Speaker identification
│   ├── context_correct_transcript.py # AI grammar correction
│   ├── translate_transcript.py      # Multi-language translation
│   ├── generate_fcpxml.py           # Final Cut Pro XML
│   ├── generate_itt.py              # ITT subtitles
│   ├── generate_markdown.py         # Readable transcripts
│   └── run_all_tests.py             # Automated testing
├── input/                           # Input media files
├── output/                          # Generated results
├── tests/                           # Unit tests
├── docs/                            # Documentation
└── sample_inputs/                   # Sample files

🎛 Processing Options

Audio Preprocessing (Recommended)

  • Noise reduction using bandpass filtering
  • 16kHz resampling for optimal Whisper performance
  • Automatic fallback if FFmpeg unavailable
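The preprocessing steps above translate to an FFmpeg invocation roughly like the one built below; the band-pass cutoffs are illustrative values, not necessarily the ones the pipeline uses:

```python
def build_ffmpeg_cmd(src, dst):
    """Build (but do not run) an FFmpeg command that band-pass
    filters speech and resamples to 16 kHz mono for Whisper."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "highpass=f=80,lowpass=f=8000",  # crude noise reduction
        "-ar", "16000", "-ac", "1",             # 16 kHz, mono
        dst,
    ]

# Pass the list to subprocess.run() when FFmpeg is available:
print(" ".join(build_ffmpeg_cmd("clip.mp4", "clip_16k.wav")))
```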

AI Enhancements (Optional)

  • Context Correction: Fixes grammar and homophones using GPT-4
  • Translation: Supports 8+ languages with natural translation
  • Cost transparency: clear cost estimates shown before processing
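A cost estimate of the kind shown before processing can be approximated from word count; the tokens-per-word ratio and the per-1K-token rate below are placeholder assumptions, so substitute current OpenAI pricing for your model:

```python
def estimate_correction_cost(word_count, usd_per_1k_tokens=0.01):
    """Rough pre-flight cost estimate for AI context correction.
    Assumes ~1.3 tokens per word and a hypothetical per-1K-token
    rate; both numbers are illustrative, not real pricing."""
    tokens = word_count * 1.3
    return round(tokens / 1000 * usd_per_1k_tokens, 4)

# A 20-minute talk at ~130 wpm is roughly 2,600 words:
print(f"~${estimate_correction_cost(2600):.4f}")  # ~$0.0338
```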

Output Formats

  • FCPXML: Professional subtitles for Final Cut Pro
  • ITT: Standard subtitles for video players
  • Markdown: Human-readable transcripts with speakers/timecodes
  • JSON: Raw data for custom processing
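As an illustration of the Markdown route, here is a minimal renderer over the pipeline's segment shape; the exact formatting in generate_markdown.py may differ, but the options mirror its --include-timecodes / --include-speakers flags:

```python
def to_markdown(segments, include_timecodes=True, include_speakers=True):
    """Render segments as a readable Markdown transcript,
    optionally prefixing timecodes and speaker labels."""
    lines = []
    for seg in segments:
        parts = []
        if include_timecodes:
            m, s = divmod(int(seg["start"]), 60)
            parts.append(f"**[{m:02d}:{s:02d}]**")
        if include_speakers and "speaker" in seg:
            parts.append(f"**{seg['speaker']}:**")
        parts.append(seg["text"].strip())
        lines.append(" ".join(parts))
    return "\n\n".join(lines)

segments = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00", "text": " Welcome back."},
    {"start": 65.0, "end": 67.0, "speaker": "SPEAKER_01", "text": " Thanks for having me."},
]
print(to_markdown(segments))
```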

🧔 Testing

# Run all automated tests
python scripts/run_all_tests.py

# Run unit tests
python -m pytest tests/

# Test individual modules
python scripts/transcribe.py --help

βš™οΈ Troubleshooting

Common Installation Issues

PyTorch Installation Problems

# If torch installation fails, try installing separately:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# For GPU support (NVIDIA):
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

FFmpeg Not Found

  • macOS: brew install ffmpeg
  • Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
  • Windows: Download from ffmpeg.org and add to PATH

Speaker Diarization Issues

# If pyannote.audio fails to load models:
# 1. Ensure you have a valid Hugging Face token
# 2. Accept the model license at: https://huggingface.co/pyannote/speaker-diarization-3.1
# 3. Check token permissions include model access

Memory Issues

  • Large Whisper models: Use smaller models (tiny/base) for limited RAM
  • Speaker diarization: Disable if encountering OOM errors
  • Long audio files: Process in shorter segments

Network/Download Issues

# If model downloads fail:
# 1. Check internet connection
# 2. Verify Hugging Face token is valid
# 3. Try downloading models manually:
python -c "import whisper; whisper.load_model('base')"

📖 Documentation

  • Interactive Pipeline: Run python run_transcription_pipeline_v2.py for guided setup
  • Module Help: Each script has --help for detailed options
  • Cost Estimation: AI features show cost estimates before processing
  • Error Handling: Graceful fallbacks and clear error messages

🤝 Contributing

The modular architecture makes it easy to:

  • Add new output formats
  • Integrate different AI models
  • Customize processing steps
  • Add new languages

📄 License

MIT License - see LICENSE file for details.


Transcriber v2.0 - Professional transcription made simple and modular.