Real-time browser-to-server audio streaming with speech recognition, translation, and text-to-speech.
- Browser audio capture with WebM/Opus encoding
- Real-time WebSocket streaming
- Voice activity detection (Silero VAD)
- Speech-to-text (Faster-Whisper)
- Machine translation (NLLB-200 via CTranslate2)
- Text-to-speech (Piper TTS)
- Live transcript and translation display
Browser                              Server
+-----------------+                  +--------------------+
| MediaRecorder   |--- WebSocket --->| FastAPI            |
| (WebM/Opus)     |                  | PyAV Decoder       |
| Web Audio API   |                  | Silero VAD         |
| AudioContext    |<-- JSON/binary --| Faster-Whisper ASR |
| (TTS playback)  |                  | NLLB-200 MT        |
+-----------------+                  | Piper TTS          |
                                     +--------------------+
Pipeline: Audio -> VAD -> ASR -> MT -> TTS -> Browser playback
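
To make the stage order above concrete, here is a minimal sketch that chains the stages for a single audio segment. Every name in it (`run_pipeline`, `PipelineResult`, and the stage callables) is a hypothetical placeholder rather than the API of the modules in `backend/services/`. The stages are stubs, so the flow runs without any models.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    transcript: str
    translation: str
    tts_wav: Optional[bytes]  # None when TTS is toggled off


def run_pipeline(
    pcm: bytes,
    is_speech: Callable[[bytes], bool],   # VAD stage (hypothetical signature)
    transcribe: Callable[[bytes], str],   # ASR stage
    translate: Callable[[str], str],      # MT stage
    synthesize: Callable[[str], bytes],   # TTS stage
    tts_enabled: bool = True,
) -> Optional[PipelineResult]:
    """Run one audio segment through VAD -> ASR -> MT -> TTS."""
    if not is_speech(pcm):
        return None  # VAD gate: nothing to transcribe
    transcript = transcribe(pcm)
    translation = translate(transcript)
    wav = synthesize(translation) if tts_enabled else None
    return PipelineResult(transcript, translation, wav)


# Stub stages standing in for the real models.
result = run_pipeline(
    b"\x01\x02",
    is_speech=lambda pcm: len(pcm) > 0,
    transcribe=lambda pcm: "hello",
    translate=lambda text: "hola",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Passing the stages in as callables keeps the ordering logic testable on its own; the real services would presumably be wired in at the same points.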
- Python 3.11+
- Node.js 22+
- pnpm
- uv (used to run the backend)
cd backend
uv run uvicorn api.main:app --reload --port 8000

On first run, models are downloaded automatically:
- Faster-Whisper (ASR)
- NLLB-200-distilled-1.3B (translation, converted to CTranslate2 int8)
- Piper voice models for en_US, es_ES, pt_BR
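
Each model identifies a language differently: NLLB-200 uses FLORES-200 codes such as `eng_Latn`, while the Piper voices above are keyed by locale. The sketch below shows one way to keep the two in a single lookup table. `LanguageConfig`, `LANGUAGES`, and `resolve` are illustrative names, not taken from the project, and the exact Piper voice names are left out because the source only gives the locales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    nllb_code: str     # FLORES-200 code expected by NLLB-200
    piper_locale: str  # locale of the Piper voice model


# Hypothetical lookup table for the three locales listed above.
LANGUAGES = {
    "en": LanguageConfig("eng_Latn", "en_US"),
    "es": LanguageConfig("spa_Latn", "es_ES"),
    "pt": LanguageConfig("por_Latn", "pt_BR"),
}


def resolve(lang: str) -> LanguageConfig:
    """Return the model identifiers for a short language code."""
    try:
        return LANGUAGES[lang]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None
```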
cd frontend
pnpm install
pnpm dev

Open http://localhost:5173 and click Start Recording.
audio-streaming-poc/
├── backend/
│   ├── api/
│   │   └── main.py              # FastAPI WebSocket server
│   ├── services/
│   │   ├── asr_service.py       # Faster-Whisper ASR
│   │   ├── mt_service.py        # NLLB-200 translation
│   │   ├── tts_service.py       # Piper TTS
│   │   └── vad/
│   │       └── vad_service.py   # Silero VAD
│   ├── models/                  # Downloaded models (gitignored)
│   ├── recordings/              # Saved WAV files
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   └── AudioRecorder.tsx
│   │   └── App.tsx
│   └── package.json
└── ARCHITECTURE.md
| Layer | Technology |
|---|---|
| Frontend | React 19, TypeScript, Vite |
| Backend | FastAPI, Python 3.11+ |
| Audio | MediaRecorder API, PyAV |
| VAD | Silero VAD |
| ASR | Faster-Whisper (small) |
| Translation | NLLB-200-distilled-1.3B, CTranslate2 |
| TTS | Piper TTS (ONNX) |
| Transport | WebSocket (JSON + binary) |
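
The transport row mixes two WebSocket frame types on one connection: text frames carrying JSON and binary frames carrying audio. A receiver can tell them apart by frame type alone, as in the sketch below. The message fields (`type`, `text`, `final`) and both function names are assumptions for illustration; the actual message schema is not documented here.

```python
import json
from typing import Union

Frame = Union[str, bytes]  # text frame carries JSON, binary frame carries audio


def transcript_frame(text: str, final: bool) -> str:
    """Build a JSON text frame; the field names are hypothetical."""
    return json.dumps({"type": "transcript", "text": text, "final": final})


def describe_frame(frame: Frame) -> dict:
    """Classify a received frame by its WebSocket frame type."""
    if isinstance(frame, bytes):
        # Binary frames carry raw audio, for example a TTS WAV payload.
        return {"kind": "audio", "size": len(frame)}
    message = json.loads(frame)
    return {"kind": "json", "type": message.get("type"), "payload": message}
```

Keeping audio in binary frames avoids base64-encoding it into JSON, which would inflate every payload by roughly a third.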
- Browser microphone access
- MediaRecorder with WebM/Opus
- WebSocket streaming to server
- WAV file saving
- Silero VAD integration
- Speech start/end detection
- 500ms silence threshold
- Client-side visual feedback
- Faster-Whisper ASR (small model)
- Partial transcripts while speaking
- Final transcripts on utterance end
- Language auto-detection (en, es, pt)
- NLLB-200-distilled-1.3B via CTranslate2 (int8)
- Supports en, es, pt language pairs
- Translation runs in background threads
- Live translation display in UI
- Piper TTS with per-language voice models
- WAV audio sent as binary WebSocket frames
- Browser-side AudioContext playback queue
- Toggle to enable/disable TTS
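
The speech start/end detection with a 500ms silence threshold listed above can be sketched as a small state machine over per-frame VAD decisions. This is a simplified illustration, not the project's `vad_service.py`: it assumes the VAD has already produced one boolean per frame, and the default `frame_ms` of 32 is an assumed frame duration.

```python
from typing import Iterable, List, Tuple

SILENCE_THRESHOLD_MS = 500  # silence threshold from the feature list


def segment_utterances(
    speech_flags: Iterable[bool],
    frame_ms: int = 32,  # assumed frame duration, not taken from the project
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans; an utterance ends after 500 ms of silence."""
    utterances: List[Tuple[int, int]] = []
    start = None          # start time of the current utterance, if any
    last_speech_end = 0   # end time of the most recent speech frame
    silence_ms = 0        # silence accumulated since the last speech frame
    for i, is_speech in enumerate(speech_flags):
        t = i * frame_ms
        if is_speech:
            if start is None:
                start = t  # speech start event
            last_speech_end = t + frame_ms
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= SILENCE_THRESHOLD_MS:
                utterances.append((start, last_speech_end))  # speech end event
                start = None
                silence_ms = 0
    if start is not None:
        utterances.append((start, last_speech_end))  # flush trailing speech
    return utterances
```

Pauses shorter than the threshold stay inside one utterance, which is what allows partial transcripts to keep updating while the speaker hesitates and a final transcript to be emitted only once the utterance ends.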