“Clone any voice. Speak in anyone’s tone.”
Fake My Voice is a deep learning–based project focused on voice cloning and speaker-conditioned speech synthesis. The system generates realistic speech that sounds like a specific person using only a few seconds of their audio.
By combining Speaker Embeddings, Sequence-to-Sequence Synthesis, and Neural Vocoding, we achieve high-quality, human-like output for multi-speaker Text-to-Speech (TTS).
- 🌟 About the Project
- 🎯 Aim
- 🏗️ Architecture
- 🛠️ Tech Stack
- 📊 Dataset
- 📈 Results & Demos
- 📁 File Structure
- 🚧 Challenges Faced
- 👥 Contributors
## 🌟 About the Project

- 🎧 Voice Cloning: Replicate a target speaker's voice with minimal reference audio.
- 🤖 Deep Learning Pipeline: Utilizes state-of-the-art models like Tacotron2 and WaveGlow.
- 🔊 High Quality: Combines speaker-conditioned synthesis with neural vocoding for natural results.
## 🎯 Aim

- Build a multi-speaker system capable of replicating tone, accent, and style.
- Implement a unified pipeline consisting of Tacotron2, GE2E Speaker Encoder, and WaveGlow.
- Produce natural speech from text with minimal data per speaker.
## 🏗️ Architecture

- Extract: Derive a speaker embedding from reference audio.
- Synthesize: Use embedding + text to generate a Mel-spectrogram.
- Vocode: Feed the Mel-spectrogram into a vocoder for waveform generation.
- Output: Deliver speech mimicking the original speaker.
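The four stages above can be sketched end-to-end. Every function below is a hypothetical stand-in with toy NumPy shapes, not the real models; the point is only to show how the pieces hand data to each other:

```python
import numpy as np

def extract_embedding(ref_audio: np.ndarray) -> np.ndarray:
    """Stand-in for the GE2E speaker encoder: map reference audio
    to a fixed 256-D identity vector (a toy random projection here)."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((ref_audio.shape[0], 256))
    emb = ref_audio @ proj
    return emb / np.linalg.norm(emb)

def synthesize_mel(text: str, embedding: np.ndarray) -> np.ndarray:
    """Stand-in for Tacotron2: produce an 80-band mel-spectrogram
    whose frame count grows with the input text."""
    n_frames = 10 * len(text)
    return np.zeros((80, n_frames)) + embedding[:80, None]

def vocode(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Stand-in for WaveGlow: `hop` waveform samples per mel frame."""
    return np.zeros(mel.shape[1] * hop)

ref = np.ones(16000)                # 1 s of dummy reference audio
emb = extract_embedding(ref)        # Extract
mel = synthesize_mel("hello", emb)  # Synthesize
wav = vocode(mel)                   # Vocode
print(emb.shape, mel.shape, wav.shape)
```

In the real pipeline the embedding is concatenated with the Tacotron2 encoder outputs, but the data flow is exactly this chain.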
## 🛠️ Tech Stack

| Category | Tools / Frameworks |
|---|---|
| Language | Python 🐍 |
| Frameworks | PyTorch, NumPy, Librosa |
| Audio Tools | Torchaudio, SoundFile, Matplotlib |
| Models | GE2E Encoder, Tacotron2, WaveGlow |
| Visualization | TensorBoard 📈 |
### Speaker Encoder (GE2E)

- Extracts a fixed-dimensional embedding (256-D) capturing vocal identity.
- Uses Generalized End-to-End (GE2E) loss to cluster same-speaker embeddings closely and separate different speakers.
- Input: Short raw audio samples.
- Output: Speaker embedding vector.
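A minimal NumPy sketch of the GE2E softmax loss follows. It is simplified: the paper also defines a contrast variant and excludes each utterance from its own speaker's centroid, both omitted here for clarity. The scale `w` and bias `b` are learned parameters in the real encoder; fixed values are assumed below:

```python
import numpy as np

def ge2e_softmax_loss(embeds: np.ndarray, w: float = 10.0, b: float = -5.0) -> float:
    """Simplified GE2E softmax loss over a batch shaped
    (n_speakers, n_utterances, dim): each utterance should be most
    similar to its own speaker's centroid."""
    embeds = embeds / np.linalg.norm(embeds, axis=-1, keepdims=True)
    centroids = embeds.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)
    # scaled cosine similarity of every utterance to every centroid
    sim = w * np.einsum("sud,kd->suk", embeds, centroids) + b
    # log-softmax over centroids, then pick the true-speaker entry
    log_p = sim - np.log(np.exp(sim).sum(axis=2, keepdims=True))
    n_spk = embeds.shape[0]
    return float(-log_p[np.arange(n_spk), :, np.arange(n_spk)].mean())

# Tightly clustered, well-separated speakers -> near-zero loss
tight = np.repeat(np.eye(3)[:, None, :], 4, axis=1)  # 3 speakers, 4 utts each
print(ge2e_softmax_loss(tight))
```

Minimizing this loss is what pulls same-speaker embeddings together and pushes different speakers apart.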
### Synthesizer (Tacotron2)

- Converts text to Mel-spectrograms conditioned on the speaker embedding.
- Encoder: Processes phoneme/character embeddings with convolutional and recurrent layers.
- Location-Sensitive Attention: Ensures smooth, monotonic progression and prevents word-skipping.
- Decoder: Autoregressively generates frames using an $L_1$ loss.
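The decoder's objective on mel frames is just the mean absolute error between predicted and ground-truth spectrograms; a toy check (80 mel bands assumed):

```python
import numpy as np

def mel_l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and ground-truth
    mel-spectrogram frames."""
    assert pred.shape == target.shape, "frame counts must match"
    return float(np.abs(pred - target).mean())

target = np.ones((80, 100))        # 80 mel bands x 100 frames
pred = np.full((80, 100), 0.75)
print(mel_l1_loss(pred, target))   # -> 0.25
```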
### Vocoder (WaveGlow)

- Converts high-resolution Mel-spectrograms into time-domain waveforms.
- Input: Mel-spectrogram from Tacotron2.
- Output: Final waveform audio in the cloned voice.
## 📊 Dataset

| Dataset | Purpose | Details |
|---|---|---|
| LJSpeech | Tacotron2 | ~13k samples, Single Speaker |
| VoxCeleb | Speaker Embeddings | ~5k samples, 40 English Speakers |
| VCTK Corpus | Merged Model | 44 Hours, 109 English Speakers |
Note: Audio was preprocessed by resampling from 48 kHz to 22.05 kHz, normalizing, and truncating clips to $\le 4$ seconds.
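That preprocessing can be sketched as follows. The linear-interpolation resampler is a deliberate simplification; a proper resampler (e.g. `librosa.resample`) applies an anti-aliasing filter before downsampling:

```python
import numpy as np

def preprocess(audio: np.ndarray, sr_in: int = 48000,
               sr_out: int = 22050, max_sec: float = 4.0) -> np.ndarray:
    """Resample (naive linear interpolation), peak-normalize,
    and truncate to at most max_sec seconds."""
    n_out = int(len(audio) * sr_out / sr_in)
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    audio = np.interp(t_out, t_in, audio)
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak
    return audio[: int(max_sec * sr_out)]

clip = np.sin(np.linspace(0, 2 * np.pi * 440 * 6, 48000 * 6))  # 6 s, 48 kHz
out = preprocess(clip)
print(len(out) / 22050)  # -> 4.0 (seconds)
```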
## 📈 Results & Demos

Achieved smooth Mel-spectrogram prediction after ~60 epochs.
Text: "Printing, in the only sense with which we are at present concerned"
| Expected Mel | Predicted Mel | Predicted Frames |
|---|---|---|
| ![]() | ![]() | ![]() |
Achieved stable separation across 40+ speakers. t-SNE plots show distinct identity clustering.

Download/Listen to Audio Sample
The audio says: "Use this model to clone the voice of any user"
## 📁 File Structure

    Fake_My_Voice/
    ├── Multi-Speaker-TTS/   # Multi-speaker synthesis logic
    ├── SingleSpeaker_TTS/   # Baseline TTS code
    ├── Speaker_Embeddings/  # GE2E extraction scripts
    ├── datasets.txt         # Training data references
    ├── requirements.txt     # Project dependencies
    └── README.md            # Documentation

## 🚧 Challenges Faced

Working with multi-stage neural TTS pipelines presented several technical hurdles:
- ⚖️ Length Mismatch: Resolving discrepancies between Mel-spectrogram frames and waveform samples during loss calculation.
- 🎧 Sampling Rate Consistency: Standardizing audio from various sources (48 kHz to 22.05 kHz) to ensure uniform feature extraction.
- 🔋 GPU Optimization: Managing the high VRAM footprint of Tacotron2 and WaveGlow, especially during concurrent training.
- 🔇 Alignment Stability: Tackling "silent outputs" or word skipping by fine-tuning the location-sensitive attention mechanism.
- 📈 Embedding Sensitivity: Ensuring training convergence by prioritizing high-quality, distinct speaker embeddings from the GE2E encoder.
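As an illustration of the first hurdle, one simple way to reconcile mel-frame and waveform-sample counts before computing a loss is to trim both to a whole number of hop-length windows. The helper below is a sketch, assuming the common Tacotron2 hop length of 256 samples:

```python
import numpy as np

def align_lengths(mel: np.ndarray, wav: np.ndarray, hop: int = 256):
    """Trim so every retained mel frame corresponds to exactly
    `hop` waveform samples."""
    n_frames = min(mel.shape[1], len(wav) // hop)
    return mel[:, :n_frames], wav[: n_frames * hop]

mel = np.zeros((80, 101))          # one stray frame
wav = np.zeros(100 * 256 + 37)     # ragged sample count
mel, wav = align_lengths(mel, wav)
print(mel.shape[1], len(wav))      # -> 100 25600
```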
## 👥 Contributors

We are a team of passionate developers exploring the intersection of Speech Synthesis and Deep Learning.
- 💻 Aryan Doshi
- 💻 Dhiraj Shirse
- 💻 Nihira Neralwar
A special thanks to our mentors for their technical guidance and support throughout the project:
- Kevin Shah
- Prasanna Kasar
- Yash Ogale
- Community of Coders (CoC) and Project X VJTI for providing the platform and resources to build this project.
## 📚 References

- 📄 Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- 📄 GE2E: Generalized End-to-End Loss for Speaker Verification
- 📄 WaveGlow: A Flow-based Generative Network for Speech Synthesis
- 📂 LJSpeech Dataset
- 📂 VoxCeleb Dataset
- 📂 VCTK Corpus Dataset
- 🛠️ NVIDIA Tacotron2 + WaveGlow PyTorch Implementation
Made with ❤️ by the Fake My Voice Team