“Clone any voice. Speak in anyone’s tone.”
Fake My Voice is a deep learning–based project focused on voice cloning and speaker-conditioned speech synthesis. The system generates realistic speech that sounds like a specific person using only a few seconds of their audio.
By combining Speaker Embeddings, Sequence-to-Sequence Synthesis, and Neural Vocoding, we achieve high-quality, human-like output for multi-speaker Text-to-Speech (TTS).
- 🌟 About the Project
- 🎯 Aim
- 🏗️ Architecture
- 🛠️ Tech Stack
- 📊 Dataset
- 📈 Results & Demos
- 📁 File Structure
- 🚧 Challenges Faced
- 👥 Contributors
## 🌟 About the Project

- 🎧 Voice Cloning: Replicate a target speaker's voice with minimal reference audio.
- 🤖 Deep Learning Pipeline: Utilizes state-of-the-art models like Tacotron2 and WaveGlow.
- 🔊 High Quality: Combines speaker-conditioned synthesis with neural vocoding for natural results.
## 🎯 Aim

- Build a multi-speaker system capable of replicating tone, accent, and style.
- Implement a unified pipeline consisting of Tacotron2, GE2E Speaker Encoder, and WaveGlow.
- Produce natural speech from text with minimal data per speaker.
## 🏗️ Architecture

- Extract: Derive a speaker embedding from reference audio.
- Synthesize: Use embedding + text to generate a Mel-spectrogram.
- Vocode: Feed the Mel-spectrogram into a vocoder for waveform generation.
- Output: Deliver speech mimicking the original speaker.
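The four stages above can be sketched end-to-end. Every function below is a hypothetical stand-in with toy NumPy shapes, not the real models; the point is only to show how the pieces hand data to each other:

```python
import numpy as np

def extract_embedding(ref_audio: np.ndarray) -> np.ndarray:
    """Stand-in for the GE2E speaker encoder: map reference audio
    to a fixed 256-D identity vector (a toy random projection here)."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((ref_audio.shape[0], 256))
    emb = ref_audio @ proj
    return emb / np.linalg.norm(emb)

def synthesize_mel(text: str, embedding: np.ndarray) -> np.ndarray:
    """Stand-in for Tacotron2: produce an 80-band mel-spectrogram
    whose frame count grows with the input text."""
    n_frames = 10 * len(text)
    return np.zeros((80, n_frames)) + embedding[:80, None]

def vocode(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Stand-in for WaveGlow: `hop` waveform samples per mel frame."""
    return np.zeros(mel.shape[1] * hop)

ref = np.ones(16000)                # 1 s of dummy reference audio
emb = extract_embedding(ref)        # Extract
mel = synthesize_mel("hello", emb)  # Synthesize
wav = vocode(mel)                   # Vocode
print(emb.shape, mel.shape, wav.shape)
```

In the real pipeline the embedding is concatenated with the Tacotron2 encoder outputs, but the data flow is exactly this chain.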
## 🛠️ Tech Stack

| Category | Tools / Frameworks |
|---|---|
| Language | Python 🐍 |
| Frameworks | PyTorch, NumPy, Librosa |
| Audio Tools | Torchaudio, SoundFile, Matplotlib |
| Models | GE2E Encoder, Tacotron2, WaveGlow |
| Visualization | TensorBoard 📈 |
### Speaker Encoder (GE2E)

- Extracts a fixed-dimensional embedding (256-D) capturing vocal identity.
- Uses Generalized End-to-End (GE2E) loss to cluster same-speaker embeddings closely and separate different speakers.
- Input: Short raw audio samples.
- Output: Speaker embedding vector.
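A minimal NumPy sketch of the GE2E softmax loss follows. It is simplified: the paper also defines a contrast variant and excludes each utterance from its own speaker's centroid, both omitted here for clarity. The scale `w` and bias `b` are learned parameters in the real encoder; fixed values are assumed below:

```python
import numpy as np

def ge2e_softmax_loss(embeds: np.ndarray, w: float = 10.0, b: float = -5.0) -> float:
    """Simplified GE2E softmax loss over a batch shaped
    (n_speakers, n_utterances, dim): each utterance should be most
    similar to its own speaker's centroid."""
    embeds = embeds / np.linalg.norm(embeds, axis=-1, keepdims=True)
    centroids = embeds.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=-1, keepdims=True)
    # scaled cosine similarity of every utterance to every centroid
    sim = w * np.einsum("sud,kd->suk", embeds, centroids) + b
    # log-softmax over centroids, then pick the true-speaker entry
    log_p = sim - np.log(np.exp(sim).sum(axis=2, keepdims=True))
    n_spk = embeds.shape[0]
    return float(-log_p[np.arange(n_spk), :, np.arange(n_spk)].mean())

# Tightly clustered, well-separated speakers -> near-zero loss
tight = np.repeat(np.eye(3)[:, None, :], 4, axis=1)  # 3 speakers, 4 utts each
print(ge2e_softmax_loss(tight))
```

Minimizing this loss is what pulls same-speaker embeddings together and pushes different speakers apart.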
### Synthesizer (Tacotron2)

- Converts text to Mel-spectrograms conditioned on the speaker embedding.
- Encoder: Processes phoneme/character embeddings with convolutional and recurrent layers.
- Location-Sensitive Attention: Ensures smooth, monotonic progression and prevents word-skipping.
- Decoder: Autoregressively generates frames using an $L_1$ loss.
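The decoder's objective on mel frames is just the mean absolute error between predicted and ground-truth spectrograms; a toy check (80 mel bands assumed):

```python
import numpy as np

def mel_l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between predicted and ground-truth
    mel-spectrogram frames."""
    assert pred.shape == target.shape, "frame counts must match"
    return float(np.abs(pred - target).mean())

target = np.ones((80, 100))        # 80 mel bands x 100 frames
pred = np.full((80, 100), 0.75)
print(mel_l1_loss(pred, target))   # -> 0.25
```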
### Vocoder (WaveGlow)

- Converts high-resolution Mel-spectrograms into time-domain waveforms.
- Input: Mel-spectrogram from Tacotron2.
- Output: Final waveform audio in the cloned voice.
## 📊 Dataset

| Dataset | Purpose | Details |
|---|---|---|
| LJSpeech | Tacotron2 | ~13k samples, Single Speaker |
| VoxCeleb | Speaker Embeddings | ~5k samples, 40 English Speakers |
| VCTK Corpus | Merged Model | 44 Hours, 109 English Speakers |
Note: Audio was preprocessed by resampling from 48 kHz to 22.05 kHz, normalizing, and truncating clips to $\le 4$ seconds.
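That preprocessing can be sketched as follows. The linear-interpolation resampler is a deliberate simplification; a proper resampler (e.g. `librosa.resample`) applies an anti-aliasing filter before downsampling:

```python
import numpy as np

def preprocess(audio: np.ndarray, sr_in: int = 48000,
               sr_out: int = 22050, max_sec: float = 4.0) -> np.ndarray:
    """Resample (naive linear interpolation), peak-normalize,
    and truncate to at most max_sec seconds."""
    n_out = int(len(audio) * sr_out / sr_in)
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    audio = np.interp(t_out, t_in, audio)
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak
    return audio[: int(max_sec * sr_out)]

clip = np.sin(np.linspace(0, 2 * np.pi * 440 * 6, 48000 * 6))  # 6 s, 48 kHz
out = preprocess(clip)
print(len(out) / 22050)  # -> 4.0 (seconds)
```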
## 📈 Results & Demos

Achieved smooth Mel-spectrogram prediction after ~60 epochs.
Text: "Printing, in the only sense with which we are at present concerned"
| Expected Mel | Predicted Mel | Predicted Frames |
|---|---|---|
| ![]() | ![]() | ![]() |
Achieved stable separation across 40+ speakers. t-SNE plots show distinct identity clustering.

Download/Listen to Audio Sample
The audio says: "Use this model to clone the voice of any user"
## 📁 File Structure

    Fake_My_Voice/
    ├── Multi-Speaker-TTS/   # Multi-speaker synthesis logic
    ├── SingleSpeaker_TTS/   # Baseline TTS code
    ├── Speaker_Embeddings/  # GE2E extraction scripts
    ├── datasets.txt         # Training data references
    ├── requirements.txt     # Project dependencies
    └── README.md            # Documentation

## 🚧 Challenges Faced

Working with multi-stage neural TTS pipelines presented several technical hurdles:
- ⚖️ Length Mismatch: Resolving discrepancies between Mel-spectrogram frames and waveform samples during loss calculation.
- 🎧 Sampling Rate Consistency: Standardizing audio from various sources (48 kHz to 22.05 kHz) to ensure uniform feature extraction.
- 🔋 GPU Optimization: Managing the high VRAM footprint of Tacotron2 and WaveGlow, especially during concurrent training.
- 🔇 Alignment Stability: Tackling "silent outputs" or word skipping by fine-tuning the location-sensitive attention mechanism.
- 📈 Embedding Sensitivity: Ensuring training convergence by prioritizing high-quality, distinct speaker embeddings from the GE2E encoder.
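As an illustration of the first hurdle, one simple way to reconcile mel-frame and waveform-sample counts before computing a loss is to trim both to a whole number of hop-length windows. The helper below is a sketch, assuming the common Tacotron2 hop length of 256 samples:

```python
import numpy as np

def align_lengths(mel: np.ndarray, wav: np.ndarray, hop: int = 256):
    """Trim so every retained mel frame corresponds to exactly
    `hop` waveform samples."""
    n_frames = min(mel.shape[1], len(wav) // hop)
    return mel[:, :n_frames], wav[: n_frames * hop]

mel = np.zeros((80, 101))          # one stray frame
wav = np.zeros(100 * 256 + 37)     # ragged sample count
mel, wav = align_lengths(mel, wav)
print(mel.shape[1], len(wav))      # -> 100 25600
```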
## 👥 Contributors

We are a team of passionate developers exploring the intersection of Speech Synthesis and Deep Learning.
- 💻 Aryan Doshi
- 💻 Dhiraj Shirse
- 💻 Nihira Neralwar
A special thanks to our mentors for their technical guidance and support throughout the project:
- Kevin Shah
- Prasanna Kasar
- Yash Ogale
- Community of Coders (CoC) and Project X VJTI for providing the platform and resources to build this project.
## 📚 References

- 📄 Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- 📄 GE2E: Generalized End-to-End Loss for Speaker Verification
- 📄 WaveGlow: A Flow-based Generative Network for Speech Synthesis
- 📂 LJSpeech Dataset
- 📂 VoxCeleb Dataset
- 📂 VCTK Corpus Dataset
- 🛠️ NVIDIA Tacotron2 + WaveGlow PyTorch Implementation
Made with ❤️ by the Fake My Voice Team