A voice bot that joins Google Meet calls, listens to participants, generates responses via LLM (Gemini), and speaks back with a synthesized voice (ElevenLabs).
- The bot joins a Google Meet call via headless Chromium (Playwright)
- It captures participant audio through WebRTC peer connections
- A voice activity detector (VAD) flags the end of speech after 1.2 s of silence
- Audio is transcribed via ElevenLabs STT (scribe_v1)
- The transcript is sent to Gemini to generate a reply
- The reply is synthesized to audio via ElevenLabs TTS
- The audio is injected back into the meeting via WebRTC track replacement
```
Participant speaks → WebRTC capture → VAD → ElevenLabs STT
                   → Gemini LLM → ElevenLabs TTS → WebRTC playback
```
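The end-of-speech detection step above can be sketched as a small energy-based detector. This is an illustrative sketch only: the class name, frame format, and energy threshold are assumptions, not the actual implementation in `src/audio-pipeline.js` — only the 1.2 s silence threshold comes from the pipeline description.

```javascript
// Illustrative energy-based end-of-speech detector (not the real pipeline code).
// Feed it PCM frames; it reports true once per utterance, after 1.2 s of silence.
class EndOfSpeechDetector {
  constructor({ silenceMs = 1200, energyThreshold = 0.01 } = {}) {
    this.silenceMs = silenceMs;           // 1.2 s threshold from the pipeline
    this.energyThreshold = energyThreshold; // assumed RMS cutoff for "speech"
    this.silentSince = null;
    this.speaking = false;
  }

  // frame: Float32Array of PCM samples; timestampMs: capture time in ms.
  // Returns true exactly when an utterance has just ended.
  push(frame, timestampMs) {
    const rms = Math.sqrt(frame.reduce((s, x) => s + x * x, 0) / frame.length);
    if (rms >= this.energyThreshold) {
      this.speaking = true;     // voice detected: reset the silence timer
      this.silentSince = null;
      return false;
    }
    if (!this.speaking) return false;  // silence before any speech: ignore
    if (this.silentSince === null) this.silentSince = timestampMs;
    if (timestampMs - this.silentSince >= this.silenceMs) {
      this.speaking = false;    // 1.2 s of silence: utterance is over
      this.silentSince = null;
      return true;              // caller would now send the buffer to STT
    }
    return false;
  }
}
```

In the real bot, a `true` result would trigger the ElevenLabs STT call on the buffered audio.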
- Node.js >= 18
- A Google account (to bypass the waiting room on your own meetings)
- API keys: ElevenLabs + Google Gemini
```bash
npm install
npx playwright install chromium
```

Copy the example env file and fill in your keys:

```bash
cp .env.example .env
```

| Variable | Description |
|---|---|
| `ELEVENLABS_API_KEY` | ElevenLabs API key |
| `ELEVENLABS_VOICE_ID` | Voice ID to use for TTS |
| `ELEVENLABS_TTS_MODEL` | TTS model (default: `eleven_turbo_v2_5`) |
| `GEMINI_API_KEY` | Google Gemini API key |
| `GEMINI_MODEL` | Gemini model (default: `gemini-2.5-flash-lite`) |
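A filled-in `.env` might look like this (the key values are placeholders; the model names are the defaults from the table above):

```shell
ELEVENLABS_API_KEY=your_elevenlabs_api_key
ELEVENLABS_VOICE_ID=your_voice_id
ELEVENLABS_TTS_MODEL=eleven_turbo_v2_5
GEMINI_API_KEY=your_gemini_api_key
GEMINI_MODEL=gemini-2.5-flash-lite
```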
To let the bot join meetings without being stuck in the waiting room, log into a Google account:
```bash
npm run login
```

A browser window opens. Sign into your Google account, then close the browser. The session is persisted in `chrome-profile/` and reused by the headless bot.
```bash
npm start
```

The server starts on http://localhost:3000.
```http
POST /join
Content-Type: application/json

{
  "meetId": "abc-defg-hij",
  "botName": "My Agent",
  "timeout": 60000
}
```
`meetId` accepts a bare Google Meet ID or a full meeting URL. The request returns immediately (202):
```json
{ "sessionId": "uuid", "meetUrl": "https://...", "status": "joining" }
```

GET /session/:sessionId
Possible statuses: `idle` → `joining` → `waiting_room` → `in_meeting`, ending in `closed` (or `error` on failure).
DELETE /session/:sessionId
GET /sessions
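Since `/join` accepts either a bare meeting ID or a full URL, a client might normalize its input before calling the API. The `toMeetUrl` helper below is illustrative only — it is not part of the server, and the server does its own normalization:

```javascript
// Illustrative helper: turn either a bare Meet ID ("abc-defg-hij") or a full
// URL ("https://meet.google.com/abc-defg-hij") into a canonical meeting URL.
function toMeetUrl(meetId) {
  const id = meetId.includes("/")
    ? new URL(meetId).pathname.split("/").pop() // full URL: take the last path segment
    : meetId;                                   // already a bare ID
  // Google Meet IDs follow the xxx-xxxx-xxx lowercase-letter pattern.
  if (!/^[a-z]{3}-[a-z]{4}-[a-z]{3}$/.test(id)) {
    throw new Error(`Not a Google Meet ID: ${id}`);
  }
  return `https://meet.google.com/${id}`;
}

// Usage against the /join endpoint (Node 18+ global fetch):
// await fetch("http://localhost:3000/join", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({ meetId: toMeetUrl("abc-defg-hij"), botName: "My Agent" }),
// });
```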
```bash
# Join
SESSION=$(curl -s -X POST http://localhost:3000/join \
  -H "Content-Type: application/json" \
  -d '{"meetId":"abc-defg-hij","botName":"TestBot"}' | jq -r .sessionId)

# Poll status
curl -s http://localhost:3000/session/$SESSION

# Leave
curl -X DELETE http://localhost:3000/session/$SESSION
```

```
src/
  server.js           # Express server (REST API, session management)
  meet-bot.js         # Google Meet automation (Playwright, WebRTC)
  audio-pipeline.js   # VAD, STT, TTS, Gemini LLM
  login.js            # Google login helper (persistent Chrome profile)
avatar.png            # Reference avatar image
.env.example          # Environment variables template
```
An earlier version of this project included real-time lip sync using MuseTalk. The setup was:
- A separate Python FastAPI server (`lipsync/server.py`) running on port 8765
- MuseTalk models (VAE, UNet, Whisper, face parsing) loaded on GPU/MPS/CPU
- The Node.js bot would POST the TTS audio to `/generate` and receive back NDJSON-streamed mouth region crops (JPEG sprites at 25 fps)
- The browser would composite the static avatar image and the animated mouth crops on a canvas, then inject the resulting video track into WebRTC
Pipeline with lip sync:
```
TTS audio → MuseTalk server → mouth crops (25fps NDJSON stream)
                                    ↓
Static avatar (fetched from /avatar) + mouth overlay → Canvas → WebRTC video track
```
This was removed to simplify the setup (no Python/GPU dependency). The current version is audio-only. The lip sync code can be found in the git history if needed.