feat(voice): add voice mode proof-of-concept with Live API integration #20923

Closed
himanshu748 wants to merge 1 commit into google-gemini:main from himanshu748:feat/voice-mode-poc

himanshu748 commented Mar 3, 2026

Summary

Proof-of-concept implementation of voice mode for Gemini CLI, demonstrating the integration pattern for the Gemini Live API's bidirectional WebSocket streaming. This lays the groundwork for hands-free multimodal voice interaction (GSoC 2026 Project 11).

Related to #18067

What's included

Core module (packages/core/src/voice/)

  • VoiceService class wrapping @google/genai Live API (GoogleGenAI.live.connect())
  • Full client→server message support: sendText(), sendAudio(), sendAudioStreamEnd(), sendToolResponse(), sendInterrupt()
  • Typed event emitter for all server→client messages: text responses, audio chunks, input/output transcriptions, tool calls, tool call cancellations, goAway, state changes
  • State machine: IDLE → CONNECTING → CONNECTED → LISTENING ↔ RESPONDING
  • VoiceConfig with sensible defaults (model, response modality, voice, VAD, sample rates)
  • buildSpeechConfig() helper for SDK speech configuration
  • 54 unit tests covering lifecycle, messaging, error handling, and event dispatch
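
The state machine listed above can be sketched as a transition table. This is a minimal illustration, not the code in this PR: the `VoiceStateMachine` class and `canTransition` helper are hypothetical names, and the edges back to `IDLE` (for disconnects or errors) are an assumption, since the description only shows the forward path.

```typescript
// Hypothetical state names mirroring the PR description; the real type may differ.
type VoiceState =
  | "IDLE"
  | "CONNECTING"
  | "CONNECTED"
  | "LISTENING"
  | "RESPONDING";

// Allowed transitions. The edges back to IDLE are an assumption.
const TRANSITIONS: Record<VoiceState, readonly VoiceState[]> = {
  IDLE: ["CONNECTING"],
  CONNECTING: ["CONNECTED", "IDLE"],
  CONNECTED: ["LISTENING", "IDLE"],
  LISTENING: ["RESPONDING", "IDLE"],
  RESPONDING: ["LISTENING", "IDLE"],
};

function canTransition(from: VoiceState, to: VoiceState): boolean {
  return TRANSITIONS[from].includes(to);
}

class VoiceStateMachine {
  private current: VoiceState = "IDLE";

  get state(): VoiceState {
    return this.current;
  }

  // Moves to the next state, or throws if the edge is not in the table.
  transition(to: VoiceState): void {
    if (!canTransition(this.current, to)) {
      throw new Error(`Invalid transition: ${this.current} -> ${to}`);
    }
    this.current = to;
  }
}
```

Keeping the edges in one table makes illegal jumps (for example `IDLE` straight to `RESPONDING`) fail loudly instead of leaving the UI in an inconsistent state.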

CLI integration (packages/cli/)

  • VoiceMode Ink component: state indicator, ASCII waveform visualization, transcription display, keyboard controls (ESC to exit, m to mute, Space to interrupt)
  • /voice slash command using OpenCustomDialogActionReturn pattern (same pattern as /hooks)
  • Wired into BuiltinCommandLoader
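
For readers curious how an ASCII waveform might be rendered, here is a rough sketch. It is not the `VoiceMode` component's actual code: `renderWaveform` and the `LEVELS` character ramp are hypothetical, and it assumes amplitudes have already been normalized to the range 0 to 1.

```typescript
// Hypothetical character ramp from quiet to loud (ASCII only).
const LEVELS = " .:-=+*#";

// Maps each normalized amplitude (assumed 0..1) to one character.
// Out-of-range and non-finite values are clamped rather than rejected.
function renderWaveform(amplitudes: readonly number[]): string {
  return amplitudes
    .map((a) => {
      const clamped = Math.min(1, Math.max(0, Number.isFinite(a) ? a : 0));
      const index = Math.min(
        LEVELS.length - 1,
        Math.floor(clamped * LEVELS.length),
      );
      return LEVELS.charAt(index);
    })
    .join("");
}
```

The resulting string can be dropped into a single Ink `Text` element and re-rendered as new audio levels arrive.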

What's NOT included (future work)

  • Audio I/O: Platform-specific microphone/speaker bindings (e.g., naudiodon/PortAudio) are not integrated. The VoiceService backend already handles WebSocket streaming; it only needs audio bytes piped in.
  • Tool execution bridge: The TOOL_CALL events are emitted but not yet routed to the existing ToolRegistry
  • Session resumption: The Live API supports sessionResumption but it's not wired up yet
  • Settings integration: No voice key in settings schema yet
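
To make the missing audio I/O piece more concrete, here is a sketch of the conversion step that would sit between a capture library and `sendAudio()`. It rests on several assumptions: the capture library yields `Float32Array` samples in the range -1 to 1, the Live API takes raw 16-bit little-endian PCM as base64 with an `audio/pcm;rate=...` MIME type (my reading of the Live API docs), and 16000 Hz is the input rate. `floatTo16BitPcm` and `toAudioChunk` are hypothetical helpers, and the exact argument shape of `sendAudio()` in this PR is not shown here.

```typescript
// Converts float samples (assumed range -1..1) to 16-bit little-endian PCM bytes.
function floatTo16BitPcm(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
    const v = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(i * 2, Math.round(v), true); // true = little-endian
  }
  return out;
}

// Packs PCM bytes as base64 plus a MIME type. The 16000 Hz default is an assumption.
function toAudioChunk(
  samples: Float32Array,
  sampleRate = 16000,
): { data: string; mimeType: string } {
  const pcm = floatTo16BitPcm(samples);
  return {
    data: Buffer.from(pcm).toString("base64"),
    mimeType: `audio/pcm;rate=${sampleRate}`,
  };
}
```

A microphone binding would call `toAudioChunk()` on each captured frame and hand the result to the service.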

Architecture decisions

  1. Live API over REST+STT: Uses the native bidirectional WebSocket API (BidiGenerateContent) rather than bolting STT/TTS onto the existing HTTP streaming pipeline. This gives us server-side VAD, barge-in support, and lower latency.

  2. Typed event emitter: Rather than extending EventEmitter (which loses type safety), VoiceService uses a composition pattern with typed on<E>() / off<E>() methods and a VoiceEventMap interface.

  3. LiveServerMessage typed parameter: handleServerMessage() accepts the SDK's LiveServerMessage type directly (from onmessage callback), avoiding as casts and eslint-disable suppressions.

  4. Custom dialog pattern: The /voice command follows the established OpenCustomDialogActionReturn pattern (same as /hooks), so it integrates cleanly with the existing command system.
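
As an illustration of decision 2, the composition pattern could look roughly like the sketch below. It is a simplified stand-in rather than the PR's code: the `VoiceEvents` class name, the two events, and their payload shapes are hypothetical, and the real `VoiceEventMap` covers more events.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical, reduced event map; the real VoiceEventMap has more entries.
interface VoiceEventMap {
  text: { text: string };
  stateChange: { from: string; to: string };
}

type Listener<E extends keyof VoiceEventMap> = (
  payload: VoiceEventMap[E],
) => void;

// Wraps a private EventEmitter instead of extending it, so callers only see
// the typed on/off/emit surface and cannot register untyped listeners.
class VoiceEvents {
  private readonly emitter = new EventEmitter();

  on<E extends keyof VoiceEventMap>(event: E, listener: Listener<E>): void {
    this.emitter.on(event, listener);
  }

  off<E extends keyof VoiceEventMap>(event: E, listener: Listener<E>): void {
    this.emitter.off(event, listener);
  }

  emit<E extends keyof VoiceEventMap>(
    event: E,
    payload: VoiceEventMap[E],
  ): void {
    this.emitter.emit(event, payload);
  }
}
```

With this shape, `events.on("text", (p) => p.text)` type-checks, while a misspelled event name or a wrong payload field is rejected at compile time.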

Testing

npx vitest run packages/core/src/voice/voice-service.test.ts
# 54 tests passing

…oice command

Implement a proof-of-concept voice mode for the Gemini CLI demonstrating
the integration pattern for the Gemini Live API's bidirectional WebSocket
streaming. This lays the groundwork for hands-free multimodal voice interaction.

Core module (packages/core/src/voice/):
- VoiceService class wrapping @google/genai Live API session management
- Full client-to-server message support (text, audio, tool responses, interrupts)
- Typed event emitter for all server-to-client messages (text, audio,
  transcriptions, tool calls, cancellations, goAway, state changes)
- State machine (IDLE -> CONNECTING -> CONNECTED -> LISTENING/RESPONDING)
- 54 unit tests covering lifecycle, messaging, error handling, and events

CLI integration (packages/cli/):
- VoiceMode Ink component with state display, ASCII waveform, transcriptions
- /voice slash command using OpenCustomDialogActionReturn pattern
- Wired into BuiltinCommandLoader

Related to google-gemini#18067

github-actions bot commented Mar 3, 2026

You already have 7 pull requests open. Please work on getting existing PRs merged before opening more.

@anowardear062-svg

Thanks for the update
