feat(voice): add voice mode proof-of-concept with Live API integration by himanshu748 · Pull Request #20923 · google-gemini/gemini-cli

himanshu748 · 2026-03-03T05:38:08Z

Summary

Proof-of-concept implementation of voice mode for Gemini CLI, demonstrating the integration pattern for the Gemini Live API's bidirectional WebSocket streaming. This lays the groundwork for hands-free multimodal voice interaction (GSoC 2026 Project 11).

Related to #18067

What's included

Core module (`packages/core/src/voice/`)

- VoiceService class wrapping @google/genai Live API (GoogleGenAI.live.connect())
- Full client→server message support: sendText(), sendAudio(), sendAudioStreamEnd(), sendToolResponse(), sendInterrupt()
- Typed event emitter for all server→client messages: text responses, audio chunks, input/output transcriptions, tool calls, tool call cancellations, goAway, state changes
- State machine: IDLE → CONNECTING → CONNECTED → LISTENING ↔ RESPONDING
- VoiceConfig with sensible defaults (model, response modality, voice, VAD, sample rates)
- buildSpeechConfig() helper for SDK speech configuration
- 54 unit tests covering lifecycle, messaging, error handling, and event dispatch
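
For reviewers, the state machine listed above could be modeled as a transition table. This is a minimal sketch, not the code in this PR: the state names come from the description, while `transitions`, `canTransition`, `VoiceStateMachine`, and the rule that any non-idle state may fall back to IDLE are assumptions made for illustration.

```typescript
// Hypothetical sketch of the state machine named in the PR description.
// Only the five state names come from the PR; the rest is assumed.
type VoiceState =
  | 'IDLE'
  | 'CONNECTING'
  | 'CONNECTED'
  | 'LISTENING'
  | 'RESPONDING';

// Allowed transitions. Falling back to IDLE (disconnect or error) is an
// assumption; the PR only lists the forward path.
const transitions: Record<VoiceState, readonly VoiceState[]> = {
  IDLE: ['CONNECTING'],
  CONNECTING: ['CONNECTED', 'IDLE'],
  CONNECTED: ['LISTENING', 'IDLE'],
  LISTENING: ['RESPONDING', 'IDLE'],
  RESPONDING: ['LISTENING', 'IDLE'],
};

function canTransition(from: VoiceState, to: VoiceState): boolean {
  return transitions[from].indexOf(to) !== -1;
}

class VoiceStateMachine {
  private current: VoiceState = 'IDLE';

  get state(): VoiceState {
    return this.current;
  }

  // Moves to the next state, rejecting anything not in the table.
  transition(to: VoiceState): void {
    if (!canTransition(this.current, to)) {
      throw new Error(`Invalid transition: ${this.current} -> ${to}`);
    }
    this.current = to;
  }
}
```

Keeping the table as data makes the LISTENING ↔ RESPONDING loop explicit and lets invalid jumps (for example IDLE straight to RESPONDING) fail loudly in tests.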

CLI integration (`packages/cli/`)

- VoiceMode Ink component: state indicator, ASCII waveform visualization, transcription display, keyboard controls (ESC to exit, m to mute, Space to interrupt)
- /voice slash command using OpenCustomDialogActionReturn pattern (same pattern as /hooks)
- Wired into BuiltinCommandLoader
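
The description does not show how the VoiceMode component draws its waveform. One way an ASCII waveform could be produced from audio samples is sketched below, assuming 16-bit signed PCM input. `renderWaveform` and `LEVELS` are hypothetical names, not identifiers from this PR.

```typescript
// Hypothetical ASCII waveform renderer; not taken from the PR.
// Characters ordered from quietest to loudest.
const LEVELS = ' .:-=+*#';

// Renders at most `width` characters, one per bucket of samples, where each
// character reflects the peak absolute amplitude in that bucket.
function renderWaveform(samples: Int16Array, width: number): string {
  if (width <= 0 || samples.length === 0) return '';
  const bucketSize = Math.ceil(samples.length / width);
  let out = '';
  for (let start = 0; start < samples.length; start += bucketSize) {
    const end = Math.min(start + bucketSize, samples.length);
    let peak = 0;
    for (let i = start; i < end; i++) {
      peak = Math.max(peak, Math.abs(samples[i] ?? 0));
    }
    // Scale the peak (0..32768) to an index into LEVELS, clamped to the top.
    const level = Math.min(
      LEVELS.length - 1,
      Math.floor((peak / 32768) * LEVELS.length),
    );
    out += LEVELS.charAt(level);
  }
  return out;
}
```

The resulting string could be placed in an Ink `<Text>` element and recomputed whenever a new audio chunk arrives.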

What's NOT included (future work)

- Audio I/O: Platform-specific microphone/speaker bindings (e.g., naudiodon/PortAudio) are not integrated. The VoiceService backend is fully functional for WebSocket streaming; it just needs audio bytes piped in.
- Tool execution bridge: TOOL_CALL events are emitted but not yet routed to the existing ToolRegistry.
- Session resumption: The Live API supports sessionResumption, but it is not wired up yet.
- Settings integration: There is no voice key in the settings schema yet.
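
To illustrate what "audio bytes piped in" might involve once a microphone binding exists, the sketch below splits a raw PCM buffer into fixed-duration chunks. It assumes 16-bit mono PCM; `chunkPcm` and `BYTES_PER_SAMPLE` are hypothetical and do not appear in this PR, and the exact argument format expected by sendAudio() is not shown in the description.

```typescript
// Hypothetical helper, not part of the PR: splits raw PCM into chunks of a
// fixed duration so they can be streamed one at a time.
const BYTES_PER_SAMPLE = 2; // assumes 16-bit mono PCM

function chunkPcm(
  pcm: Uint8Array,
  sampleRate: number,
  chunkMs: number,
): Uint8Array[] {
  // Whole samples per chunk, converted to bytes.
  const bytesPerChunk =
    Math.floor((sampleRate * chunkMs) / 1000) * BYTES_PER_SAMPLE;
  if (bytesPerChunk <= 0) {
    throw new Error('chunk size must be positive');
  }
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerChunk) {
    // subarray creates a view, so no audio data is copied here.
    chunks.push(
      pcm.subarray(offset, Math.min(offset + bytesPerChunk, pcm.length)),
    );
  }
  return chunks;
}
```

At a 16000 Hz sample rate and 100 ms per chunk, each full chunk is 3200 bytes; the final chunk may be shorter. Each chunk would then be handed to the service's audio-sending method, followed by sendAudioStreamEnd() when capture stops.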

Architecture decisions

- Live API over REST+STT: Uses the native bidirectional WebSocket API (BidiGenerateContent) rather than bolting STT/TTS onto the existing HTTP streaming pipeline. This gives us server-side VAD, barge-in support, and lower latency.
- Typed event emitter: Rather than extending EventEmitter (which loses type safety), VoiceService uses a composition pattern with typed on<E>() / off<E>() methods and a VoiceEventMap interface.
- LiveServerMessage typed parameter: handleServerMessage() accepts the SDK's LiveServerMessage type directly (from the onmessage callback), avoiding `as` casts and `eslint-disable` suppressions.
- Custom dialog pattern: The /voice command follows the established OpenCustomDialogActionReturn pattern (same as /hooks), so it integrates cleanly with the existing command system.
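
The typed event emitter decision can be illustrated with a small composition-based emitter keyed by an event map. This is only a sketch of the general pattern: `TypedEmitter`, `DemoEventMap`, and the two event names are hypothetical, and the PR's actual VoiceEventMap and on<E>() / off<E>() signatures may differ.

```typescript
// Hypothetical event map for illustration; the PR's VoiceEventMap is not
// shown in the description and likely has more entries.
interface DemoEventMap {
  text: { text: string };
  stateChange: { from: string; to: string };
}

type ErasedListener = (payload: unknown) => void;

// Composition-based emitter: it owns a Map of listeners instead of
// extending Node's EventEmitter, so event names and payloads are checked.
class TypedEmitter<M> {
  private readonly listeners = new Map<keyof M, Set<ErasedListener>>();

  on<E extends keyof M>(event: E, listener: (payload: M[E]) => void): void {
    let set = this.listeners.get(event);
    if (!set) {
      set = new Set<ErasedListener>();
      this.listeners.set(event, set);
    }
    // The only place types are erased; the public API stays fully typed.
    set.add(listener as ErasedListener);
  }

  off<E extends keyof M>(event: E, listener: (payload: M[E]) => void): void {
    const set = this.listeners.get(event);
    if (set) set.delete(listener as ErasedListener);
  }

  emit<E extends keyof M>(event: E, payload: M[E]): void {
    const set = this.listeners.get(event);
    if (set) set.forEach((listener) => listener(payload));
  }
}
```

With this shape, `emitter.on('text', (p) => p.text)` infers the payload type from the map, and a misspelled event name or a mismatched payload is a compile-time error rather than a silent runtime no-op.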

Testing

npx vitest run packages/core/src/voice/voice-service.test.ts
# 54 tests passing

…oice command

Implement a proof-of-concept voice mode for the Gemini CLI demonstrating the integration pattern for the Gemini Live API's bidirectional WebSocket streaming. This lays the groundwork for hands-free multimodal voice interaction.

Core module (packages/core/src/voice/):
- VoiceService class wrapping @google/genai Live API session management
- Full client-to-server message support (text, audio, tool responses, interrupts)
- Typed event emitter for all server-to-client messages (text, audio, transcriptions, tool calls, cancellations, goAway, state changes)
- State machine (IDLE -> CONNECTING -> CONNECTED -> LISTENING/RESPONDING)
- 54 unit tests covering lifecycle, messaging, error handling, and events

CLI integration (packages/cli/):
- VoiceMode Ink component with state display, ASCII waveform, transcriptions
- /voice slash command using OpenCustomDialogActionReturn pattern
- Wired into BuiltinCommandLoader

Related to google-gemini#18067

github-actions · 2026-03-03T05:38:25Z

You already have 7 pull requests open. Please work on getting existing PRs merged before opening more.

anowardear062-svg · 2026-03-04T13:43:39Z

Thanks for the update

github-actions bot closed this Mar 3, 2026

github-actions bot mentioned this pull request Mar 3, 2026

📊 AI CLI Tools Community Daily Report 2026-03-03 duanyytop/agents-radar#60

Closed

anowardear062-svg approved these changes Mar 4, 2026

himanshu748 wants to merge 1 commit into google-gemini:main from himanshu748:feat/voice-mode-poc


2 participants
