[RFC] Architecture Proposal: Hands-Free Multimodal Voice Mode (GSoC 2026) #20456
Replies: 6 comments
-
Please take a look at the work going on by @fayerman-source on #18067
-
Thanks for the pointer to #18499! Your Approach (REST-based):
My GSoC Proposal (Native Live API):
The key distinction is that @fayerman-source's implementation provides voice INPUT (transcription only), while the Live API approach enables a complete hands-free coding experience with both input AND output. I'm not suggesting one is better: both have their place. The REST approach is simpler and works offline with Whisper. The Live API approach enables the "JARVIS-like" experience described in the GSoC project goals. Would love to hear your thoughts on which direction the team prefers for the long-term vision!
-
From my point of view, the proposal gets stronger when the orchestration boundary is treated as the primary problem and the audio transport as an implementation detail under that boundary. Session transitions, interruption semantics, tool execution during speech, and fallback behavior are what will decide whether this feels natural in practice. If those pieces are stable, the exact input and output stack becomes much easier to evolve later.
-
@aniruddhaadak80 You raise a point I hadn't emphasized enough in my proposal, and you're completely right.
That's exactly why I'm drawn to the Live API: not for the audio itself, but because it gives me native primitives for these problems.
Get the orchestration right, and the audio layer becomes swappable.
Same orchestration, different transport underneath.
-
@jacob314, @bdmorgan! Does the new list of projects include
-
What if we use the sandbox just like nodes in OpenClaw?

-
Hello, @bdmorgan!
I am Vivek, an open-source contributor to Google-owned codebases (Cirq, KerasHub). My recent work includes Cirq (#7810, #7843, #7790) and KerasHub (#2525). I'm preparing a GSoC 2026 proposal for Project #11 and wanted to share my approach to this large project before submitting the idea.
1. Motivation and Problem Statement
The Gemini CLI currently supports only text-based interaction, requiring developers to type queries and read responses.
Several community PRs (#1982, #6929, #18499) and the voice-mode MCP attempt to address this gap, but they all share a fundamental architectural limitation: they pipe audio through Whisper/OpenAI STT, inject the resulting text into the CLI prompt, and read responses via a separate TTS service. This creates a multi-layered system with inherent latency, external API dependencies, and no support for true conversational barge-in.
The GSoC project description explicitly requires "Gemini's native multimodal audio capabilities", specifically `client.aio.live.connect()` with `BidiGenerateContent` streaming. This is a fundamentally different architecture that eliminates the intermediate transcription step and enables native interruption (barge-in) support.

2. Technical Architecture
2.1 Core Service: VoiceModeService
The voice functionality will be implemented as a dedicated service in `packages/core/src/voice/VoiceModeService.ts`. This service maintains a completely separate Live API WebSocket session from the existing HTTP-based `GeminiClient`, ensuring isolation of concerns and preventing interference with the CLI's standard chat functionality.

The service will use:

- `gemini-2.0-flash-native-audio-preview` (subject to change)
- the `@google/genai` client's Live API

This was intentionally decoupled from the existing
`GeminiClient` because the Live API session has fundamentally different semantics (persistent WebSocket connection, audio-only content, different error handling) compared to the HTTP REST API.

2.2 Audio I/O Layer
Rather than relying on shell-based audio tools like SoX/ALSA that introduce brittle child process management, this implementation will use naudiodon, a Node-API (N-API) binding to PortAudio. This approach offers several advantages over spawning external processes.
The audio layer in `packages/core/src/voice/audioIO.ts` will handle microphone capture and speaker playback.

2.3 Voice Activation Modes
The implementation will support three activation modes, delivered in phases:
Push-to-Talk (Phase 1 - MVP): A configurable hotkey (default: `Cmd+Shift+V` on macOS, `Ctrl+Shift+V` on Linux/Windows) activates recording while pressed. This is the simplest implementation with the most predictable behavior, making it ideal for an MVP.

Auto-VAD Mode (Phase 2): Leverages the Live API's built-in server-side Voice Activity Detection rather than shipping a client-side Silero model or energy-threshold logic. When the server detects speech cessation, it automatically ends the input turn and begins generating a response.
Wake Word Mode (Phase 3 - if timeline permits): "Hey Gemini" activation using a lightweight client-side wake word detector. This requires careful consideration of native dependency management and power consumption.
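The three phases above could be modeled as a small discriminated union, with the Phase 1 push-to-talk loop as a tiny state machine. This is an illustrative sketch only; the names (`VoiceActivation`, `pttReduce`) are assumptions, not code from the repository.

```typescript
// Illustrative sketch (not from the codebase): the three activation modes
// as a discriminated union, plus a minimal reducer for the Phase 1
// push-to-talk loop.
type VoiceActivation =
  | { mode: "push-to-talk"; hotkey: string }
  | { mode: "auto-vad" } // Phase 2: server-side VAD ends the turn
  | { mode: "wake-word"; phrase: string }; // Phase 3, timeline permitting

type PttState = "idle" | "recording";

// Hotkey down -> start streaming mic audio; hotkey up -> end the user turn.
function pttReduce(state: PttState, event: "keydown" | "keyup"): PttState {
  if (state === "idle" && event === "keydown") return "recording";
  if (state === "recording" && event === "keyup") return "idle";
  return state; // ignore key repeats and out-of-order events
}

const activation: VoiceActivation = {
  mode: "push-to-talk",
  hotkey: process.platform === "darwin" ? "Cmd+Shift+V" : "Ctrl+Shift+V",
};
```

Keeping activation as data rather than behavior would let later phases swap in VAD or wake-word triggers without touching the session orchestration.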
2.4 Interruption Support (Barge-in)
True conversational agents require the ability to interrupt. When the agent is speaking and the user starts talking, the client sends a `BidiGenerateContentClientContent` message with `turn_complete: false` to the Live API. This aligns with the Live API's native interrupt model rather than using a separate `cancel_generation` frame. The server then interrupts its in-progress response and begins processing the new input.

This is a key differentiator from the Whisper-wrapper approaches that cannot interrupt mid-response.
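The barge-in frame described above can be sketched as a pure message builder. The shape mirrors the proposal's `BidiGenerateContentClientContent` / `turn_complete: false` description; `buildBargeInMessage` is a hypothetical helper for illustration, not an actual `@google/genai` API.

```typescript
// Illustrative sketch of the barge-in frame described above. The message
// shape mirrors the proposal's BidiGenerateContentClientContent wording;
// buildBargeInMessage is a hypothetical helper, not a @google/genai call.
interface ClientContentMessage {
  client_content: {
    turns: { role: "user"; parts: { text: string }[] }[];
    turn_complete: boolean;
  };
}

// While the agent is speaking, send the user's new input with
// turn_complete: false so the server treats it as an interruption and
// keeps the user's turn open instead of waiting for the agent to finish.
function buildBargeInMessage(partialText: string): ClientContentMessage {
  return {
    client_content: {
      turns: [{ role: "user", parts: [{ text: partialText }] }],
      turn_complete: false, // the interrupt: the user's turn is not done
    },
  };
}
```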
3. Integration Points with Existing Codebase
3.1 Slash Command Entry Point
The `/voice` slash command will serve as the primary entry point for voice mode, registered through the existing `BuiltinCommandLoader` in `packages/cli/src/ui/hooks/slashCommandProcessor.ts`. A new file `packages/cli/src/ui/commands/voiceCommand.ts` will implement the command interface.

When invoked:

- `/voice` toggles voice mode on/off
- `/voice start` explicitly starts voice mode
- `/voice stop` explicitly stops voice mode

3.2 Settings Integration
Voice configuration will be added to the existing settings schema in `packages/cli/src/config/settingsSchema.ts`. This allows users to configure voice-mode options such as the activation hotkey. The settings follow the same pattern as existing configurations like `general.vimMode` and `terminal.shell`.

3.3 Tool Integration
The existing `ToolRegistry` in `packages/core/src/tools/tool-registry.ts` will be registered on the Live session's function calling interface. This enables file operations, shell commands, grep, glob, and all other tools during voice conversations.

However, tool outputs must be formatted for speech. A new `packages/core/src/voice/responseFormatter.ts` will adapt verbose tool output for spoken delivery.

3.4 UI Components
Voice mode UI will be implemented in `packages/cli/src/ui/components/VoiceMode.tsx` using Ink's component system, built around Ink's `<Box>` component. The existing UI structure in `packages/cli/src/ui/AppContainer.tsx` will be extended with a `VoiceContext` provider to manage voice state across components.

4. Why I Chose This Architecture

In short: the Live API gives native primitives for the hard parts, most notably interruption via `turn_complete: false`, instead of bolting them onto a REST transcription pipeline.

5. Questions for Reviewers
Question 1: Sandboxed Audio Access
Audio hardware access (microphone and speakers) requires exemption from macOS Seatbelt and Docker sandboxing, since `/dev/snd` and microphone devices are blocked by default in containerized environments. Should this be handled via:

(a) A new sandbox capability flag (e.g., `--sandbox-with-audio`) that configures appropriate entitlements, or

(b) Documentation stating that voice mode requires `--no-sandbox` or a custom sandbox profile with audio device access (`--device /dev/snd` on Linux Docker)?

Option (b) is simpler but creates a security trade-off. Option (a) requires changes to the sandbox infrastructure but maintains a better security posture. Which approach aligns better with the project's security philosophy?
Question 2: Dependency Philosophy
The codebase currently uses `node-pty` as an optional dependency for terminal emulation, with graceful degradation when unavailable. Should voice mode follow the same pattern with naudiodon:

(a) Make naudiodon fully optional, with voice mode gracefully disabled when audio libraries are unavailable, or
(b) Make naudiodon a required dependency, treating voice mode as a first-class feature?
Option (a) follows the existing pattern but creates potential user confusion when voice features don't work. Option (b) provides a more consistent experience but increases the installation complexity.
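Option (a) could be sketched as a loader that swallows a failed native-module load, mirroring the `node-pty` pattern. The names (`loadOptionalAudio`, `AudioBackend`) are assumptions for illustration; the throwing loader simulates a missing or unbuilt naudiodon install.

```typescript
// Illustrative sketch of option (a), mirroring the node-pty pattern: treat
// the audio backend as optional and degrade gracefully. loadOptionalAudio
// and AudioBackend are assumed names, not codebase APIs.
type AudioBackend = { name: string };

function loadOptionalAudio(loader: () => AudioBackend): AudioBackend | null {
  try {
    return loader();
  } catch {
    // Native module missing or failed to build: disable voice mode with a
    // hint instead of crashing the CLI at startup.
    return null;
  }
}

const backend = loadOptionalAudio(() => {
  throw new Error("naudiodon not installed"); // simulated failure
});
const voiceAvailable = backend !== null; // false: /voice would print a hint
```

This would also address the confusion risk in option (a): `/voice` can still exist and explain why voice is unavailable, rather than silently disappearing.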
Question 3: Tool Output Verbosity
When the agent needs to execute tools during voice conversation (e.g., reading a file, running tests), how should verbose tool outputs be handled?
(a) Full reading: Convert all tool output to speech, potentially reading multiple paragraphs of file contents or test results
(b) Summarized: Use a secondary LLM call to summarize tool outputs before speech synthesis
(c) Selective: Always provide a one-sentence summary, then ask if user wants details ("I found the bug in src/utils/auth.ts — a missing null check on line 42. Want me to read the full context?")
Option (c) feels most natural for conversation but adds latency. What is the expected user experience for tool execution during voice mode?
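Option (c)'s "one sentence, then offer details" flow could be sketched as a pure formatter. `summarizeForSpeech` is a hypothetical helper, not the planned `responseFormatter.ts` API.

```typescript
// Illustrative sketch of option (c): a one-sentence spoken summary that
// offers details on request. summarizeForSpeech is a hypothetical helper,
// not the planned responseFormatter API.
function summarizeForSpeech(toolName: string, output: string): string {
  // Keep only the first line, truncated, so the TTS utterance stays short.
  const firstLine = (output.split("\n")[0] ?? "").trim();
  const head =
    firstLine.length > 80 ? firstLine.slice(0, 77) + "..." : firstLine;
  return `${toolName} finished: ${head}. Want me to read the full output?`;
}
```

A formatter like this adds no model latency for the summary itself; the extra LLM call of option (b) would only be needed when the user asks for details.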
Question 4: Model Selection
The Live API currently requires `gemini-2.5-flash-native-audio-preview`. Should voice mode:

(a) Use this model exclusively for voice interactions, regardless of the user's configured model for text chat, or
(b) Attempt to use the user's selected model if it supports native audio, falling back to the preview model?
If (b), how should we communicate the model difference to users who have customized their configuration?
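Option (b) plus the communication concern could be sketched as a selection function that reports when a fallback happened, so the UI can print a one-line notice. The capability predicate and helper name are assumptions for illustration.

```typescript
// Illustrative sketch of option (b): prefer the user's configured model when
// it supports native audio, else fall back to the Live preview model. The
// capability predicate and helper name are assumptions for illustration.
const LIVE_FALLBACK = "gemini-2.5-flash-native-audio-preview";

function pickVoiceModel(
  configured: string,
  supportsNativeAudio: (model: string) => boolean,
): { model: string; fellBack: boolean } {
  if (supportsNativeAudio(configured)) {
    return { model: configured, fellBack: false };
  }
  // fellBack lets the UI print a one-line notice explaining why the voice
  // session uses a different model than text chat (the Question 4 concern).
  return { model: LIVE_FALLBACK, fellBack: true };
}
```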
6. Risk Mitigation
Looking forward to your review, brutal course corrections, or thoughts on the design. Thank you!