[RFC] Architecture Proposal: Hands-Free Multimodal Voice Mode (GSoC 2026) #20456
Replies: 6 comments
-
Please take a look at the work going on by @fayerman-source on #18067
-
Thanks for the pointer to #18499! Your Approach (REST-based):
My GSoC Proposal (Native Live API):
The key distinction is that @fayerman-source's implementation provides voice INPUT (transcription only), while the Live API approach enables a complete hands-free coding experience with both input AND output. I'm not suggesting one is better: both have their place. The REST approach is simpler and works offline with Whisper. The Live API approach enables the "JARVIS-like" experience described in the GSoC project goals. Would love to hear your thoughts on which direction the team prefers for the long-term vision!
-
From my point of view, the proposal gets stronger when the orchestration boundary is treated as the primary problem and the audio transport as an implementation detail under that boundary. Session transitions, interruption semantics, tool execution during speech, and fallback behavior are what will decide whether this feels natural in practice. If those pieces are stable, the exact input and output stack becomes much easier to evolve later.
-
@aniruddhaadak80 You raise a point I hadn't emphasized enough in my proposal, and you're completely right.
That's exactly why I'm drawn to the Live API: not for the audio itself, but because it gives me native primitives for these problems.
Get the orchestration right, and the audio layer becomes swappable.
Same orchestration, different transport underneath.
-
@jacob314, @bdmorgan! Does the new list of projects include
-
What if we use the sandbox just like nodes in OpenClaw?

-
Hello, @bdmorgan!
I am Vivek, an open-source contributor to Google-owned codebases (Cirq, KerasHub). My recent work includes Cirq (#7810, #7843, #7790) and KerasHub (#2525). I'm preparing a GSoC 2026 proposal for Project #11 and wanted to share my approach to this large project before submitting the idea.
1. Motivation and Problem Statement
The Gemini CLI currently supports only text-based interaction, requiring developers to type queries and read responses.
Several community PRs (#1982, #6929, #18499) and the voice-mode MCP attempt to address this gap, but they all share a fundamental architectural limitation: they pipe audio through Whisper/OpenAI STT, inject the resulting text into the CLI prompt, and read responses via a separate TTS service. This creates a multi-layered system with inherent latency, external API dependencies, and no support for true conversational barge-in.
The GSoC project description explicitly requires "Gemini's native multimodal audio capabilities", specifically `client.aio.live.connect()` with `BidiGenerateContent` streaming. This is a fundamentally different architecture that eliminates the intermediate transcription step and enables native interruption (barge-in) support.

2. Technical Architecture
2.1 Core Service: VoiceModeService
The voice functionality will be implemented as a dedicated service in `packages/core/src/voice/VoiceModeService.ts`. This service maintains a completely separate Live API WebSocket session from the existing HTTP-based `GeminiClient`, ensuring isolation of concerns and preventing interference with the CLI's standard chat functionality.

The service will use:

- `gemini-2.0-flash-native-audio-preview` (subject to change)
- the `@google/genai` client's Live API

This was intentionally decoupled from the existing
`GeminiClient` because the Live API session has fundamentally different semantics (persistent WebSocket connection, audio-only content, different error handling) compared to the HTTP REST API.

2.2 Audio I/O Layer
Rather than relying on shell-based audio tools like SoX/ALSA that introduce brittle child process management, this implementation will use naudiodon, a Node-API (N-API) binding to PortAudio. This approach offers several advantages over spawning external processes.
The audio layer in `packages/core/src/voice/audioIO.ts` will handle microphone capture and speaker playback.

2.3 Voice Activation Modes
The implementation will support three activation modes, delivered in phases:
Push-to-Talk (Phase 1 - MVP): A configurable hotkey (default: `Cmd+Shift+V` on macOS, `Ctrl+Shift+V` on Linux/Windows) activates recording while pressed. This is the simplest implementation with the most predictable behavior, making it ideal for an MVP.

Auto-VAD Mode (Phase 2): Leverages the Live API's built-in server-side Voice Activity Detection rather than shipping a client-side Silero model or energy-threshold logic. When the server detects speech cessation, it automatically ends the input turn and begins generating a response.
Wake Word Mode (Phase 3 - if timeline permits): "Hey Gemini" activation using a lightweight client-side wake word detector. This requires careful consideration of native dependency management and power consumption.
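The three phases above could be modeled as a small discriminated union, with the Phase 1 push-to-talk loop as a tiny state machine. This is an illustrative sketch only; the names (`VoiceActivation`, `pttReduce`) are assumptions, not code from the repository.

```typescript
// Illustrative sketch (not from the codebase): the three activation modes
// as a discriminated union, plus a minimal reducer for the Phase 1
// push-to-talk loop.
type VoiceActivation =
  | { mode: "push-to-talk"; hotkey: string }
  | { mode: "auto-vad" } // Phase 2: server-side VAD ends the turn
  | { mode: "wake-word"; phrase: string }; // Phase 3, timeline permitting

type PttState = "idle" | "recording";

// Hotkey down -> start streaming mic audio; hotkey up -> end the user turn.
function pttReduce(state: PttState, event: "keydown" | "keyup"): PttState {
  if (state === "idle" && event === "keydown") return "recording";
  if (state === "recording" && event === "keyup") return "idle";
  return state; // ignore key repeats and out-of-order events
}

const activation: VoiceActivation = {
  mode: "push-to-talk",
  hotkey: process.platform === "darwin" ? "Cmd+Shift+V" : "Ctrl+Shift+V",
};
```

Keeping activation as data rather than behavior would let later phases swap in VAD or wake-word triggers without touching the session orchestration.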
2.4 Interruption Support (Barge-in)
True conversational agents require the ability to interrupt. When the agent is speaking and the user starts talking, the client sends a `BidiGenerateContentClientContent` message with `turn_complete: false` to the Live API. This aligns with the Live API's native interrupt model rather than using a separate `cancel_generation` frame. The server then interrupts its in-progress response and begins processing the new input.

This is a key differentiator from the Whisper-wrapper approaches that cannot interrupt mid-response.
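The barge-in frame described above can be sketched as a pure message builder. The shape mirrors the proposal's `BidiGenerateContentClientContent` / `turn_complete: false` description; `buildBargeInMessage` is a hypothetical helper for illustration, not an actual `@google/genai` API.

```typescript
// Illustrative sketch of the barge-in frame described above. The message
// shape mirrors the proposal's BidiGenerateContentClientContent wording;
// buildBargeInMessage is a hypothetical helper, not a @google/genai call.
interface ClientContentMessage {
  client_content: {
    turns: { role: "user"; parts: { text: string }[] }[];
    turn_complete: boolean;
  };
}

// While the agent is speaking, send the user's new input with
// turn_complete: false so the server treats it as an interruption and
// keeps the user's turn open instead of waiting for the agent to finish.
function buildBargeInMessage(partialText: string): ClientContentMessage {
  return {
    client_content: {
      turns: [{ role: "user", parts: [{ text: partialText }] }],
      turn_complete: false, // the interrupt: the user's turn is not done
    },
  };
}
```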
3. Integration Points with Existing Codebase
3.1 Slash Command Entry Point
The `/voice` slash command will serve as the primary entry point for voice mode, registered through the existing `BuiltinCommandLoader` in `packages/cli/src/ui/hooks/slashCommandProcessor.ts`. A new file `packages/cli/src/ui/commands/voiceCommand.ts` will implement the command interface.

When invoked:

- `/voice` toggles voice mode on/off
- `/voice start` explicitly starts voice mode
- `/voice stop` explicitly stops voice mode

3.2 Settings Integration
Voice configuration will be added to the existing settings schema in `packages/cli/src/config/settingsSchema.ts`. This allows users to configure voice-mode options such as the activation hotkey. The settings follow the same pattern as existing configurations like `general.vimMode` and `terminal.shell`.

3.3 Tool Integration
The existing `ToolRegistry` in `packages/core/src/tools/tool-registry.ts` will be registered on the Live session's function calling interface. This enables file operations, shell commands, grep, glob, and all other tools during voice conversations.

However, tool outputs must be formatted for speech. A new `packages/core/src/voice/responseFormatter.ts` will adapt verbose tool output for spoken delivery.

3.4 UI Components
Voice mode UI will be implemented in `packages/cli/src/ui/components/VoiceMode.tsx` using Ink's component system, built around Ink's `<Box>` component. The existing UI structure in `packages/cli/src/ui/AppContainer.tsx` will be extended with a `VoiceContext` provider to manage voice state across components.

4. Why I Chose This Architecture

In short: the Live API gives native primitives for the hard parts, most notably interruption via `turn_complete: false`, instead of bolting them onto a REST transcription pipeline.

5. Questions for Reviewers
Question 1: Sandboxed Audio Access
Audio hardware access (microphone and speakers) requires exemption from macOS Seatbelt and Docker sandboxing, since `/dev/snd` and microphone devices are blocked by default in containerized environments. Should this be handled via:

(a) A new sandbox capability flag (e.g., `--sandbox-with-audio`) that configures appropriate entitlements, or

(b) Documentation stating that voice mode requires `--no-sandbox` or a custom sandbox profile with audio device access (`--device /dev/snd` on Linux Docker)?

Option (b) is simpler but creates a security trade-off. Option (a) requires changes to the sandbox infrastructure but maintains a better security posture. Which approach aligns better with the project's security philosophy?
Question 2: Dependency Philosophy
The codebase currently uses `node-pty` as an optional dependency for terminal emulation, with graceful degradation when unavailable. Should voice mode follow the same pattern with naudiodon:

(a) Make naudiodon fully optional, with voice mode gracefully disabled when audio libraries are unavailable, or
(b) Make naudiodon a required dependency, treating voice mode as a first-class feature?
Option (a) follows the existing pattern but creates potential user confusion when voice features don't work. Option (b) provides a more consistent experience but increases the installation complexity.
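Option (a) could be sketched as a loader that swallows a failed native-module load, mirroring the `node-pty` pattern. The names (`loadOptionalAudio`, `AudioBackend`) are assumptions for illustration; the throwing loader simulates a missing or unbuilt naudiodon install.

```typescript
// Illustrative sketch of option (a), mirroring the node-pty pattern: treat
// the audio backend as optional and degrade gracefully. loadOptionalAudio
// and AudioBackend are assumed names, not codebase APIs.
type AudioBackend = { name: string };

function loadOptionalAudio(loader: () => AudioBackend): AudioBackend | null {
  try {
    return loader();
  } catch {
    // Native module missing or failed to build: disable voice mode with a
    // hint instead of crashing the CLI at startup.
    return null;
  }
}

const backend = loadOptionalAudio(() => {
  throw new Error("naudiodon not installed"); // simulated failure
});
const voiceAvailable = backend !== null; // false: /voice would print a hint
```

This would also address the confusion risk in option (a): `/voice` can still exist and explain why voice is unavailable, rather than silently disappearing.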
Question 3: Tool Output Verbosity
When the agent needs to execute tools during voice conversation (e.g., reading a file, running tests), how should verbose tool outputs be handled?
(a) Full reading: Convert all tool output to speech, potentially reading multiple paragraphs of file contents or test results
(b) Summarized: Use a secondary LLM call to summarize tool outputs before speech synthesis
(c) Selective: Always provide a one-sentence summary, then ask if user wants details ("I found the bug in src/utils/auth.ts — a missing null check on line 42. Want me to read the full context?")
Option (c) feels most natural for conversation but adds latency. What is the expected user experience for tool execution during voice mode?
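Option (c)'s "one sentence, then offer details" flow could be sketched as a pure formatter. `summarizeForSpeech` is a hypothetical helper, not the planned `responseFormatter.ts` API.

```typescript
// Illustrative sketch of option (c): a one-sentence spoken summary that
// offers details on request. summarizeForSpeech is a hypothetical helper,
// not the planned responseFormatter API.
function summarizeForSpeech(toolName: string, output: string): string {
  // Keep only the first line, truncated, so the TTS utterance stays short.
  const firstLine = (output.split("\n")[0] ?? "").trim();
  const head =
    firstLine.length > 80 ? firstLine.slice(0, 77) + "..." : firstLine;
  return `${toolName} finished: ${head}. Want me to read the full output?`;
}
```

A formatter like this adds no model latency for the summary itself; the extra LLM call of option (b) would only be needed when the user asks for details.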
Question 4: Model Selection
The Live API currently requires `gemini-2.5-flash-native-audio-preview`. Should voice mode:

(a) Use this model exclusively for voice interactions, regardless of the user's configured model for text chat, or
(b) Attempt to use the user's selected model if it supports native audio, falling back to the preview model?
If (b), how should we communicate the model difference to users who have customized their configuration?
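Option (b) plus the communication concern could be sketched as a selection function that reports when a fallback happened, so the UI can print a one-line notice. The capability predicate and helper name are assumptions for illustration.

```typescript
// Illustrative sketch of option (b): prefer the user's configured model when
// it supports native audio, else fall back to the Live preview model. The
// capability predicate and helper name are assumptions for illustration.
const LIVE_FALLBACK = "gemini-2.5-flash-native-audio-preview";

function pickVoiceModel(
  configured: string,
  supportsNativeAudio: (model: string) => boolean,
): { model: string; fellBack: boolean } {
  if (supportsNativeAudio(configured)) {
    return { model: configured, fellBack: false };
  }
  // fellBack lets the UI print a one-line notice explaining why the voice
  // session uses a different model than text chat (the Question 4 concern).
  return { model: LIVE_FALLBACK, fellBack: true };
}
```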
6. Risk Mitigation
Looking forward to your review, brutal course corrections, or thoughts on the design. Thank you!