新增asr和tts功能 by kkroid · Pull Request #1642 · sipeed/picoclaw

kkroid · 2026-03-16T09:13:58Z

独立子模块 cmd/picoclaw-voice，实现完整端到端语音对话链路

协议层参考 xiaozhi 私有 WebSocket 协议设计，未绑定具体硬件。
流式ASR: 豆包 / FunASR 本地离线版本
流式TTS: 豆包 / Fish Speech 本地离线版本

CLAassistant · 2026-03-16T09:15:58Z

All committers have signed the CLA.

Copilot

Pull request overview

该 PR 新增独立语音网关子模块 cmd/picoclaw-voice，打通 ASR → LLM（流式）→ TTS（流式） 的端到端语音对话链路，并在主工程内补齐 ASR/TTS provider 抽象与 OpenAI-compat LLM 的流式输出能力，以支持语音场景的低延迟播放与“thinking”事件通知。

Changes:

新增 ASR/TTS 抽象接口与 doubao / funasr / fishspeech provider，实现流式识别与合成
新增 picoclaw-voice WebSocket 网关：握手协商音频格式、VAD 控制、流式 LLM 断句与 TTS 并发流水线
为 OpenAI-compat provider 增加 ChatStream，并在 AgentLoop 中新增流式 loop 入口

Reviewed changes

Copilot reviewed 29 out of 32 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
pkg/tts/provider.go	新增 TTS Provider 抽象与工厂注册机制
pkg/tts/ogg.go	提供 Ogg Opus 按 lacing 规则解包为 Opus 包的工具函数
pkg/tts/fishspeech/provider.go	Fish Speech HTTP TTS（PCM 流式）provider
pkg/tts/doubao/provider.go	豆包 TTS WebSocket（二进制协议，Ogg Opus 流式）provider
pkg/providers/streaming.go	新增 LLM StreamingProvider 接口定义
pkg/providers/openai_compat/streaming.go	OpenAI-compat 的 SSE streaming 解析与 `ChatStream` 实现
pkg/providers/http_provider.go	HTTPProvider 透传 `ChatStream` 到 openai_compat.Provider
pkg/asr/provider.go	新增 ASR Provider 抽象、Streaming/Realtime 扩展接口与 session 抽象
pkg/asr/funasr/provider.go	FunASR WebSocket 实时 ASR provider
pkg/asr/doubao/provider.go	豆包 ASR WebSocket（二进制协议）provider，含 realtime session
pkg/asr/doubao/provider_test.go	豆包 ASR 协议帧构造/解析单测
pkg/agent/stream.go	新增流式 AgentLoop（对 voice 场景输出 token 回调）
go.mod	引入 pion/opus（间接）用于 Ogg/Opus 处理
go.sum	依赖校验和更新
docs/channels/xiaozhi/README.zh.md	新增 xiaozhi（picoclaw-voice）协议与时序文档
docker/docker-compose.voice.yml	增加 picoclaw + picoclaw-voice 的 compose 部署示例
docker/docker-compose.asr-tts.yml	增加 FunASR + Fish Speech 本地推理栈 compose
cmd/picoclaw-voice/protocol.go	定义网关与客户端的消息协议结构体/构造器
cmd/picoclaw-voice/main.go	网关入口：env 配置、provider 初始化、WebSocket 监听
cmd/picoclaw-voice/handler.go	核心会话逻辑：设备注册、ASR/LLM/TTS 并发流水线、thinking 过滤
cmd/picoclaw-voice/handler_test.go	断句与 thinking 过滤逻辑单测
cmd/picoclaw-voice/go.mod	子模块 go.mod（嵌套模块）
cmd/picoclaw-voice/go.sum	子模块依赖校验和
cmd/picoclaw-voice/demo/requirements.txt	Python demo 依赖
cmd/picoclaw-voice/demo/client.py	Python demo：握手协商、录音/文件模式、opus/pcm 播放
cmd/picoclaw-voice/demo/README.md	demo 使用说明
cmd/picoclaw-voice/README.md	网关使用说明与环境变量文档
cmd/picoclaw-voice/Dockerfile	网关镜像构建（多阶段，scratch 运行时）
cmd/picoclaw-voice/.env.example	网关 env 配置模板
.gitignore	忽略 picoclaw-voice 二进制与新增配置文件模式
.github/copilot-instructions.md	新增仓库内 Copilot 指南
.dockerignore	忽略 picoclaw-voice 二进制与 .env 进入 build context

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

+#### hello — 握手响应
+
+```json
+{
+  "type": "hello",
+  "version": 3,
+  "transport": "websocket",
+  "session_id": "uuid-v4",
+  "audio_params": {
+    "format": "opus",
+    "sample_rate": 16000,
+    "channels": 1,
+    "frame_duration": 60
+  }
+}


+func (s *session) handleAudio(data []byte) {
+	// 轻量校验帧格式（格式已在 hello 阶段协商，防御客户端实现缺陷）
+	if !s.validateAudioFrame(data) {
+		return
+	}
+	frame := make([]byte, len(data))
+	copy(frame, data)
+	if s.asrFeedCh != nil {
+		// 实时模式：直接推送给 ASR，不缓冲
+		select {
+		case s.asrFeedCh <- asrChunk{frame: frame}:
+		default:
+			log.Printf("picoclaw-voice: asr feed channel full, dropping frame")
+		}
+	} else {
+		// 批量模式：累积到 audioBuf，等 VAD end 后一次性 ASR
+		s.audioBuf = append(s.audioBuf, frame)
+	}


+	requestBody := map[string]any{
+		"model":    model,
+		"messages": serializeMessages(messages),
+		"stream":   true,
+	}
+	if len(tools) > 0 {
+		requestBody["tools"] = tools
+		requestBody["tool_choice"] = "auto"
+	}
+	if maxTokens, ok := asInt(options["max_tokens"]); ok {
+		fieldName := p.maxTokensField
+		if fieldName == "" {
+			lm := strings.ToLower(model)
+			if strings.Contains(lm, "glm") || strings.Contains(lm, "o1") || strings.Contains(lm, "gpt-5") {
+				fieldName = "max_completion_tokens"
+			} else {
+				fieldName = "max_tokens"
+			}
+		}
+		requestBody[fieldName] = maxTokens
+	}
+	if temperature, ok := asFloat(options["temperature"]); ok {
+		lm := strings.ToLower(model)
+		if strings.Contains(lm, "kimi") && strings.Contains(lm, "k2") {
+			requestBody["temperature"] = 1.0
+		} else {
+			requestBody["temperature"] = temperature
+		}
+	}
+
+	jsonData, err := json.Marshal(requestBody)
+	if err != nil {
+		return nil, fmt.Errorf("failed to marshal request: %w", err)
+	}
+
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, p.apiBase+"/chat/completions", bytes.NewReader(jsonData))
+	if err != nil {
+		return nil, fmt.Errorf("failed to create request: %w", err)
+	}
+	req.Header.Set("Content-Type", "application/json")
+	if p.apiKey != "" {
+		req.Header.Set("Authorization", "Bearer "+p.apiKey)
+	}
+
+	// Use a client without a read timeout — context handles cancellation.
+	streamClient := &http.Client{Transport: p.httpClient.Transport}
+	resp, err := streamClient.Do(req)
+	if err != nil {
+		return nil, fmt.Errorf("failed to send request: %w", err)
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		body, _ := io.ReadAll(io.LimitReader(resp.Body, 256))
+		return nil, fmt.Errorf("API request failed: status=%d body=%s", resp.StatusCode, responsePreview(body, 128))
+	}


+	toolCalls := make([]ToolCall, 0, len(builders))
+	for i := 0; i < len(builders); i++ {
+		b, ok := builders[i]
+		if !ok {
+			break
+		}
+		toolCalls = append(toolCalls, ToolCall{
+			ID:   b.id,
+			Type: b.callType,
+			Name: b.name,
+			Function: &FunctionCall{
+				Name:      b.name,
+				Arguments: b.args.String(),
+			},
+		})
+	}


+	body, _ := json.Marshal(payload)
+	req, err := http.NewRequest(http.MethodPost, p.apiBase+"/v1/tts", bytes.NewReader(body))
+	if err != nil {


+	// ctx 取消时关闭连接，使 ReadMessage 立即解除阻塞。
+	go func() {
+		<-ctx.Done()
+		conn.Close()
+	}()
+


+	// 优先从可执行文件所在目录加载 .env，回退到当前工作目录。
+	// 已有同名环境变量时跳过（不覆盖），即 Docker environment: 覆盖 > .env。
+	if exe, err := os.Executable(); err == nil {
+		loadDotEnv(filepath.Join(filepath.Dir(exe), ".env"))
+	} else {
+		loadDotEnv(".env")
+	}


+						s.processStreamSpeech(pipeCtx, sess, feedCh, time.Now(), turnID, memoryID)
+						// 流水线结束，清理引用。handleAudio 看到 nil 后切换到批量缓冲模式。
+						// 轻微竞争可接受：最多丢个别帧，下一轮 listen.start 会重新赋值。
+						s.asrSess = nil
+						s.asrFeedCh = nil


+	// ctx 取消时关闭连接，使 ReadMessage 立即返回
+	go func() {
+		<-ctx.Done()
+		conn.Close()
+	}()


+	upgrader := websocket.Upgrader{
+		CheckOrigin: func(r *http.Request) bool { return true },
+	}


独立模块 cmd/picoclaw-voice，实现端到端语音对话链路（ASR → LLM → TTS）。协议层基于 xiaozhi WebSocket 协议，做了适当扩展。 - 流式 ASR：豆包 / FunASR（本地） - 流式 TTS：豆包 / Fish Speech（本地） - Docker Compose 一键部署本地 ASR + TTS 服务 - Python demo 客户端用于联调

lxowalle · 2026-03-16T10:34:17Z

Nice, we've also recently discussed integrating ASR+TTS functionality internally. I think there are still some TODO to improve:

Need to finalize the configuration file to allow users to customize their own ASR and TTS models. For example, setting which TTS model to use, how to configure the KEY, how to enable it, etc.
Need to determine how to handle non-streaming scenarios, such as when an audio file is sent through the channel
Integrate ASR+TTS into the gateway

kkroid · 2026-03-16T10:47:44Z

Nice, we've also recently discussed integrating ASR+TTS functionality internally. I think there are still some TODO to improve:

Need to finalize the configuration file to allow users to customize their own ASR and TTS models. For example, setting which TTS model to use, how to configure the KEY, how to enable it, etc.

Need to determine how to handle non-streaming scenarios, such as when an audio file is sent through the channel

Integrate ASR+TTS into the gateway

Thanks for your feedback

Config — Currently using env vars + .env file. Happy to align with whatever config format the team prefers (e.g. extending config.json/config.yaml).
Non-streaming — Will look into supporting audio file input via channels as a follow-up.
Gateway integration — Currently this PR keeps it as a standalone module. I agree that integrating them into the main gateway is a good idea.

opcache · 2026-03-17T03:12:58Z

666 ，支持，做了我想做的，牛啊

lxowalle · 2026-03-18T02:40:20Z

#1648 has a more detailed proposal.

Copilot AI review requested due to automatic review settings March 16, 2026 09:13

Copilot started reviewing on behalf of kkroid March 16, 2026 09:14 View session

Copilot AI reviewed Mar 16, 2026

View reviewed changes

kkroid force-pushed the feat/local-asr-tts-providers branch from f6f0c97 to 350f5a2 Compare March 16, 2026 09:23

sipeed-bot Bot added type: enhancement New feature or request domain: provider domain: channel go Pull requests that update go code labels Mar 16, 2026

lxowalle mentioned this pull request Mar 18, 2026

[Feature] Adding TTS and ASR Support to PicoClaw #1648

Open

lxowalle closed this Mar 18, 2026

github-actions Bot mentioned this pull request Mar 21, 2026

🦞 OpenClaw 生态日报 2026-03-21 gsscsd/big_model_radar#71

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

新增asr和tts功能#1642

新增asr和tts功能#1642
kkroid wants to merge 1 commit intosipeed:mainfrom
kkroid:feat/local-asr-tts-providers

kkroid commented Mar 16, 2026

Uh oh!

CLAassistant commented Mar 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

lxowalle commented Mar 16, 2026

Uh oh!

kkroid commented Mar 16, 2026

Uh oh!

opcache commented Mar 17, 2026

Uh oh!

lxowalle commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

kkroid commented Mar 16, 2026

Uh oh!

CLAassistant commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

lxowalle commented Mar 16, 2026

Uh oh!

kkroid commented Mar 16, 2026

Uh oh!

opcache commented Mar 17, 2026

Uh oh!

lxowalle commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

CLAassistant commented Mar 16, 2026 •

edited

Loading