Conversation
Pull request overview
This PR adds a standalone voice-gateway submodule, cmd/picoclaw-voice, wiring up an end-to-end voice conversation pipeline of ASR → LLM (streaming) → TTS (streaming). It also adds ASR/TTS provider abstractions and streaming output for the OpenAI-compat LLM provider in the main project, to support low-latency playback and "thinking" event notifications in voice scenarios.
Changes:
- Add ASR/TTS abstraction interfaces and doubao / funasr / fishspeech providers, implementing streaming recognition and synthesis
- Add the picoclaw-voice WebSocket gateway: handshake-time audio-format negotiation, VAD control, streaming LLM sentence segmentation, and a concurrent TTS pipeline
- Add ChatStream to the OpenAI-compat provider, plus a new streaming loop entry point in AgentLoop
Reviewed changes
Copilot reviewed 29 out of 32 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| pkg/tts/provider.go | New TTS Provider abstraction and factory registration mechanism |
| pkg/tts/ogg.go | Utility that unpacks an Ogg Opus stream into Opus packets following the lacing rules |
| pkg/tts/fishspeech/provider.go | Fish Speech HTTP TTS provider (streaming PCM) |
| pkg/tts/doubao/provider.go | Doubao TTS WebSocket provider (binary protocol, streaming Ogg Opus) |
| pkg/providers/streaming.go | New LLM StreamingProvider interface definition |
| pkg/providers/openai_compat/streaming.go | SSE streaming parsing and ChatStream implementation for OpenAI-compat |
| pkg/providers/http_provider.go | HTTPProvider passes ChatStream through to openai_compat.Provider |
| pkg/asr/provider.go | New ASR Provider abstraction, Streaming/Realtime extension interfaces, and session abstraction |
| pkg/asr/funasr/provider.go | FunASR WebSocket realtime ASR provider |
| pkg/asr/doubao/provider.go | Doubao ASR WebSocket provider (binary protocol), including a realtime session |
| pkg/asr/doubao/provider_test.go | Unit tests for Doubao ASR protocol frame construction/parsing |
| pkg/agent/stream.go | New streaming AgentLoop (token callback output for the voice scenario) |
| go.mod | Adds pion/opus (indirect) for Ogg/Opus handling |
| go.sum | Dependency checksum updates |
| docs/channels/xiaozhi/README.zh.md | New xiaozhi (picoclaw-voice) protocol and sequence documentation |
| docker/docker-compose.voice.yml | Compose deployment example for picoclaw + picoclaw-voice |
| docker/docker-compose.asr-tts.yml | Compose stack for local FunASR + Fish Speech inference |
| cmd/picoclaw-voice/protocol.go | Message protocol structs/constructors for gateway ↔ client |
| cmd/picoclaw-voice/main.go | Gateway entry point: env configuration, provider initialization, WebSocket listener |
| cmd/picoclaw-voice/handler.go | Core session logic: device registration, concurrent ASR/LLM/TTS pipeline, thinking filter |
| cmd/picoclaw-voice/handler_test.go | Unit tests for sentence segmentation and thinking-filter logic |
| cmd/picoclaw-voice/go.mod | Submodule go.mod (nested module) |
| cmd/picoclaw-voice/go.sum | Submodule dependency checksums |
| cmd/picoclaw-voice/demo/requirements.txt | Python demo dependencies |
| cmd/picoclaw-voice/demo/client.py | Python demo: handshake negotiation, recording/file modes, opus/pcm playback |
| cmd/picoclaw-voice/demo/README.md | Demo usage instructions |
| cmd/picoclaw-voice/README.md | Gateway usage and environment-variable documentation |
| cmd/picoclaw-voice/Dockerfile | Gateway image build (multi-stage, scratch runtime) |
| cmd/picoclaw-voice/.env.example | Gateway env configuration template |
| .gitignore | Ignore the picoclaw-voice binary and new config-file patterns |
| .github/copilot-instructions.md | New in-repo Copilot guide |
| .dockerignore | Keep the picoclaw-voice binary and .env out of the build context |
#### hello — handshake response

```json
{
  "type": "hello",
  "version": 3,
  "transport": "websocket",
  "session_id": "uuid-v4",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
```
```go
func (s *session) handleAudio(data []byte) {
	// Lightweight frame validation (the format was negotiated in the hello
	// phase; this guards against defective client implementations).
	if !s.validateAudioFrame(data) {
		return
	}
	frame := make([]byte, len(data))
	copy(frame, data)
	if s.asrFeedCh != nil {
		// Realtime mode: push straight to ASR, no buffering.
		select {
		case s.asrFeedCh <- asrChunk{frame: frame}:
		default:
			log.Printf("picoclaw-voice: asr feed channel full, dropping frame")
		}
	} else {
		// Batch mode: accumulate into audioBuf; run ASR once after VAD end.
		s.audioBuf = append(s.audioBuf, frame)
	}
}
```
```go
requestBody := map[string]any{
	"model":    model,
	"messages": serializeMessages(messages),
	"stream":   true,
}
if len(tools) > 0 {
	requestBody["tools"] = tools
	requestBody["tool_choice"] = "auto"
}
if maxTokens, ok := asInt(options["max_tokens"]); ok {
	fieldName := p.maxTokensField
	if fieldName == "" {
		lm := strings.ToLower(model)
		if strings.Contains(lm, "glm") || strings.Contains(lm, "o1") || strings.Contains(lm, "gpt-5") {
			fieldName = "max_completion_tokens"
		} else {
			fieldName = "max_tokens"
		}
	}
	requestBody[fieldName] = maxTokens
}
if temperature, ok := asFloat(options["temperature"]); ok {
	lm := strings.ToLower(model)
	if strings.Contains(lm, "kimi") && strings.Contains(lm, "k2") {
		requestBody["temperature"] = 1.0
	} else {
		requestBody["temperature"] = temperature
	}
}

jsonData, err := json.Marshal(requestBody)
if err != nil {
	return nil, fmt.Errorf("failed to marshal request: %w", err)
}

req, err := http.NewRequestWithContext(ctx, http.MethodPost, p.apiBase+"/chat/completions", bytes.NewReader(jsonData))
if err != nil {
	return nil, fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
if p.apiKey != "" {
	req.Header.Set("Authorization", "Bearer "+p.apiKey)
}

// Use a client without a read timeout — context handles cancellation.
streamClient := &http.Client{Transport: p.httpClient.Transport}
resp, err := streamClient.Do(req)
if err != nil {
	return nil, fmt.Errorf("failed to send request: %w", err)
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 256))
	return nil, fmt.Errorf("API request failed: status=%d body=%s", resp.StatusCode, responsePreview(body, 128))
}
```
```go
toolCalls := make([]ToolCall, 0, len(builders))
for i := 0; i < len(builders); i++ {
	b, ok := builders[i]
	if !ok {
		break
	}
	toolCalls = append(toolCalls, ToolCall{
		ID:   b.id,
		Type: b.callType,
		Name: b.name,
		Function: &FunctionCall{
			Name:      b.name,
			Arguments: b.args.String(),
		},
	})
}
```
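One behavior worth noting in this loop: iterating `i` from 0 to `len(builders)` over an index-keyed map only works when the stream's tool-call indices are contiguous from 0; the early `break` means a gap truncates every later call. A simplified sketch of that behavior (the `builder` struct is reduced to illustrate indexing only):

```go
package main

import "fmt"

// builder is a simplified stand-in for the streaming tool-call accumulator.
type builder struct{ id, name string }

// assemble mirrors the loop above: collect builders by index, stopping at
// the first missing index.
func assemble(builders map[int]builder) []string {
	out := []string{}
	for i := 0; i < len(builders); i++ {
		b, ok := builders[i]
		if !ok {
			break // a gap (e.g. indices 0 and 2 present, 1 missing) drops later entries
		}
		out = append(out, b.name)
	}
	return out
}

func main() {
	fmt.Println(assemble(map[int]builder{0: {"a", "f0"}, 1: {"b", "f1"}})) // [f0 f1]
	fmt.Println(assemble(map[int]builder{0: {"a", "f0"}, 2: {"c", "f2"}})) // [f0]
}
```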
```go
body, _ := json.Marshal(payload)
req, err := http.NewRequest(http.MethodPost, p.apiBase+"/v1/tts", bytes.NewReader(body))
if err != nil {
```
```go
// Close the connection when ctx is cancelled so ReadMessage unblocks immediately.
go func() {
	<-ctx.Done()
	conn.Close()
}()
```
```go
// Prefer loading .env from the executable's directory, falling back to the
// current working directory. Variables already present in the environment
// are skipped (not overridden), i.e. Docker `environment:` beats .env.
if exe, err := os.Executable(); err == nil {
	loadDotEnv(filepath.Join(filepath.Dir(exe), ".env"))
} else {
	loadDotEnv(".env")
}
```
```go
s.processStreamSpeech(pipeCtx, sess, feedCh, time.Now(), turnID, memoryID)
// Pipeline finished; clear the references. Once handleAudio sees nil it
// switches back to batch buffering mode. The slight race is acceptable:
// at worst a few frames are dropped, and the next listen.start reassigns.
s.asrSess = nil
s.asrFeedCh = nil
```
```go
// Close the connection when ctx is cancelled so ReadMessage returns immediately.
go func() {
	<-ctx.Done()
	conn.Close()
}()
```
```go
upgrader := websocket.Upgrader{
	CheckOrigin: func(r *http.Request) bool { return true },
}
```
A standalone module, cmd/picoclaw-voice, implementing an end-to-end voice conversation pipeline (ASR → LLM → TTS). The protocol layer is based on the xiaozhi WebSocket protocol, with moderate extensions.

- Streaming ASR: Doubao / FunASR (local)
- Streaming TTS: Doubao / Fish Speech (local)
- One-command Docker Compose deployment of the local ASR + TTS services
- Python demo client for integration testing
Nice, we've also recently discussed integrating ASR+TTS functionality internally. I think there are still some TODOs to improve:
Thanks for your feedback
666! Fully support this, it does exactly what I wanted to build. Impressive!
#1648 has a more detailed proposal.
A standalone submodule, cmd/picoclaw-voice, implementing the complete end-to-end voice conversation pipeline.