Skip to content

新增asr和tts功能#1642

Closed
kkroid wants to merge 1 commit intosipeed:mainfrom
kkroid:feat/local-asr-tts-providers
Closed

新增asr和tts功能#1642
kkroid wants to merge 1 commit intosipeed:mainfrom
kkroid:feat/local-asr-tts-providers

Conversation

@kkroid
Copy link
Copy Markdown

@kkroid kkroid commented Mar 16, 2026

独立子模块 cmd/picoclaw-voice,实现完整端到端语音对话链路

  • 协议层参考 xiaozhi 私有 WebSocket 协议设计,未绑定具体硬件。
  • 流式ASR: 豆包 / FunASR 本地离线版本
  • 流式TTS: 豆包 / Fish Speech 本地离线版本

Copilot AI review requested due to automatic review settings March 16, 2026 09:13
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 16, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

该 PR 新增独立语音网关子模块 cmd/picoclaw-voice,打通 ASR → LLM(流式)→ TTS(流式) 的端到端语音对话链路,并在主工程内补齐 ASR/TTS provider 抽象与 OpenAI-compat LLM 的流式输出能力,以支持语音场景的低延迟播放与“thinking”事件通知。

Changes:

  • 新增 ASR/TTS 抽象接口与 doubao / funasr / fishspeech provider,实现流式识别与合成
  • 新增 picoclaw-voice WebSocket 网关:握手协商音频格式、VAD 控制、流式 LLM 断句与 TTS 并发流水线
  • 为 OpenAI-compat provider 增加 ChatStream,并在 AgentLoop 中新增流式 loop 入口

Reviewed changes

Copilot reviewed 29 out of 32 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
pkg/tts/provider.go 新增 TTS Provider 抽象与工厂注册机制
pkg/tts/ogg.go 提供 Ogg Opus 按 lacing 规则解包为 Opus 包的工具函数
pkg/tts/fishspeech/provider.go Fish Speech HTTP TTS(PCM 流式)provider
pkg/tts/doubao/provider.go 豆包 TTS WebSocket(二进制协议,Ogg Opus 流式)provider
pkg/providers/streaming.go 新增 LLM StreamingProvider 接口定义
pkg/providers/openai_compat/streaming.go OpenAI-compat 的 SSE streaming 解析与 ChatStream 实现
pkg/providers/http_provider.go HTTPProvider 透传 ChatStream 到 openai_compat.Provider
pkg/asr/provider.go 新增 ASR Provider 抽象、Streaming/Realtime 扩展接口与 session 抽象
pkg/asr/funasr/provider.go FunASR WebSocket 实时 ASR provider
pkg/asr/doubao/provider.go 豆包 ASR WebSocket(二进制协议)provider,含 realtime session
pkg/asr/doubao/provider_test.go 豆包 ASR 协议帧构造/解析单测
pkg/agent/stream.go 新增流式 AgentLoop(对 voice 场景输出 token 回调)
go.mod 引入 pion/opus(间接)用于 Ogg/Opus 处理
go.sum 依赖校验和更新
docs/channels/xiaozhi/README.zh.md 新增 xiaozhi(picoclaw-voice)协议与时序文档
docker/docker-compose.voice.yml 增加 picoclaw + picoclaw-voice 的 compose 部署示例
docker/docker-compose.asr-tts.yml 增加 FunASR + Fish Speech 本地推理栈 compose
cmd/picoclaw-voice/protocol.go 定义网关与客户端的消息协议结构体/构造器
cmd/picoclaw-voice/main.go 网关入口:env 配置、provider 初始化、WebSocket 监听
cmd/picoclaw-voice/handler.go 核心会话逻辑:设备注册、ASR/LLM/TTS 并发流水线、thinking 过滤
cmd/picoclaw-voice/handler_test.go 断句与 thinking 过滤逻辑单测
cmd/picoclaw-voice/go.mod 子模块 go.mod(嵌套模块)
cmd/picoclaw-voice/go.sum 子模块依赖校验和
cmd/picoclaw-voice/demo/requirements.txt Python demo 依赖
cmd/picoclaw-voice/demo/client.py Python demo:握手协商、录音/文件模式、opus/pcm 播放
cmd/picoclaw-voice/demo/README.md demo 使用说明
cmd/picoclaw-voice/README.md 网关使用说明与环境变量文档
cmd/picoclaw-voice/Dockerfile 网关镜像构建(多阶段,scratch 运行时)
cmd/picoclaw-voice/.env.example 网关 env 配置模板
.gitignore 忽略 picoclaw-voice 二进制与新增配置文件模式
.github/copilot-instructions.md 新增仓库内 Copilot 指南
.dockerignore 忽略 picoclaw-voice 二进制与 .env 进入 build context

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +110 to +124
#### hello — 握手响应

```json
{
"type": "hello",
"version": 3,
"transport": "websocket",
"session_id": "uuid-v4",
"audio_params": {
"format": "opus",
"sample_rate": 16000,
"channels": 1,
"frame_duration": 60
}
}
Comment on lines +316 to +333
func (s *session) handleAudio(data []byte) {
// 轻量校验帧格式(格式已在 hello 阶段协商,防御客户端实现缺陷)
if !s.validateAudioFrame(data) {
return
}
frame := make([]byte, len(data))
copy(frame, data)
if s.asrFeedCh != nil {
// 实时模式:直接推送给 ASR,不缓冲
select {
case s.asrFeedCh <- asrChunk{frame: frame}:
default:
log.Printf("picoclaw-voice: asr feed channel full, dropping frame")
}
} else {
// 批量模式:累积到 audioBuf,等 VAD end 后一次性 ASR
s.audioBuf = append(s.audioBuf, frame)
}
Comment on lines +35 to +90
requestBody := map[string]any{
"model": model,
"messages": serializeMessages(messages),
"stream": true,
}
if len(tools) > 0 {
requestBody["tools"] = tools
requestBody["tool_choice"] = "auto"
}
if maxTokens, ok := asInt(options["max_tokens"]); ok {
fieldName := p.maxTokensField
if fieldName == "" {
lm := strings.ToLower(model)
if strings.Contains(lm, "glm") || strings.Contains(lm, "o1") || strings.Contains(lm, "gpt-5") {
fieldName = "max_completion_tokens"
} else {
fieldName = "max_tokens"
}
}
requestBody[fieldName] = maxTokens
}
if temperature, ok := asFloat(options["temperature"]); ok {
lm := strings.ToLower(model)
if strings.Contains(lm, "kimi") && strings.Contains(lm, "k2") {
requestBody["temperature"] = 1.0
} else {
requestBody["temperature"] = temperature
}
}

jsonData, err := json.Marshal(requestBody)
if err != nil {
return nil, fmt.Errorf("failed to marshal request: %w", err)
}

req, err := http.NewRequestWithContext(ctx, http.MethodPost, p.apiBase+"/chat/completions", bytes.NewReader(jsonData))
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
if p.apiKey != "" {
req.Header.Set("Authorization", "Bearer "+p.apiKey)
}

// Use a client without a read timeout — context handles cancellation.
streamClient := &http.Client{Transport: p.httpClient.Transport}
resp, err := streamClient.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to send request: %w", err)
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(io.LimitReader(resp.Body, 256))
return nil, fmt.Errorf("API request failed: status=%d body=%s", resp.StatusCode, responsePreview(body, 128))
}
Comment on lines +176 to +191
toolCalls := make([]ToolCall, 0, len(builders))
for i := 0; i < len(builders); i++ {
b, ok := builders[i]
if !ok {
break
}
toolCalls = append(toolCalls, ToolCall{
ID: b.id,
Type: b.callType,
Name: b.name,
Function: &FunctionCall{
Name: b.name,
Arguments: b.args.String(),
},
})
}
Comment on lines +206 to +208
body, _ := json.Marshal(payload)
req, err := http.NewRequest(http.MethodPost, p.apiBase+"/v1/tts", bytes.NewReader(body))
if err != nil {
Comment on lines +154 to +159
// ctx 取消时关闭连接,使 ReadMessage 立即解除阻塞。
go func() {
<-ctx.Done()
conn.Close()
}()

Comment on lines +99 to +105
// 优先从可执行文件所在目录加载 .env,回退到当前工作目录。
// 已有同名环境变量时跳过(不覆盖),即 Docker environment: 覆盖 > .env。
if exe, err := os.Executable(); err == nil {
loadDotEnv(filepath.Join(filepath.Dir(exe), ".env"))
} else {
loadDotEnv(".env")
}
Comment on lines +249 to +253
s.processStreamSpeech(pipeCtx, sess, feedCh, time.Now(), turnID, memoryID)
// 流水线结束,清理引用。handleAudio 看到 nil 后切换到批量缓冲模式。
// 轻微竞争可接受:最多丢个别帧,下一轮 listen.start 会重新赋值。
s.asrSess = nil
s.asrFeedCh = nil
Comment on lines +107 to +111
// ctx 取消时关闭连接,使 ReadMessage 立即返回
go func() {
<-ctx.Done()
conn.Close()
}()
Comment on lines +161 to +163
upgrader := websocket.Upgrader{
CheckOrigin: func(r *http.Request) bool { return true },
}
独立模块 cmd/picoclaw-voice,实现端到端语音对话链路(ASR → LLM → TTS)。
协议层基于 xiaozhi WebSocket 协议,做了适当扩展。

- 流式 ASR:豆包 / FunASR(本地)
- 流式 TTS:豆包 / Fish Speech(本地)
- Docker Compose 一键部署本地 ASR + TTS 服务
- Python demo 客户端用于联调
@kkroid kkroid force-pushed the feat/local-asr-tts-providers branch from f6f0c97 to 350f5a2 Compare March 16, 2026 09:23
@sipeed-bot sipeed-bot Bot added type: enhancement New feature or request domain: provider domain: channel go Pull requests that update go code labels Mar 16, 2026
@lxowalle
Copy link
Copy Markdown
Collaborator

Nice, we've also recently discussed integrating ASR+TTS functionality internally. I think there are still some TODO to improve:

  1. Need to finalize the configuration file to allow users to customize their own ASR and TTS models. For example, setting which TTS model to use, how to configure the KEY, how to enable it, etc.
  2. Need to determine how to handle non-streaming scenarios, such as when an audio file is sent through the channel
  3. Integrate ASR+TTS into the gateway

@kkroid
Copy link
Copy Markdown
Author

kkroid commented Mar 16, 2026

Nice, we've also recently discussed integrating ASR+TTS functionality internally. I think there are still some TODO to improve:

  1. Need to finalize the configuration file to allow users to customize their own ASR and TTS models. For example, setting which TTS model to use, how to configure the KEY, how to enable it, etc.
  2. Need to determine how to handle non-streaming scenarios, such as when an audio file is sent through the channel
  3. Integrate ASR+TTS into the gateway

Thanks for your feedback

  1. Config — Currently using env vars + .env file. Happy to align with whatever config format the team prefers (e.g. extending config.json/config.yaml).
  2. Non-streaming — Will look into supporting audio file input via channels as a follow-up.
  3. Gateway integration — Currently this PR keeps it as a standalone module. I agree that integrating them into the main gateway is a good idea.

@opcache
Copy link
Copy Markdown
Contributor

opcache commented Mar 17, 2026

666 ,支持,做了我想做的,牛啊

@lxowalle
Copy link
Copy Markdown
Collaborator

#1648 has a more detailed proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: channel domain: provider go Pull requests that update go code type: enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants