feat: add audio transcription via Groq Whisper API #977
virat-mankali wants to merge 4 commits into sipeed:main from
Conversation
Pull request overview
Adds inbound audio transcription support so voice notes/audio files can be transcribed (via Groq Whisper) and the resulting text is injected into the user message before the LLM processes it.
Changes:
- Introduces a new pkg/transcribe Groq Whisper client using only the Go stdlib.
- Extends configuration with tools.transcribe (enabled/api_key/model).
- Wires transcription into AgentLoop.processMessage by resolving inbound media refs and appending transcripts into the user message.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/transcribe/groq.go | New Groq Whisper transcription client (multipart upload + JSON response parsing). |
| pkg/config/config.go | Adds TranscribeConfig and exposes it under ToolsConfig. |
| pkg/agent/loop.go | Conditionally enables transcriber and injects transcripts into inbound user messages. |
    var body bytes.Buffer
    w := multipart.NewWriter(&body)
This implementation buffers the entire audio file into a bytes.Buffer before sending the request, so peak memory usage scales with the audio file size (plus multipart overhead). If large uploads are possible, consider streaming the multipart body (e.g., io.Pipe) to keep memory bounded.
    // Transcriber transcribes audio files using the Groq Whisper API.
    type Transcriber struct {
        apiKey     string
        model      string
        httpClient *http.Client
    }

    // New creates a Transcriber with the given Groq API key and optional model override.
    // If model is empty, whisper-large-v3-turbo is used.
    func New(apiKey, model string) *Transcriber {
        m := model
        if m == "" {
            m = defaultModel
        }
        return &Transcriber{
            apiKey: apiKey,
            model:  m,
            httpClient: &http.Client{
                Timeout: 60 * time.Second,
            },
        }
    }
There is already a Groq Whisper client in pkg/voice/transcriber.go (GroqTranscriber) that builds the same multipart request. Having two separate implementations increases maintenance burden (bug fixes, API changes, timeouts, etc. need to be duplicated); consider reusing/refactoring to a single shared implementation or clearly separating responsibilities (e.g., pkg/transcribe wraps pkg/voice).
    } else if al.transcriber == nil && len(msg.Media) > 0 {
        logger.WarnCF("agent", "Transcriber not configured, skipping media", map[string]any{
            "media_count": len(msg.Media),
        })
    }
This branch will log a WARN for any inbound message with media when transcription is disabled (al.transcriber == nil), including images and other non-audio attachments. That will be noisy in normal operation; gate this warning on "transcribe enabled but misconfigured" (e.g., al.cfg.Tools.Transcribe.Enabled) and/or only warn when an audio attachment is detected.
        continue
    }
    logger.InfoCF("agent", "Transcribing audio", map[string]any{
        "ref": ref, "path": localPath, "filename": meta.Filename,
Logging the resolved local file path at INFO level may leak host filesystem details into logs. Consider omitting the path field (the media ref + filename is usually sufficient) or moving path logging to DEBUG only.
| "ref": ref, "path": localPath, "filename": meta.Filename, | |
| "ref": ref, "filename": meta.Filename, |
    // Replace the [voice]/[audio] placeholder with the actual transcript
    userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
    userMessage = strings.TrimSpace(userMessage)
    if userMessage != "" {
        userMessage = userMessage + "\n\n[Voice transcript]: " + transcript
    } else {
        userMessage = transcript
If the transcription result is empty/whitespace (e.g., silence or API returns empty text), the code will still strip placeholders and append an empty "[Voice transcript]:" block. Handle empty transcripts by skipping injection/logging a debug message so userMessage isn't degraded.
Suggested change (replaces the block quoted above; skips injection when the transcript is empty):

    trimmedTranscript := strings.TrimSpace(transcript)
    if trimmedTranscript == "" {
        logger.DebugCF("agent", "Empty audio transcription result, skipping transcript injection", map[string]any{
            "ref": ref,
        })
        continue
    }
    // Replace the [voice]/[audio] placeholder with the actual transcript
    userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
    userMessage = strings.TrimSpace(userMessage)
    if userMessage != "" {
        userMessage = userMessage + "\n\n[Voice transcript]: " + trimmedTranscript
    } else {
        userMessage = trimmedTranscript
    // Transcribe any audio media refs before passing to the agent.
    userMessage := msg.Content
    if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
        logger.DebugCF("agent", "Checking media for transcription", map[string]any{
            "media_count": len(msg.Media),
        })
        for _, ref := range msg.Media {
            localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
            if err != nil {
New behavior in processMessage mutates the user message based on media attachments (resolve -> IsAudio -> Transcribe -> placeholder replacement). pkg/agent/loop.go already has extensive unit tests, but this new path isn't covered; please add tests that cover (1) audio attachment with [voice]/[audio] placeholder, (2) audio-only message, and (3) non-audio media ensuring no transcription attempt.
    if err = w.WriteField("response_format", "json"); err != nil {
        return "", fmt.Errorf("transcribe: write response_format field: %w", err)
    }
    w.Close()
multipart.Writer.Close() can return an error; currently the result of w.Close() is ignored, which may lead to sending an incomplete/malformed multipart body and confusing API errors. Capture and return the Close error (as done in pkg/voice/transcriber.go).
Suggested change:

    - w.Close()
    + if err = w.Close(); err != nil {
    +     return "", fmt.Errorf("transcribe: close multipart writer: %w", err)
    + }
    respBytes, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", fmt.Errorf("transcribe: read response: %w", err)
    }

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("transcribe: groq returned %d: %s", resp.StatusCode, string(respBytes))
    }
io.ReadAll(resp.Body) reads the entire response into memory and the error path includes the full body in the returned error string. Consider bounding reads (e.g., with io.LimitReader) and truncating/redacting the body included in errors to avoid large allocations and accidental logging of unexpected content.
Transcribe inbound voice/audio messages before passing to the LLM. Reuses existing pkg/voice.GroqTranscriber (no duplicate implementation).
- Add TranscribeConfig to ToolsConfig (enabled, api_key, model)
- Wire transcriber into AgentLoop.processMessage
- Use inferMediaType() to detect audio files (reuses existing helper)
- Supports env vars: PICOCLAW_TRANSCRIBE_ENABLED, PICOCLAW_TRANSCRIBE_API_KEY
- Zero overhead when disabled (transcriber is nil)
Force-pushed: ae304f2 to 54887ca
virat-mankali left a comment
Updated the code to reuse the Groq client that was already present, and made it work properly for Telegram inbound audio notes, so users can speak and it is handled smoothly.
@copilot review the pr

Please fix conflicts
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
pkg/agent/loop.go:1755
- There are extra blank lines at EOF. Running gofmt (or removing the trailing empty lines) will keep the file clean and avoid style-only diffs/noise in future changes.
    if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
        for _, ref := range msg.Media {
            localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
            if err != nil {
Errors from ResolveWithMeta are swallowed silently (continue with no log/metric). This makes it very hard to diagnose why a voice note wasn't transcribed (e.g., non-media:// refs, expired store entries, missing files). Please log at least a warning with the ref and error (ideally consistent with transcribeAudioInMessage).
Suggested change:

      if err != nil {
    +     logger.WarnCF("agent", "Failed to resolve media ref", map[string]any{
    +         "ref":   ref,
    +         "error": err.Error(),
    +     })
    // Handle audio transcription
    if mediaType == "audio" {
        logger.InfoCF("agent", "Transcribing audio", map[string]any{
            "ref": ref, "path": localPath, "filename": meta.Filename,
This log entry includes the resolved local filesystem path for the media ("path": localPath). File paths can leak host details into logs and may contain user-identifying information depending on storage location. Consider omitting path (or logging only a basename / media ref) and keep filename/ref for correlation.
| "ref": ref, "path": localPath, "filename": meta.Filename, | |
| "ref": ref, "filename": meta.Filename, |
    // openaiMessage is the wire-format message for OpenAI-compatible APIs.
    // It mirrors protocoltypes.Message but omits SystemParts, which is an
    // internal field that would be unknown to third-party endpoints.
    // Content can be either a string or an array of content blocks for multimodal messages.
The new comment claims Content may be either a string or an array of multimodal content blocks, but openaiMessage.Content is still typed as string. This is misleading for future maintainers and API consumers; either update the field type/JSON marshaling to support both shapes or adjust the comment to match the actual wire format used here.
Suggested change (replaces the comment quoted above):

    // openaiMessage is the wire-format message for OpenAI-compatible APIs for
    // simple text-only messages. It mirrors protocoltypes.Message but omits
    // SystemParts, which is an internal field that would be unknown to
    // third-party endpoints. For messages that include media, serializeMessages
    // constructs a map[string]any where "content" is an array of content blocks,
    // instead of using this struct.
    @@ -667,6 +673,7 @@ type ToolsConfig struct {
        Exec         ExecConfig         `json:"exec"`
        Skills       SkillsToolsConfig  `json:"skills"`
        MediaCleanup MediaCleanupConfig `json:"media_cleanup"`
    +   Transcribe   TranscribeConfig   `json:"transcribe"`
TranscribeConfig is added under ToolsConfig, but it is not referenced anywhere in the codebase (no reads of cfg.Tools.Transcribe.*). As a result, the documented tools.transcribe config/env vars won't actually enable/disable or configure transcription. Either wire cfg.Tools.Transcribe into transcriber detection/initialization (and honor Enabled/Model/APIKey) or remove this config block to avoid a dead configuration surface.
    userMessage := msg.Content

    if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
        for _, ref := range msg.Media {
            localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
            if err != nil {
                continue
            }

            mediaType := inferMediaType(meta.Filename, meta.ContentType)

            // Handle audio transcription
            if mediaType == "audio" {
                logger.InfoCF("agent", "Transcribing audio", map[string]any{
                    "ref": ref, "path": localPath, "filename": meta.Filename,
                })
                result, err := al.transcriber.Transcribe(ctx, localPath)
                if err != nil {
                    logger.WarnCF("agent", "Audio transcription failed", map[string]any{
                        "ref": ref, "error": err.Error(),
                    })
                    continue
                }
                logger.InfoCF("agent", "Transcribed audio", map[string]any{
                    "ref": ref, "length": len(result.Text),
                })
                // Replace the [voice]/[audio] placeholder with the actual transcript
                userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
                userMessage = strings.TrimSpace(userMessage)
                if userMessage != "" {
                    userMessage = userMessage + "\n\n[Voice transcript]: " + result.Text
                } else {
                    userMessage = result.Text
                }
            }
        }
    }
processMessage already calls al.transcribeAudioInMessage(ctx, msg) earlier, which resolves/transcribes audio media and updates msg.Content. The additional transcription loop introduced here will re-resolve and re-transcribe the same audio a second time, increasing latency/cost and potentially duplicating transcript text. Consider removing this block and relying on transcribeAudioInMessage, or refactor to a single transcription pass that both replaces placeholders and builds the final user message.
Suggested change (removes the duplicated transcription loop quoted above and relies on transcribeAudioInMessage):

    // At this point, processMessage has already invoked transcribeAudioInMessage,
    // which resolves/transcribes audio media and updates msg.Content as needed.
    userMessage := msg.Content
… fix comment, trim EOF
@virat-mankali Hi! This PR has had no activity for over 2 weeks, so I'm closing it for now to keep things organized. Feel free to reopen anytime if you'd like to continue.
Summary
Adds inbound audio transcription support. When a user sends a voice note or audio file (via Telegram or any other channel), the audio is automatically transcribed using Groq's Whisper API before being passed to the LLM — so the agent can understand and respond to spoken messages naturally.
Changes
- pkg/transcribe/groq.go: new package with a minimal Groq Whisper client using only stdlib (net/http, mime/multipart). No new dependencies.
- pkg/config/config.go: adds TranscribeConfig struct and Transcribe field to ToolsConfig
- pkg/agent/loop.go: wires transcriber into processMessage, resolves audio media refs, calls Groq, and injects the transcript into the user message before the LLM sees it

Configuration
Add to config.json under tools, or via env vars:
- PICOCLAW_TRANSCRIBE_ENABLED=true
- PICOCLAW_TRANSCRIBE_API_KEY=gsk_...
- PICOCLAW_TRANSCRIBE_MODEL=whisper-large-v3-turbo

Notes
- Supported formats: .ogg, .mp3, .m4a, .wav, .webm, .opus, .flac
- Replaces [voice]/[audio] placeholders with the actual transcript in the user message
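Based on the enabled/api_key/model fields and the `json:"transcribe"` tag shown in the diff, the tools.transcribe block in config.json would plausibly look like this (a sketch; key names inferred, not copied from the repo):

```json
{
  "tools": {
    "transcribe": {
      "enabled": true,
      "api_key": "gsk_...",
      "model": "whisper-large-v3-turbo"
    }
  }
}
```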