
feat: add audio transcription via Groq Whisper API #977

Closed
virat-mankali wants to merge 4 commits into sipeed:main from virat-mankali:feat/audio-transcription

Conversation

@virat-mankali

Summary

Adds inbound audio transcription support. When a user sends a voice note or audio file (via Telegram or any other channel), the audio is automatically transcribed using Groq's Whisper API before being passed to the LLM — so the agent can understand and respond to spoken messages naturally.

Changes

  • pkg/transcribe/groq.go — new package with a minimal Groq Whisper client using only stdlib (net/http, mime/multipart). No new dependencies.
  • pkg/config/config.go — adds TranscribeConfig struct and Transcribe field to ToolsConfig
  • pkg/agent/loop.go — wires transcriber into processMessage, resolves audio media refs, calls Groq, and injects the transcript into the user message before the LLM sees it

Configuration

Add to config.json under tools:

"transcribe": {
  "enabled": true,
  "api_key": "gsk_...",
  "model": "whisper-large-v3-turbo"
}

Or via env vars:

  • PICOCLAW_TRANSCRIBE_ENABLED=true
  • PICOCLAW_TRANSCRIBE_API_KEY=gsk_...
  • PICOCLAW_TRANSCRIBE_MODEL=whisper-large-v3-turbo
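
As a rough illustration of how the JSON block and the environment variables above could fit together, here is a minimal sketch. The struct fields follow the documented keys, but the loader function, its name, and the "env overrides file" precedence are assumptions for illustration, not the project's actual config code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

// TranscribeConfig mirrors the documented "transcribe" block.
// The Go field names and tags are assumed from the JSON keys.
type TranscribeConfig struct {
	Enabled bool   `json:"enabled"`
	APIKey  string `json:"api_key"`
	Model   string `json:"model"`
}

// loadTranscribeConfig (hypothetical helper) parses the JSON block and then
// lets the documented environment variables override individual fields.
func loadTranscribeConfig(raw []byte, getenv func(string) string) (TranscribeConfig, error) {
	var cfg TranscribeConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, fmt.Errorf("parse transcribe config: %w", err)
	}
	if v := getenv("PICOCLAW_TRANSCRIBE_ENABLED"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("PICOCLAW_TRANSCRIBE_ENABLED: %w", err)
		}
		cfg.Enabled = enabled
	}
	if v := getenv("PICOCLAW_TRANSCRIBE_API_KEY"); v != "" {
		cfg.APIKey = v
	}
	if v := getenv("PICOCLAW_TRANSCRIBE_MODEL"); v != "" {
		cfg.Model = v
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"enabled": true, "api_key": "gsk_...", "model": "whisper-large-v3-turbo"}`)
	cfg, err := loadTranscribeConfig(raw, os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("enabled=%v model=%s\n", cfg.Enabled, cfg.Model)
}
```

Passing `getenv` as a parameter keeps the helper easy to exercise in tests without touching the process environment.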

Notes

  • Zero overhead when disabled (transcriber is nil, no allocations)
  • ~1KB idle RAM, ~200KB peak per transcription call (freed immediately after)
  • Supports all audio formats Groq accepts: .ogg, .mp3, .m4a, .wav, .webm, .opus, .flac
  • Replaces [voice]/[audio] placeholders with the actual transcript in the user message

Copilot AI review requested due to automatic review settings March 2, 2026 05:01
Contributor

Copilot AI left a comment


Pull request overview

Adds inbound audio transcription support so voice notes/audio files can be transcribed (via Groq Whisper) and the resulting text is injected into the user message before the LLM processes it.

Changes:

  • Introduces a new pkg/transcribe Groq Whisper client using only the Go stdlib.
  • Extends configuration with tools.transcribe (enabled/api_key/model).
  • Wires transcription into AgentLoop.processMessage by resolving inbound media refs and appending transcripts into the user message.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
pkg/transcribe/groq.go New Groq Whisper transcription client (multipart upload + JSON response parsing).
pkg/config/config.go Adds TranscribeConfig and exposes it under ToolsConfig.
pkg/agent/loop.go Conditionally enables transcriber and injects transcripts into inbound user messages.


Comment thread pkg/transcribe/groq.go Outdated
Comment on lines +68 to +70
var body bytes.Buffer
w := multipart.NewWriter(&body)

Copilot AI Mar 2, 2026

This implementation buffers the entire audio file into a bytes.Buffer before sending the request, so peak memory usage scales with the audio file size (plus multipart overhead). If large uploads are possible, consider streaming the multipart body (e.g., io.Pipe) to keep memory bounded.

Copilot uses AI. Check for mistakes.
Comment thread pkg/transcribe/groq.go Outdated
Comment on lines +23 to +44
// Transcriber transcribes audio files using the Groq Whisper API.
type Transcriber struct {
    apiKey     string
    model      string
    httpClient *http.Client
}

// New creates a Transcriber with the given Groq API key and optional model override.
// If model is empty, whisper-large-v3-turbo is used.
func New(apiKey, model string) *Transcriber {
    m := model
    if m == "" {
        m = defaultModel
    }
    return &Transcriber{
        apiKey: apiKey,
        model:  m,
        httpClient: &http.Client{
            Timeout: 60 * time.Second,
        },
    }
}

Copilot AI Mar 2, 2026

There is already a Groq Whisper client in pkg/voice/transcriber.go (GroqTranscriber) that builds the same multipart request. Having two separate implementations increases maintenance burden (bug fixes, API changes, timeouts, etc. need to be duplicated); consider reusing/refactoring to a single shared implementation or clearly separating responsibilities (e.g., pkg/transcribe wraps pkg/voice).

Comment thread pkg/agent/loop.go Outdated
Comment on lines +467 to +471
} else if al.transcriber == nil && len(msg.Media) > 0 {
    logger.WarnCF("agent", "Transcriber not configured, skipping media", map[string]any{
        "media_count": len(msg.Media),
    })
}

Copilot AI Mar 2, 2026

This branch will log a WARN for any inbound message with media when transcription is disabled (al.transcriber == nil), including images and other non-audio attachments. That will be noisy in normal operation; gate this warning on "transcribe enabled but misconfigured" (e.g., al.cfg.Tools.Transcribe.Enabled) and/or only warn when an audio attachment is detected.
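
One way to express that gating is sketched below. Everything here is hypothetical: the attachment type, the simplified isAudio classifier (the extension list is taken from the PR notes), and the function name are stand-ins, not the project's helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// attachment is a hypothetical stand-in for resolved media metadata.
type attachment struct {
	Filename    string
	ContentType string
}

// isAudio is a simplified, assumed classifier based on content type and the
// file extensions listed in the PR notes.
func isAudio(a attachment) bool {
	if strings.HasPrefix(a.ContentType, "audio/") {
		return true
	}
	name := strings.ToLower(a.Filename)
	for _, ext := range []string{".ogg", ".mp3", ".m4a", ".wav", ".webm", ".opus", ".flac"} {
		if strings.HasSuffix(name, ext) {
			return true
		}
	}
	return false
}

// shouldWarnMissingTranscriber reports whether a WARN is justified:
// transcription is enabled in config, no transcriber was built, and at
// least one audio attachment actually arrived.
func shouldWarnMissingTranscriber(enabledInConfig, haveTranscriber bool, media []attachment) bool {
	if !enabledInConfig || haveTranscriber {
		return false
	}
	for _, a := range media {
		if isAudio(a) {
			return true
		}
	}
	return false
}

func main() {
	photo := []attachment{{Filename: "cat.png", ContentType: "image/png"}}
	voice := []attachment{{Filename: "note.ogg", ContentType: "audio/ogg"}}
	fmt.Println(shouldWarnMissingTranscriber(false, false, voice)) // disabled: stay quiet
	fmt.Println(shouldWarnMissingTranscriber(true, false, photo))  // no audio: stay quiet
	fmt.Println(shouldWarnMissingTranscriber(true, false, voice))  // misconfigured and audio present
}
```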

Comment thread pkg/agent/loop.go Outdated
    continue
}
logger.InfoCF("agent", "Transcribing audio", map[string]any{
    "ref": ref, "path": localPath, "filename": meta.Filename,

Copilot AI Mar 2, 2026

Logging the resolved local file path at INFO level may leak host filesystem details into logs. Consider omitting the path field (the media ref + filename is usually sufficient) or moving path logging to DEBUG only.

Suggested change
- "ref": ref, "path": localPath, "filename": meta.Filename,
+ "ref": ref, "filename": meta.Filename,

Comment thread pkg/agent/loop.go Outdated
Comment on lines +458 to +464
// Replace the [voice]/[audio] placeholder with the actual transcript
userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
userMessage = strings.TrimSpace(userMessage)
if userMessage != "" {
    userMessage = userMessage + "\n\n[Voice transcript]: " + transcript
} else {
    userMessage = transcript

Copilot AI Mar 2, 2026

If the transcription result is empty/whitespace (e.g., silence or API returns empty text), the code will still strip placeholders and append an empty "[Voice transcript]:" block. Handle empty transcripts by skipping injection/logging a debug message so userMessage isn't degraded.

Suggested change
- // Replace the [voice]/[audio] placeholder with the actual transcript
- userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
- userMessage = strings.TrimSpace(userMessage)
- if userMessage != "" {
-     userMessage = userMessage + "\n\n[Voice transcript]: " + transcript
- } else {
-     userMessage = transcript
+ trimmedTranscript := strings.TrimSpace(transcript)
+ if trimmedTranscript == "" {
+     logger.DebugCF("agent", "Empty audio transcription result, skipping transcript injection", map[string]any{
+         "ref": ref,
+     })
+     continue
+ }
+ // Replace the [voice]/[audio] placeholder with the actual transcript
+ userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
+ userMessage = strings.TrimSpace(userMessage)
+ if userMessage != "" {
+     userMessage = userMessage + "\n\n[Voice transcript]: " + trimmedTranscript
+ } else {
+     userMessage = trimmedTranscript
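
The same logic can be pulled into a small pure function, which makes the empty-transcript case easy to reason about in isolation. This is only a sketch; injectTranscript is a hypothetical name and does not exist in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// injectTranscript (hypothetical helper) strips the [voice]/[audio]
// placeholders and appends the transcript. When the transcript is empty or
// whitespace-only it returns the message untouched and reports false.
func injectTranscript(userMessage, transcript string) (string, bool) {
	trimmed := strings.TrimSpace(transcript)
	if trimmed == "" {
		return userMessage, false
	}
	cleaned := strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
	cleaned = strings.TrimSpace(cleaned)
	if cleaned == "" {
		return trimmed, true
	}
	return cleaned + "\n\n[Voice transcript]: " + trimmed, true
}

func main() {
	for _, tc := range []struct{ msg, transcript string }{
		{"[voice]", "turn on the lights"},
		{"see attached [audio]", "  hello  "},
		{"[voice]", "   "}, // silence: message is left as-is
	} {
		out, changed := injectTranscript(tc.msg, tc.transcript)
		fmt.Printf("%q changed=%v\n", out, changed)
	}
}
```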

Comment thread pkg/agent/loop.go Outdated
Comment on lines +425 to +433
// Transcribe any audio media refs before passing to the agent.
userMessage := msg.Content
if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
    logger.DebugCF("agent", "Checking media for transcription", map[string]any{
        "media_count": len(msg.Media),
    })
    for _, ref := range msg.Media {
        localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
        if err != nil {

Copilot AI Mar 2, 2026

New behavior in processMessage mutates the user message based on media attachments (resolve -> IsAudio -> Transcribe -> placeholder replacement). pkg/agent/loop.go already has extensive unit tests, but this new path isn't covered; please add tests that cover (1) audio attachment with [voice]/[audio] placeholder, (2) audio-only message, and (3) non-audio media ensuring no transcription attempt.

Comment thread pkg/transcribe/groq.go Outdated
if err = w.WriteField("response_format", "json"); err != nil {
    return "", fmt.Errorf("transcribe: write response_format field: %w", err)
}
w.Close()

Copilot AI Mar 2, 2026

multipart.Writer.Close() can return an error; currently the result of w.Close() is ignored, which may lead to sending an incomplete/malformed multipart body and confusing API errors. Capture and return the Close error (as done in pkg/voice/transcriber.go).

Suggested change
- w.Close()
+ if err = w.Close(); err != nil {
+     return "", fmt.Errorf("transcribe: close multipart writer: %w", err)
+ }

Comment thread pkg/transcribe/groq.go Outdated
Comment on lines +99 to +106
respBytes, err := io.ReadAll(resp.Body)
if err != nil {
    return "", fmt.Errorf("transcribe: read response: %w", err)
}

if resp.StatusCode != http.StatusOK {
    return "", fmt.Errorf("transcribe: groq returned %d: %s", resp.StatusCode, string(respBytes))
}

Copilot AI Mar 2, 2026

io.ReadAll(resp.Body) reads the entire response into memory and the error path includes the full body in the returned error string. Consider bounding reads (e.g., with io.LimitReader) and truncating/redacting the body included in errors to avoid large allocations and accidental logging of unexpected content.

Transcribe inbound voice/audio messages before passing to the LLM.
Reuses existing pkg/voice.GroqTranscriber (no duplicate implementation).

- Add TranscribeConfig to ToolsConfig (enabled, api_key, model)
- Wire transcriber into AgentLoop.processMessage
- Use inferMediaType() to detect audio files (reuses existing helper)
- Supports env vars: PICOCLAW_TRANSCRIBE_ENABLED, PICOCLAW_TRANSCRIBE_API_KEY
- Zero overhead when disabled (transcriber is nil)
@virat-mankali virat-mankali force-pushed the feat/audio-transcription branch from ae304f2 to 54887ca Compare March 2, 2026 05:26
Author

@virat-mankali virat-mankali left a comment


Updated the code to reuse the Groq transcriber that was already present, and made it work properly for inbound Telegram audio notes, so users can speak and the agent handles it smoothly.

@virat-mankali
Author

@copilot review the pr

@CLAassistant

CLAassistant commented Mar 5, 2026

CLA assistant check
All committers have signed the CLA.

@mengzhuo
Collaborator

mengzhuo commented Mar 6, 2026

Please fix conflicts

Copilot AI review requested due to automatic review settings March 8, 2026 15:52
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pkg/agent/loop.go:1755

  • There are extra blank lines at EOF. Running gofmt (or removing the trailing empty lines) will keep the file clean and avoid style-only diffs/noise in future changes.


Comment thread pkg/agent/loop.go Outdated
if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
    for _, ref := range msg.Media {
        localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
        if err != nil {

Copilot AI Mar 8, 2026

Errors from ResolveWithMeta are swallowed silently (continue with no log/metric). This makes it very hard to diagnose why a voice note wasn't transcribed (e.g., non-media:// refs, expired store entries, missing files). Please log at least a warning with the ref and error (ideally consistent with transcribeAudioInMessage).

Suggested change
- if err != nil {
+ if err != nil {
+     logger.WarnCF("agent", "Failed to resolve media ref", map[string]any{
+         "ref": ref,
+         "error": err.Error(),
+     })

Comment thread pkg/agent/loop.go Outdated
// Handle audio transcription
if mediaType == "audio" {
    logger.InfoCF("agent", "Transcribing audio", map[string]any{
        "ref": ref, "path": localPath, "filename": meta.Filename,

Copilot AI Mar 8, 2026

This log entry includes the resolved local filesystem path for the media ("path": localPath). File paths can leak host details into logs and may contain user-identifying information depending on storage location. Consider omitting path (or logging only a basename / media ref) and keep filename/ref for correlation.

Suggested change
- "ref": ref, "path": localPath, "filename": meta.Filename,
+ "ref": ref, "filename": meta.Filename,

Comment thread pkg/providers/openai_compat/provider.go Outdated
Comment on lines +364 to +367
// openaiMessage is the wire-format message for OpenAI-compatible APIs.
// It mirrors protocoltypes.Message but omits SystemParts, which is an
// internal field that would be unknown to third-party endpoints.
// Content can be either a string or an array of content blocks for multimodal messages.

Copilot AI Mar 8, 2026

The new comment claims Content may be either a string or an array of multimodal content blocks, but openaiMessage.Content is still typed as string. This is misleading for future maintainers and API consumers; either update the field type/JSON marshaling to support both shapes or adjust the comment to match the actual wire format used here.

Suggested change
- // openaiMessage is the wire-format message for OpenAI-compatible APIs.
- // It mirrors protocoltypes.Message but omits SystemParts, which is an
- // internal field that would be unknown to third-party endpoints.
- // Content can be either a string or an array of content blocks for multimodal messages.
+ // openaiMessage is the wire-format message for OpenAI-compatible APIs for
+ // simple text-only messages. It mirrors protocoltypes.Message but omits
+ // SystemParts, which is an internal field that would be unknown to
+ // third-party endpoints. For messages that include media, serializeMessages
+ // constructs a map[string]any where "content" is an array of content blocks,
+ // instead of using this struct.

Comment thread pkg/config/config.go Outdated
Comment on lines +662 to +676
@@ -667,6 +673,7 @@ type ToolsConfig struct {
    Exec         ExecConfig         `json:"exec"`
    Skills       SkillsToolsConfig  `json:"skills"`
    MediaCleanup MediaCleanupConfig `json:"media_cleanup"`
    Transcribe   TranscribeConfig   `json:"transcribe"`

Copilot AI Mar 8, 2026

TranscribeConfig is added under ToolsConfig, but it is not referenced anywhere in the codebase (no reads of cfg.Tools.Transcribe.*). As a result, the documented tools.transcribe config/env vars won't actually enable/disable or configure transcription. Either wire cfg.Tools.Transcribe into transcriber detection/initialization (and honor Enabled/Model/APIKey) or remove this config block to avoid a dead configuration surface.

Comment thread pkg/agent/loop.go Outdated
Comment on lines +626 to +663
userMessage := msg.Content

if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
    for _, ref := range msg.Media {
        localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
        if err != nil {
            continue
        }

        mediaType := inferMediaType(meta.Filename, meta.ContentType)

        // Handle audio transcription
        if mediaType == "audio" {
            logger.InfoCF("agent", "Transcribing audio", map[string]any{
                "ref": ref, "path": localPath, "filename": meta.Filename,
            })
            result, err := al.transcriber.Transcribe(ctx, localPath)
            if err != nil {
                logger.WarnCF("agent", "Audio transcription failed", map[string]any{
                    "ref": ref, "error": err.Error(),
                })
                continue
            }
            logger.InfoCF("agent", "Transcribed audio", map[string]any{
                "ref": ref, "length": len(result.Text),
            })
            // Replace the [voice]/[audio] placeholder with the actual transcript
            userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
            userMessage = strings.TrimSpace(userMessage)
            if userMessage != "" {
                userMessage = userMessage + "\n\n[Voice transcript]: " + result.Text
            } else {
                userMessage = result.Text
            }
        }
    }
}

Copilot AI Mar 8, 2026

processMessage already calls al.transcribeAudioInMessage(ctx, msg) earlier, which resolves/transcribes audio media and updates msg.Content. The additional transcription loop introduced here will re-resolve and re-transcribe the same audio a second time, increasing latency/cost and potentially duplicating transcript text. Consider removing this block and relying on transcribeAudioInMessage, or refactor to a single transcription pass that both replaces placeholders and builds the final user message.

Suggested change
- userMessage := msg.Content
- if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
-     for _, ref := range msg.Media {
-         localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
-         if err != nil {
-             continue
-         }
-         mediaType := inferMediaType(meta.Filename, meta.ContentType)
-         // Handle audio transcription
-         if mediaType == "audio" {
-             logger.InfoCF("agent", "Transcribing audio", map[string]any{
-                 "ref": ref, "path": localPath, "filename": meta.Filename,
-             })
-             result, err := al.transcriber.Transcribe(ctx, localPath)
-             if err != nil {
-                 logger.WarnCF("agent", "Audio transcription failed", map[string]any{
-                     "ref": ref, "error": err.Error(),
-                 })
-                 continue
-             }
-             logger.InfoCF("agent", "Transcribed audio", map[string]any{
-                 "ref": ref, "length": len(result.Text),
-             })
-             // Replace the [voice]/[audio] placeholder with the actual transcript
-             userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
-             userMessage = strings.TrimSpace(userMessage)
-             if userMessage != "" {
-                 userMessage = userMessage + "\n\n[Voice transcript]: " + result.Text
-             } else {
-                 userMessage = result.Text
-             }
-         }
-     }
- }
+ // At this point, processMessage has already invoked transcribeAudioInMessage,
+ // which resolves/transcribes audio media and updates msg.Content as needed.
+ userMessage := msg.Content

@sipeed-bot

sipeed-bot bot commented Mar 25, 2026

@virat-mankali Hi! This PR has had no activity for over 2 weeks, so I'm closing it for now to keep things organized. Feel free to reopen anytime if you'd like to continue.

@sipeed-bot sipeed-bot bot closed this Mar 25, 2026

4 participants