feat: add audio transcription via Groq Whisper API #977
virat-mankali wants to merge 4 commits into sipeed:main from
Conversation
Pull request overview
Adds inbound audio transcription support so voice notes/audio files can be transcribed (via Groq Whisper) and the resulting text is injected into the user message before the LLM processes it.
Changes:
- Introduces a new pkg/transcribe Groq Whisper client using only the Go stdlib.
- Extends configuration with tools.transcribe (enabled/api_key/model).
- Wires transcription into AgentLoop.processMessage by resolving inbound media refs and appending transcripts into the user message.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/transcribe/groq.go | New Groq Whisper transcription client (multipart upload + JSON response parsing). |
| pkg/config/config.go | Adds TranscribeConfig and exposes it under ToolsConfig. |
| pkg/agent/loop.go | Conditionally enables transcriber and injects transcripts into inbound user messages. |
    var body bytes.Buffer
    w := multipart.NewWriter(&body)
This implementation buffers the entire audio file into a bytes.Buffer before sending the request, so peak memory usage scales with the audio file size (plus multipart overhead). If large uploads are possible, consider streaming the multipart body (e.g., io.Pipe) to keep memory bounded.
    // Transcriber transcribes audio files using the Groq Whisper API.
    type Transcriber struct {
        apiKey     string
        model      string
        httpClient *http.Client
    }

    // New creates a Transcriber with the given Groq API key and optional model override.
    // If model is empty, whisper-large-v3-turbo is used.
    func New(apiKey, model string) *Transcriber {
        m := model
        if m == "" {
            m = defaultModel
        }
        return &Transcriber{
            apiKey: apiKey,
            model:  m,
            httpClient: &http.Client{
                Timeout: 60 * time.Second,
            },
        }
    }
There is already a Groq Whisper client in pkg/voice/transcriber.go (GroqTranscriber) that builds the same multipart request. Having two separate implementations increases maintenance burden (bug fixes, API changes, timeouts, etc. need to be duplicated); consider reusing/refactoring to a single shared implementation or clearly separating responsibilities (e.g., pkg/transcribe wraps pkg/voice).
    } else if al.transcriber == nil && len(msg.Media) > 0 {
        logger.WarnCF("agent", "Transcriber not configured, skipping media", map[string]any{
            "media_count": len(msg.Media),
        })
    }
This branch will log a WARN for any inbound message with media when transcription is disabled (al.transcriber == nil), including images and other non-audio attachments. That will be noisy in normal operation; gate this warning on "transcribe enabled but misconfigured" (e.g., al.cfg.Tools.Transcribe.Enabled) and/or only warn when an audio attachment is detected.
        continue
    }
    logger.InfoCF("agent", "Transcribing audio", map[string]any{
        "ref": ref, "path": localPath, "filename": meta.Filename,
Logging the resolved local file path at INFO level may leak host filesystem details into logs. Consider omitting the path field (the media ref + filename is usually sufficient) or moving path logging to DEBUG only.
| "ref": ref, "path": localPath, "filename": meta.Filename, | |
| "ref": ref, "filename": meta.Filename, |
    // Replace the [voice]/[audio] placeholder with the actual transcript
    userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
    userMessage = strings.TrimSpace(userMessage)
    if userMessage != "" {
        userMessage = userMessage + "\n\n[Voice transcript]: " + transcript
    } else {
        userMessage = transcript
If the transcription result is empty/whitespace (e.g., silence or API returns empty text), the code will still strip placeholders and append an empty "[Voice transcript]:" block. Handle empty transcripts by skipping injection/logging a debug message so userMessage isn't degraded.
Suggested change (replaces the block quoted above; skips injection when the transcript is empty):

    trimmedTranscript := strings.TrimSpace(transcript)
    if trimmedTranscript == "" {
        logger.DebugCF("agent", "Empty audio transcription result, skipping transcript injection", map[string]any{
            "ref": ref,
        })
        continue
    }
    // Replace the [voice]/[audio] placeholder with the actual transcript
    userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
    userMessage = strings.TrimSpace(userMessage)
    if userMessage != "" {
        userMessage = userMessage + "\n\n[Voice transcript]: " + trimmedTranscript
    } else {
        userMessage = trimmedTranscript
    // Transcribe any audio media refs before passing to the agent.
    userMessage := msg.Content
    if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
        logger.DebugCF("agent", "Checking media for transcription", map[string]any{
            "media_count": len(msg.Media),
        })
        for _, ref := range msg.Media {
            localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
            if err != nil {
New behavior in processMessage mutates the user message based on media attachments (resolve -> IsAudio -> Transcribe -> placeholder replacement). pkg/agent/loop.go already has extensive unit tests, but this new path isn't covered; please add tests that cover (1) audio attachment with [voice]/[audio] placeholder, (2) audio-only message, and (3) non-audio media ensuring no transcription attempt.
    if err = w.WriteField("response_format", "json"); err != nil {
        return "", fmt.Errorf("transcribe: write response_format field: %w", err)
    }
    w.Close()
multipart.Writer.Close() can return an error; currently the result of w.Close() is ignored, which may lead to sending an incomplete/malformed multipart body and confusing API errors. Capture and return the Close error (as done in pkg/voice/transcriber.go).
Suggested change:

    - w.Close()
    + if err = w.Close(); err != nil {
    +     return "", fmt.Errorf("transcribe: close multipart writer: %w", err)
    + }
    respBytes, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", fmt.Errorf("transcribe: read response: %w", err)
    }

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("transcribe: groq returned %d: %s", resp.StatusCode, string(respBytes))
    }
io.ReadAll(resp.Body) reads the entire response into memory and the error path includes the full body in the returned error string. Consider bounding reads (e.g., with io.LimitReader) and truncating/redacting the body included in errors to avoid large allocations and accidental logging of unexpected content.
Transcribe inbound voice/audio messages before passing to the LLM. Reuses existing pkg/voice.GroqTranscriber (no duplicate implementation).
- Add TranscribeConfig to ToolsConfig (enabled, api_key, model)
- Wire transcriber into AgentLoop.processMessage
- Use inferMediaType() to detect audio files (reuses existing helper)
- Supports env vars: PICOCLAW_TRANSCRIBE_ENABLED, PICOCLAW_TRANSCRIBE_API_KEY
- Zero overhead when disabled (transcriber is nil)
Force-pushed: ae304f2 to 54887ca
virat-mankali left a comment
Updated the code to reuse the Groq client that was already present, and made it work properly for Telegram inbound audio notes, so users can speak and it is handled smoothly.
@copilot review the pr

Please fix conflicts
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
pkg/agent/loop.go:1755
- There are extra blank lines at EOF. Running gofmt (or removing the trailing empty lines) will keep the file clean and avoid style-only diffs/noise in future changes.
    if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
        for _, ref := range msg.Media {
            localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
            if err != nil {
Errors from ResolveWithMeta are swallowed silently (continue with no log/metric). This makes it very hard to diagnose why a voice note wasn't transcribed (e.g., non-media:// refs, expired store entries, missing files). Please log at least a warning with the ref and error (ideally consistent with transcribeAudioInMessage).
Suggested change:

      if err != nil {
    +     logger.WarnCF("agent", "Failed to resolve media ref", map[string]any{
    +         "ref":   ref,
    +         "error": err.Error(),
    +     })
    // Handle audio transcription
    if mediaType == "audio" {
        logger.InfoCF("agent", "Transcribing audio", map[string]any{
            "ref": ref, "path": localPath, "filename": meta.Filename,
This log entry includes the resolved local filesystem path for the media ("path": localPath). File paths can leak host details into logs and may contain user-identifying information depending on storage location. Consider omitting path (or logging only a basename / media ref) and keep filename/ref for correlation.
| "ref": ref, "path": localPath, "filename": meta.Filename, | |
| "ref": ref, "filename": meta.Filename, |
    // openaiMessage is the wire-format message for OpenAI-compatible APIs.
    // It mirrors protocoltypes.Message but omits SystemParts, which is an
    // internal field that would be unknown to third-party endpoints.
    // Content can be either a string or an array of content blocks for multimodal messages.
The new comment claims Content may be either a string or an array of multimodal content blocks, but openaiMessage.Content is still typed as string. This is misleading for future maintainers and API consumers; either update the field type/JSON marshaling to support both shapes or adjust the comment to match the actual wire format used here.
Suggested change (replaces the comment quoted above):

    // openaiMessage is the wire-format message for OpenAI-compatible APIs for
    // simple text-only messages. It mirrors protocoltypes.Message but omits
    // SystemParts, which is an internal field that would be unknown to
    // third-party endpoints. For messages that include media, serializeMessages
    // constructs a map[string]any where "content" is an array of content blocks,
    // instead of using this struct.
    @@ -667,6 +673,7 @@ type ToolsConfig struct {
        Exec         ExecConfig         `json:"exec"`
        Skills       SkillsToolsConfig  `json:"skills"`
        MediaCleanup MediaCleanupConfig `json:"media_cleanup"`
    +   Transcribe   TranscribeConfig   `json:"transcribe"`
TranscribeConfig is added under ToolsConfig, but it is not referenced anywhere in the codebase (no reads of cfg.Tools.Transcribe.*). As a result, the documented tools.transcribe config/env vars won't actually enable/disable or configure transcription. Either wire cfg.Tools.Transcribe into transcriber detection/initialization (and honor Enabled/Model/APIKey) or remove this config block to avoid a dead configuration surface.
    userMessage := msg.Content

    if al.transcriber != nil && len(msg.Media) > 0 && al.mediaStore != nil {
        for _, ref := range msg.Media {
            localPath, meta, err := al.mediaStore.ResolveWithMeta(ref)
            if err != nil {
                continue
            }

            mediaType := inferMediaType(meta.Filename, meta.ContentType)

            // Handle audio transcription
            if mediaType == "audio" {
                logger.InfoCF("agent", "Transcribing audio", map[string]any{
                    "ref": ref, "path": localPath, "filename": meta.Filename,
                })
                result, err := al.transcriber.Transcribe(ctx, localPath)
                if err != nil {
                    logger.WarnCF("agent", "Audio transcription failed", map[string]any{
                        "ref": ref, "error": err.Error(),
                    })
                    continue
                }
                logger.InfoCF("agent", "Transcribed audio", map[string]any{
                    "ref": ref, "length": len(result.Text),
                })
                // Replace the [voice]/[audio] placeholder with the actual transcript
                userMessage = strings.NewReplacer("[voice]", "", "[audio]", "").Replace(userMessage)
                userMessage = strings.TrimSpace(userMessage)
                if userMessage != "" {
                    userMessage = userMessage + "\n\n[Voice transcript]: " + result.Text
                } else {
                    userMessage = result.Text
                }
            }
        }
    }
processMessage already calls al.transcribeAudioInMessage(ctx, msg) earlier, which resolves/transcribes audio media and updates msg.Content. The additional transcription loop introduced here will re-resolve and re-transcribe the same audio a second time, increasing latency/cost and potentially duplicating transcript text. Consider removing this block and relying on transcribeAudioInMessage, or refactor to a single transcription pass that both replaces placeholders and builds the final user message.
Suggested change (removes the duplicated transcription loop quoted above and relies on transcribeAudioInMessage):

    // At this point, processMessage has already invoked transcribeAudioInMessage,
    // which resolves/transcribes audio media and updates msg.Content as needed.
    userMessage := msg.Content
… fix comment, trim EOF
@virat-mankali Hi! This PR has had no activity for over 2 weeks, so I'm closing it for now to keep things organized. Feel free to reopen anytime if you'd like to continue.
Summary
Adds inbound audio transcription support. When a user sends a voice note or audio file (via Telegram or any other channel), the audio is automatically transcribed using Groq's Whisper API before being passed to the LLM — so the agent can understand and respond to spoken messages naturally.
Changes
- pkg/transcribe/groq.go: new package with a minimal Groq Whisper client using only stdlib (net/http, mime/multipart). No new dependencies.
- pkg/config/config.go: adds TranscribeConfig struct and Transcribe field to ToolsConfig
- pkg/agent/loop.go: wires transcriber into processMessage, resolves audio media refs, calls Groq, and injects the transcript into the user message before the LLM sees it

Configuration
Add to config.json under tools, or via env vars:
- PICOCLAW_TRANSCRIBE_ENABLED=true
- PICOCLAW_TRANSCRIBE_API_KEY=gsk_...
- PICOCLAW_TRANSCRIBE_MODEL=whisper-large-v3-turbo

Notes
- Supported formats: .ogg, .mp3, .m4a, .wav, .webm, .opus, .flac
- Replaces [voice]/[audio] placeholders with the actual transcript in the user message
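Based on the enabled/api_key/model fields and the `json:"transcribe"` tag shown in the diff, the tools.transcribe block in config.json would plausibly look like this (a sketch; key names inferred, not copied from the repo):

```json
{
  "tools": {
    "transcribe": {
      "enabled": true,
      "api_key": "gsk_...",
      "model": "whisper-large-v3-turbo"
    }
  }
}
```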