feat: add ElevenLabs Scribe STT transcriber and Telegram SendVoice support #1905
Pull request overview
This PR adds an ElevenLabs Scribe-based speech-to-text transcriber option and improves Telegram outbound media handling by sending OGG “voice note” media as Telegram voice bubbles.
Changes:
- Added `ElevenLabsTranscriber` (Scribe STT) and updated `DetectTranscriber` to prefer ElevenLabs when configured.
- Added Telegram outbound `"voice"` handling via `SendVoice`.
- Introduced filename-based `"voice"` media type inference for OGG/OGA files containing `"voice"` in the filename.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/voice/transcriber.go | Adds ElevenLabs STT transcriber and updates provider detection priority. |
| pkg/voice/transcriber_test.go | Adds interface compliance + ElevenLabs transcribe tests and provider priority tests. |
| pkg/config/config.go | Extends ProvidersConfig with ElevenLabs and updates IsEmpty(). |
| pkg/agent/loop.go | Adds "voice" inference in inferMediaType based on filename for OGG/OGA. |
| pkg/channels/telegram/telegram.go | Adds "voice" branch to send Telegram voice bubbles via SendVoice. |
    // Detect voice messages: OGG files with "voice" in the filename.
    // These are sent as Telegram voice bubbles rather than audio attachments.
    if strings.Contains(fn, "voice") && (strings.HasSuffix(fn, ".ogg") || strings.HasSuffix(fn, ".oga")) {
        return "voice"
    }
inferMediaType now returns "voice" for files named like "voice.ogg". This value is propagated into bus.MediaPart.Type for all channels (see where tool media parts are built), but only Telegram has been updated to handle "voice". Channels that map types explicitly (e.g., OneBot maps only "audio" -> "record", WeCom maps only "audio" -> "voice") will now treat these as generic files, which is a functional regression for non-Telegram channels. Consider keeping the inferred type as "audio" and doing Telegram-specific voice-bubble detection inside the Telegram channel, or ensure every channel that handles audio also treats "voice" as audio-equivalent.
Good catch — fixed. Moved voice-bubble detection out of inferMediaType (which now always returns "audio" for OGG files) and into the Telegram channel's Send method. Other channels are unaffected.
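The regression described in the review can be illustrated with a small sketch. The names `inferMediaType` and the OneBot `"audio"` to `"record"` mapping come from the thread; the function bodies, the `channelKind` helper, and the fallback to `"file"` are simplified assumptions, not the project's actual code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// inferMediaType is a simplified stand-in for the agent-side helper.
// After the fix, every OGG/OGA file is reported as generic "audio".
func inferMediaType(filename string) string {
	switch strings.ToLower(filepath.Ext(filename)) {
	case ".ogg", ".oga":
		return "audio"
	default:
		return "file"
	}
}

// channelKind (hypothetical) mimics a channel with an explicit type map.
// Types missing from the map fall back to "file", which is what a new
// "voice" type would have triggered in channels that only know "audio".
func channelKind(typeMap map[string]string, mediaType string) string {
	if kind, ok := typeMap[mediaType]; ok {
		return kind
	}
	return "file"
}

func main() {
	oneBot := map[string]string{"audio": "record"}

	// With the fix, voice.ogg stays "audio" and maps to "record".
	fmt.Println(channelKind(oneBot, inferMediaType("voice.ogg")))

	// With the earlier "voice" type, the same file degrades to "file".
	fmt.Println(channelKind(oneBot, "voice"))
}
```

Keeping the shared type as `"audio"` means only the Telegram channel needs to know about voice bubbles.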
A new branch was merged; could you please fix the conflicts? 🙏
Rebased on latest main and resolved conflicts. Adapted to the new transcriber architecture — ElevenLabs transcriber is now in its own file ( |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
    // Send OGG files with "voice" in the filename as Telegram voice
    // bubbles (SendVoice) instead of audio attachments (SendAudio).
    fn := strings.ToLower(part.Filename)
    if strings.Contains(fn, "voice") && (strings.HasSuffix(fn, ".ogg") || strings.HasSuffix(fn, ".oga")) {
        vparams := &telego.SendVoiceParams{
The voice-bubble detection uses strings.Contains(fn, "voice"), which will also match unrelated filenames like invoice.ogg and send them as Telegram voice messages incorrectly. Consider using a stricter check (e.g., HasPrefix/HasSuffix on the base name with a delimiter, or a regex for (^|[^a-z0-9])voice([^a-z0-9]|$)) so only intended voice clips are routed to SendVoice.
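One way to apply the stricter check suggested above is sketched below. It uses the regex from the review comment on the lowercased base name with the extension removed. The helper name `isVoiceClip` is hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// voiceToken matches "voice" only when it is delimited by a
// non-alphanumeric character or by the start/end of the name.
var voiceToken = regexp.MustCompile(`(^|[^a-z0-9])voice([^a-z0-9]|$)`)

// isVoiceClip (hypothetical helper) reports whether a filename looks like
// an intended voice clip: an OGG/OGA file whose base name contains "voice"
// as a standalone token.
func isVoiceClip(filename string) bool {
	fn := strings.ToLower(filepath.Base(filename))
	ext := filepath.Ext(fn)
	if ext != ".ogg" && ext != ".oga" {
		return false
	}
	return voiceToken.MatchString(strings.TrimSuffix(fn, ext))
}

func main() {
	for _, name := range []string{"voice.ogg", "reply_voice_01.oga", "invoice.ogg", "voicemail.ogg", "voice.mp3"} {
		fmt.Printf("%-20s %v\n", name, isVoiceClip(name))
	}
}
```

With this check, `invoice.ogg` and `voicemail.ogg` stay on the `SendAudio` path, while `voice.ogg` and `reply_voice_01.oga` are treated as voice clips.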
    logger.ErrorCF("voice", "ElevenLabs API error", map[string]any{
        "status_code": resp.StatusCode,
        "response":    string(body),
    })
    return nil, fmt.Errorf("ElevenLabs API error (status %d): %s", resp.StatusCode, string(body))
On non-200 responses, the code logs and returns the full response body ("response": string(body) and includes it in the returned error). This can leak potentially sensitive information and can also produce very large logs/errors if the upstream returns HTML or verbose JSON. Consider truncating/sanitizing the body in logs/errors (and optionally parsing a structured error field) while still preserving enough detail for debugging.
Suggested change:
    truncatedBody := utils.Truncate(string(body), 512)
    logger.ErrorCF("voice", "ElevenLabs API error", map[string]any{
        "status_code": resp.StatusCode,
        "response":    truncatedBody,
    })
    return nil, fmt.Errorf("ElevenLabs API error (status %d): %s", resp.StatusCode, truncatedBody)
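The behavior of `utils.Truncate` is not shown in this thread, so the following is only a sketch of what such a helper might do. `truncateBody` and `apiError` are hypothetical names, and the rune-based cut with a trailing `...` is an assumption rather than the project's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateBody (hypothetical) cuts s to at most maxRunes runes and appends
// "..." when something was removed. Working on runes avoids splitting a
// multi-byte UTF-8 character in the middle.
func truncateBody(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= maxRunes {
		return s
	}
	return string(runes[:maxRunes]) + "..."
}

// apiError (hypothetical) builds the error for a non-200 response with a
// bounded copy of the upstream body, mirroring the suggested change.
func apiError(status int, body []byte) error {
	return fmt.Errorf("ElevenLabs API error (status %d): %s", status, truncateBody(string(body), 512))
}

func main() {
	// A verbose HTML error page stays bounded in the resulting error.
	body := []byte(strings.Repeat("<html>upstream failure</html>", 100))
	err := apiError(502, body)
	fmt.Println(len(err.Error()) < 600) // true
}
```

Truncation limits log size but does not redact secrets, so parsing a structured error field, as the review suggests, may still be worthwhile.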
pls fix lint 🙏
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
    type VoiceConfig struct {
        ModelName         string `json:"model_name,omitempty"         env:"PICOCLAW_VOICE_MODEL_NAME"`
        EchoTranscription bool   `json:"echo_transcription"           env:"PICOCLAW_VOICE_ECHO_TRANSCRIPTION"`
        ElevenLabsAPIKey  string `json:"elevenlabs_api_key,omitempty" env:"PICOCLAW_VOICE_ELEVENLABS_API_KEY"`
    }
The PR description/config snippet refer to providers.elevenlabs.api_key, but the implementation adds voice.elevenlabs_api_key (and DetectTranscriber reads cfg.Voice.ElevenLabsAPIKey). This mismatch will cause users following the documented JSON to get no ElevenLabs transcriber. Either wire ElevenLabs through providers.elevenlabs as described, or update the docs/PR description and any config examples to match the voice section.
Good catch! The original implementation used providers.elevenlabs, but during rebase onto latest main, the ProvidersConfig struct was removed (provider refactoring to model_list). I moved the ElevenLabs API key to voice.elevenlabs_api_key instead, which fits better since it's specifically for voice transcription. Updated the PR description and config examples to match the actual implementation.
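The mismatch raised in the review can be reproduced with a small sketch. The field names and JSON keys follow the `VoiceConfig` snippet above. The `Config` wrapper, the `loadConfig` helper, and the rule that the environment variable overrides the file are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoiceConfig mirrors the fields shown in the review snippet.
type VoiceConfig struct {
	ModelName         string `json:"model_name,omitempty"`
	EchoTranscription bool   `json:"echo_transcription"`
	ElevenLabsAPIKey  string `json:"elevenlabs_api_key,omitempty"`
}

// Config is a hypothetical top-level wrapper with only the "voice" section.
type Config struct {
	Voice VoiceConfig `json:"voice"`
}

// loadConfig (hypothetical) parses the JSON and then applies the
// environment override. getenv is injected so the lookup can be stubbed;
// in a real program it would typically be os.Getenv.
func loadConfig(raw []byte, getenv func(string) string) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	if key := getenv("PICOCLAW_VOICE_ELEVENLABS_API_KEY"); key != "" {
		cfg.Voice.ElevenLabsAPIKey = key
	}
	return cfg, nil
}

func main() {
	noEnv := func(string) string { return "" }

	// The previously documented shape is silently ignored: encoding/json
	// drops unknown keys, so the key never reaches VoiceConfig.
	old := []byte(`{"providers": {"elevenlabs": {"api_key": "sk_your_key_here"}}}`)
	cfg, _ := loadConfig(old, noEnv)
	fmt.Println(cfg.Voice.ElevenLabsAPIKey == "") // true

	// The shape that matches the implementation.
	current := []byte(`{"voice": {"elevenlabs_api_key": "sk_your_key_here"}}`)
	cfg, _ = loadConfig(current, noEnv)
	fmt.Println(cfg.Voice.ElevenLabsAPIKey) // sk_your_key_here
}
```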
huaaudio left a comment
Hi @manaporkun, thanks for the PR! The config logic looks good. Could you run `make fmt && make lint` locally to fix the lint issue before we proceed and merge? Thanks
…pport
Add ElevenLabsTranscriber as an alternative speech-to-text provider using
the ElevenLabs Scribe API (scribe_v1). This enables voice message
transcription for users who already have an ElevenLabs API key, without
requiring a separate Groq account.
Changes:
- Add ElevenLabsTranscriber implementing the Transcriber interface
- Update DetectTranscriber to check providers.elevenlabs.api_key first,
falling back to Groq for backward compatibility
- Add ElevenLabs to ProvidersConfig
- Add "voice" media type for OGG files with "voice" in filename
- Add SendVoice support in Telegram channel for voice bubble messages
- Add comprehensive tests for ElevenLabs transcriber
Configuration:
"providers": {
"elevenlabs": {
"api_key": "sk_your_key_here"
}
}
Closes sipeed#1503 (partial)
…ssion in other channels
Address review feedback: keep inferMediaType returning "audio" for all OGG files. Voice-bubble detection (SendVoice vs SendAudio) is now done inside the Telegram channel based on filename, so other channels that map "audio" explicitly are unaffected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
@manaporkun Nice contribution! The ElevenLabs Scribe transcriber gives users another solid STT option, and the priority chain (voice model > ElevenLabs > Groq) keeps backward compatibility clean. The Telegram SendVoice bubble support is a great UX improvement too, way better than raw audio attachments. Thorough test coverage across the board. We're running a PicoClaw Dev Group on Discord for contributors to chat and collaborate. If you're interested, email |
- Convert HEIC build photos to JPEG in docs/images/ with descriptive names
- Rewrite README with hero image, build story, and architecture overview
- Rename picoclaw/ to character/ (persona files, not the tool itself)
- Update hardware.md with full audio config: MAX98357A dtoverlay, ALSA dmix+softvol, Pi Zero 2W over-amplification fix, USB mic tuning
- Mention upstream ElevenLabs TTS contribution (sipeed/picoclaw#1905)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pport (sipeed#1905)
* feat: add ElevenLabs Scribe STT transcriber and Telegram SendVoice support
Add ElevenLabsTranscriber as an alternative speech-to-text provider using the ElevenLabs Scribe API (scribe_v1). This enables voice message transcription for users who already have an ElevenLabs API key, without requiring a separate Groq account.
Changes:
- Add ElevenLabsTranscriber implementing the Transcriber interface
- Update DetectTranscriber to check providers.elevenlabs.api_key first, falling back to Groq for backward compatibility
- Add ElevenLabs to ProvidersConfig
- Add "voice" media type for OGG files with "voice" in filename
- Add SendVoice support in Telegram channel for voice bubble messages
- Add comprehensive tests for ElevenLabs transcriber
Configuration:
"providers": {
"elevenlabs": {
"api_key": "sk_your_key_here"
}
}
Closes sipeed#1503 (partial)
* fix: move voice-bubble detection into Telegram channel to avoid regression in other channels
Address review feedback: keep inferMediaType returning "audio" for all OGG files. Voice-bubble detection (SendVoice vs SendAudio) is now done inside the Telegram channel based on filename, so other channels that map "audio" explicitly are unaffected.
* fix: align VoiceConfig struct tags to pass golines formatter
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(agent): use ModelName in loop test added by upstream
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Add `ElevenLabsTranscriber` as an alternative speech-to-text provider using the ElevenLabs Scribe API (`scribe_v1`). This enables voice message transcription for users who already have an ElevenLabs API key, without requiring a separate Groq account.
- Add Telegram `SendVoice` support so voice messages are sent as proper voice bubbles instead of audio file attachments.
Changes
- `pkg/voice/elevenlabs_transcriber.go`: `ElevenLabsTranscriber` struct implementing the `Transcriber` interface. Uses `xi-api-key` header auth and the `scribe_v1` model.
- `pkg/voice/transcriber.go`: Update `DetectTranscriber` to check `voice.elevenlabs_api_key` first, falling back to Groq model-list entries for backward compatibility.
- `pkg/config/config.go`: Add `ElevenLabsAPIKey` field to `VoiceConfig`.
- `pkg/channels/telegram/telegram.go`: `"audio"` case: OGG files with "voice" in the filename use `SendVoice`, others use `SendAudio`. No changes to `inferMediaType` or other channels.
- `pkg/voice/elevenlabs_transcriber_test.go`: `TestElevenLabsTranscribe` suite (success, API error, missing file).
- `pkg/voice/transcriber_test.go`: `DetectTranscriber` priority tests for ElevenLabs, Groq, and voice model name.
Configuration
    { "voice": { "elevenlabs_api_key": "sk_your_key_here" } }
Or via environment variable:
    PICOCLAW_VOICE_ELEVENLABS_API_KEY=sk_your_key_here
When configured, ElevenLabs takes priority over Groq for transcription. Existing Groq configurations (via `model_list`) continue to work unchanged.
Test plan
- `DetectTranscriber` correctly prioritizes: voice model name > ElevenLabs > Groq model-list
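The priority chain in the test plan can be sketched as a single ordered check. This is a simplified illustration, not the body of `DetectTranscriber`: the `voiceSettings` struct, the `detectProvider` function, the returned labels, and the `GroqAPIKey` field (standing in for a Groq entry found in `model_list`) are all hypothetical.

```go
package main

import "fmt"

// voiceSettings (hypothetical) gathers the three inputs the priority
// chain depends on.
type voiceSettings struct {
	ModelName        string // explicit voice model name
	ElevenLabsAPIKey string // voice.elevenlabs_api_key
	GroqAPIKey       string // stands in for a Groq entry in model_list
}

// detectProvider returns a label for the first configured option in the
// order: voice model name > ElevenLabs > Groq. An empty string means no
// transcriber is available.
func detectProvider(s voiceSettings) string {
	switch {
	case s.ModelName != "":
		return "voice-model:" + s.ModelName
	case s.ElevenLabsAPIKey != "":
		return "elevenlabs"
	case s.GroqAPIKey != "":
		return "groq"
	default:
		return ""
	}
}

func main() {
	// ElevenLabs wins over Groq when both are configured.
	fmt.Println(detectProvider(voiceSettings{ElevenLabsAPIKey: "sk_a", GroqAPIKey: "gsk_b"}))
	// A Groq-only setup keeps working unchanged.
	fmt.Println(detectProvider(voiceSettings{GroqAPIKey: "gsk_b"}))
}
```

Checking the options in a fixed order keeps existing Groq-only configurations on the same path they used before the change.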