feat: add ElevenLabs Scribe STT transcriber and Telegram SendVoice support #1905

Merged
huaaudio merged 4 commits into sipeed:main from
manaporkun:feat/elevenlabs-transcriber-and-voice
Mar 23, 2026

Conversation

@manaporkun
Contributor

@manaporkun manaporkun commented Mar 22, 2026

Summary

  • Add ElevenLabsTranscriber as an alternative speech-to-text provider using the ElevenLabs Scribe API (scribe_v1). This enables voice message transcription for users who already have an ElevenLabs API key, without requiring a separate Groq account.
  • Add Telegram SendVoice support so voice messages are sent as proper voice bubbles instead of audio file attachments.

Changes

| File | Change |
| --- | --- |
| pkg/voice/elevenlabs_transcriber.go | New ElevenLabsTranscriber struct implementing the Transcriber interface. Uses xi-api-key header auth and scribe_v1 model. |
| pkg/voice/transcriber.go | Updated DetectTranscriber to check voice.elevenlabs_api_key first, falling back to Groq model-list entries for backward compatibility. |
| pkg/config/config.go | Added ElevenLabsAPIKey field to VoiceConfig. |
| pkg/channels/telegram/telegram.go | Telegram-specific voice-bubble detection inside the "audio" case: OGG files with "voice" in the filename use SendVoice, others use SendAudio. No changes to inferMediaType or other channels. |
| pkg/voice/elevenlabs_transcriber_test.go | TestElevenLabsTranscribe suite (success, API error, missing file). |
| pkg/voice/transcriber_test.go | DetectTranscriber priority tests for ElevenLabs, Groq, and voice model name. |
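
For readers who want a feel for the request the transcriber makes, here is a minimal sketch, not the code from this PR. Only the xi-api-key header and the scribe_v1 model come from the description above. The helper names (transcribe, newFakeServer), the form field names model_id and file, and the "text" response field are assumptions for illustration. The endpoint is passed in so the example can run against a local test server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
)

// transcribe uploads audio as multipart form data and returns the "text"
// field of the JSON response. Hypothetical helper, not the PR's implementation.
func transcribe(client *http.Client, endpoint, apiKey, filename string, audio []byte) (string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	// Assumed field names: "model_id" for the model and "file" for the audio.
	if err := w.WriteField("model_id", "scribe_v1"); err != nil {
		return "", err
	}
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return "", err
	}
	if _, err := part.Write(audio); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}

	req, err := http.NewRequest(http.MethodPost, endpoint, &buf)
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	req.Header.Set("xi-api-key", apiKey) // header auth, as described in the PR

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// Cap the amount read so an unexpected response cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("transcription failed with status %d", resp.StatusCode)
	}
	var out struct {
		Text string `json:"text"`
	}
	if err := json.Unmarshal(body, &out); err != nil {
		return "", err
	}
	return out.Text, nil
}

// newFakeServer stands in for the remote API: it accepts the request only
// when the key and model match, and replies with a fixed transcript.
func newFakeServer(wantKey string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("xi-api-key") != wantKey || r.FormValue("model_id") != "scribe_v1" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"text":"hello world"}`)
	}))
}

func main() {
	srv := newFakeServer("sk_test")
	defer srv.Close()
	text, err := transcribe(srv.Client(), srv.URL, "sk_test", "clip.ogg", []byte("fake audio"))
	fmt.Println(text, err)
}
```

The same fake-server pattern covers the success and API-error cases listed for the test suite without touching the network.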

Configuration

{
  "voice": {
    "elevenlabs_api_key": "sk_your_key_here"
  }
}

Or via environment variable: PICOCLAW_VOICE_ELEVENLABS_API_KEY=sk_your_key_here

When configured, ElevenLabs takes priority over Groq for transcription. Existing Groq configurations (via model_list) continue to work unchanged.
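
The selection order can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual DetectTranscriber: the VoiceConfig fields mirror the struct shown later in this thread, while the Config wrapper, the pickProvider function, and the groqAPIKey parameter (standing in for a Groq entry found in model_list) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoiceConfig mirrors the fields discussed in this PR (env tags omitted).
type VoiceConfig struct {
	ModelName         string `json:"model_name,omitempty"`
	EchoTranscription bool   `json:"echo_transcription"`
	ElevenLabsAPIKey  string `json:"elevenlabs_api_key,omitempty"`
}

// Config is a hypothetical top-level wrapper holding only the "voice" section.
type Config struct {
	Voice VoiceConfig `json:"voice"`
}

// pickProvider returns which transcriber would be chosen:
// voice model name first, then ElevenLabs, then Groq.
// groqAPIKey is an assumption standing in for a Groq model_list entry.
func pickProvider(cfg Config, groqAPIKey string) string {
	switch {
	case cfg.Voice.ModelName != "":
		return "model:" + cfg.Voice.ModelName
	case cfg.Voice.ElevenLabsAPIKey != "":
		return "elevenlabs"
	case groqAPIKey != "":
		return "groq"
	default:
		return "" // no transcriber configured
	}
}

func main() {
	raw := []byte(`{"voice": {"elevenlabs_api_key": "sk_your_key_here"}}`)
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// ElevenLabs wins over the Groq key because no voice model name is set.
	fmt.Println(pickProvider(cfg, "gsk_example"))
}
```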

Test plan

  • All existing tests pass
  • New ElevenLabs transcriber tests pass (success, API error, missing file)
  • DetectTranscriber correctly prioritizes: voice model name > ElevenLabs > Groq model-list
  • Interface compliance verified at compile time
  • Tested on Raspberry Pi Zero 2 W with real ElevenLabs API
  • Telegram voice bubbles render correctly on iOS/Android
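
The compile-time interface check mentioned in the test plan usually looks like the sketch below. The Transcriber method signature and the TranscriptionResponse type are assumptions (the real definitions live in pkg/voice), and the body only demonstrates the missing-file path.

```go
package main

import (
	"context"
	"fmt"
	"os"
)

// TranscriptionResponse is a hypothetical result type for this sketch.
type TranscriptionResponse struct {
	Text string
}

// Transcriber is an assumed shape of the interface; the real one may differ.
type Transcriber interface {
	Transcribe(ctx context.Context, audioFilePath string) (*TranscriptionResponse, error)
}

// ElevenLabsTranscriber is reduced to the one field this sketch needs.
type ElevenLabsTranscriber struct {
	apiKey string
}

// Compile-time assertion: the build fails if *ElevenLabsTranscriber
// stops satisfying Transcriber. It costs nothing at run time.
var _ Transcriber = (*ElevenLabsTranscriber)(nil)

// Transcribe checks that the file exists; the actual upload is omitted here.
func (t *ElevenLabsTranscriber) Transcribe(ctx context.Context, audioFilePath string) (*TranscriptionResponse, error) {
	if _, err := os.Stat(audioFilePath); err != nil {
		return nil, fmt.Errorf("failed to open audio file: %w", err)
	}
	return &TranscriptionResponse{}, nil
}

func main() {
	var tr Transcriber = &ElevenLabsTranscriber{apiKey: "sk_your_key_here"}
	_, err := tr.Transcribe(context.Background(), "does-not-exist.ogg")
	fmt.Println("missing file rejected:", err != nil)
}
```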

Copilot AI review requested due to automatic review settings March 22, 2026 23:03
@CLAassistant

CLAassistant commented Mar 22, 2026

CLA assistant check
All committers have signed the CLA.

@manaporkun manaporkun force-pushed the feat/elevenlabs-transcriber-and-voice branch from 9841a94 to 3b2e19a Compare March 22, 2026 23:07
Contributor

Copilot AI left a comment

Pull request overview

This PR adds an ElevenLabs Scribe-based speech-to-text transcriber option and improves Telegram outbound media handling by sending OGG “voice note” media as Telegram voice bubbles.

Changes:

  • Added ElevenLabsTranscriber (Scribe STT) and updated DetectTranscriber to prefer ElevenLabs when configured.
  • Added Telegram outbound "voice" handling via SendVoice.
  • Introduced filename-based "voice" media type inference for OGG/OGA files containing "voice" in the filename.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/voice/transcriber.go | Adds ElevenLabs STT transcriber and updates provider detection priority. |
| pkg/voice/transcriber_test.go | Adds interface compliance + ElevenLabs transcribe tests and provider priority tests. |
| pkg/config/config.go | Extends ProvidersConfig with ElevenLabs and updates IsEmpty(). |
| pkg/agent/loop.go | Adds "voice" inference in inferMediaType based on filename for OGG/OGA. |
| pkg/channels/telegram/telegram.go | Adds "voice" branch to send Telegram voice bubbles via SendVoice. |

Comment thread pkg/agent/loop.go Outdated
Comment on lines +1135 to +1140
// Detect voice messages: OGG files with "voice" in the filename.
// These are sent as Telegram voice bubbles rather than audio attachments.
if strings.Contains(fn, "voice") && (strings.HasSuffix(fn, ".ogg") || strings.HasSuffix(fn, ".oga")) {
	return "voice"
}

Copilot AI Mar 22, 2026

inferMediaType now returns "voice" for files named like "voice.ogg". This value is propagated into bus.MediaPart.Type for all channels (see where tool media parts are built), but only Telegram has been updated to handle "voice". Channels that map types explicitly (e.g., OneBot maps only "audio" -> "record", WeCom maps only "audio" -> "voice") will now treat these as generic files, which is a functional regression for non-Telegram channels. Consider keeping the inferred type as "audio" and doing Telegram-specific voice-bubble detection inside the Telegram channel, or ensure every channel that handles audio also treats "voice" as audio-equivalent.

Suggested change
// Detect voice messages: OGG files with "voice" in the filename.
// These are sent as Telegram voice bubbles rather than audio attachments.
if strings.Contains(fn, "voice") && (strings.HasSuffix(fn, ".ogg") || strings.HasSuffix(fn, ".oga")) {
	return "voice"
}

Contributor Author

Good catch — fixed. Moved voice-bubble detection out of inferMediaType (which now always returns "audio" for OGG files) and into the Telegram channel's Send method. Other channels are unaffected.
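
To make the trade-off concrete, here is a rough sketch of why keeping the shared type as "audio" avoids the regression. The function names and the returned strings are hypothetical. Only the OneBot "audio" -> "record" mapping and the filename check come from this thread; the "image" case and the "SendDocument" fallback are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// oneBotSegmentType mimics a channel that maps media types explicitly.
// Any type it does not know, such as "voice", degrades to a generic file.
func oneBotSegmentType(mediaType string) string {
	switch mediaType {
	case "image":
		return "image"
	case "audio":
		return "record"
	default:
		return "file"
	}
}

// telegramSendMethod names the Telegram call a part would be routed to.
// Voice-bubble detection stays local to this channel, so the shared
// media type can remain "audio" for every other channel.
func telegramSendMethod(mediaType, filename string) string {
	if mediaType != "audio" {
		return "SendDocument"
	}
	fn := strings.ToLower(filename)
	if strings.Contains(fn, "voice") && (strings.HasSuffix(fn, ".ogg") || strings.HasSuffix(fn, ".oga")) {
		return "SendVoice"
	}
	return "SendAudio"
}

func main() {
	// A shared "voice" type would have been downgraded by the explicit mapping.
	fmt.Println(oneBotSegmentType("voice"), oneBotSegmentType("audio"))
	// With "audio" kept as the shared type, Telegram still picks the bubble.
	fmt.Println(telegramSendMethod("audio", "voice_reply.ogg"))
}
```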

@afjcjsbx
Collaborator

A new branch was merged; could you please fix the conflicts? 🙏

@manaporkun manaporkun force-pushed the feat/elevenlabs-transcriber-and-voice branch from 9095a02 to 4b6cdd1 Compare March 22, 2026 23:35
Copilot AI review requested due to automatic review settings March 22, 2026 23:35
@manaporkun
Contributor Author

Rebased on latest main and resolved conflicts. Adapted to the new transcriber architecture — ElevenLabs transcriber is now in its own file (elevenlabs_transcriber.go) matching the new pattern, and DetectTranscriber checks voice model name first, then ElevenLabs, then Groq. All 25 tests pass.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


Comment on lines +484 to +488
// Send OGG files with "voice" in the filename as Telegram voice
// bubbles (SendVoice) instead of audio attachments (SendAudio).
fn := strings.ToLower(part.Filename)
if strings.Contains(fn, "voice") && (strings.HasSuffix(fn, ".ogg") || strings.HasSuffix(fn, ".oga")) {
	vparams := &telego.SendVoiceParams{
Copilot AI Mar 22, 2026

The voice-bubble detection uses strings.Contains(fn, "voice"), which will also match unrelated filenames like invoice.ogg and send them as Telegram voice messages incorrectly. Consider using a stricter check (e.g., HasPrefix/HasSuffix on the base name with a delimiter, or a regex for (^|[^a-z0-9])voice([^a-z0-9]|$)) so only intended voice clips are routed to SendVoice.
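
A stricter check along the lines suggested here could look like the sketch below. It is one possible reading of the suggestion, not code from the PR: isVoiceClip and voiceToken are hypothetical names, and the pattern is the regex quoted in the comment, applied to the lowercased base name without its extension.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// voiceToken matches "voice" only when it is delimited by the start or end
// of the name or by a character that is not a lowercase letter or digit.
var voiceToken = regexp.MustCompile(`(^|[^a-z0-9])voice([^a-z0-9]|$)`)

// isVoiceClip reports whether a filename looks like an intended voice clip:
// an .ogg or .oga file whose base name contains "voice" as a separate token.
func isVoiceClip(filename string) bool {
	fn := strings.ToLower(filepath.Base(filename))
	ext := filepath.Ext(fn)
	if ext != ".ogg" && ext != ".oga" {
		return false
	}
	return voiceToken.MatchString(strings.TrimSuffix(fn, ext))
}

func main() {
	for _, name := range []string{"voice.ogg", "invoice.ogg", "my_voice_01.oga", "voice.mp3"} {
		fmt.Printf("%s -> %v\n", name, isVoiceClip(name))
	}
}
```

With this check, invoice.ogg is still sent as a regular audio attachment, while voice.ogg and my_voice_01.oga are routed to the voice-bubble path.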

Comment on lines +112 to +116
logger.ErrorCF("voice", "ElevenLabs API error", map[string]any{
	"status_code": resp.StatusCode,
	"response": string(body),
})
return nil, fmt.Errorf("ElevenLabs API error (status %d): %s", resp.StatusCode, string(body))
Copilot AI Mar 22, 2026

On non-200 responses, the code logs and returns the full response body ("response": string(body) and includes it in the returned error). This can leak potentially sensitive information and can also produce very large logs/errors if the upstream returns HTML or verbose JSON. Consider truncating/sanitizing the body in logs/errors (and optionally parsing a structured error field) while still preserving enough detail for debugging.

Suggested change
- logger.ErrorCF("voice", "ElevenLabs API error", map[string]any{
- 	"status_code": resp.StatusCode,
- 	"response": string(body),
- })
- return nil, fmt.Errorf("ElevenLabs API error (status %d): %s", resp.StatusCode, string(body))
+ truncatedBody := utils.Truncate(string(body), 512)
+ logger.ErrorCF("voice", "ElevenLabs API error", map[string]any{
+ 	"status_code": resp.StatusCode,
+ 	"response": truncatedBody,
+ })
+ return nil, fmt.Errorf("ElevenLabs API error (status %d): %s", resp.StatusCode, truncatedBody)
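
If no suitable truncation helper is available, a small rune-safe one is easy to write. The sketch below is an illustration only: truncateForLog and apiError are hypothetical names, and the "..." suffix is an arbitrary choice. It does not describe how utils.Truncate behaves.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateForLog shortens s to at most maxRunes runes and appends "..."
// when anything was cut. Working on runes avoids splitting a multi-byte
// UTF-8 character in the middle.
func truncateForLog(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	r := []rune(s)
	if len(r) <= maxRunes {
		return s
	}
	return string(r[:maxRunes]) + "..."
}

// apiError builds an error that carries only a bounded slice of the body.
func apiError(status int, body []byte) error {
	return fmt.Errorf("ElevenLabs API error (status %d): %s", status, truncateForLog(string(body), 512))
}

func main() {
	huge := strings.Repeat("x", 10000)
	err := apiError(500, []byte(huge))
	// The message stays bounded no matter how large the upstream body is.
	fmt.Println(len(err.Error()) < 600)
}
```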

Collaborator

@afjcjsbx afjcjsbx left a comment

LGTM!

@sipeed-bot (bot) added the labels type: enhancement, domain: provider, domain: channel, and go on Mar 22, 2026
@afjcjsbx
Collaborator

pls fix lint 🙏

@manaporkun manaporkun force-pushed the feat/elevenlabs-transcriber-and-voice branch from 4b6cdd1 to 9cff4e4 Compare March 22, 2026 23:48
Copilot AI review requested due to automatic review settings March 23, 2026 11:31
@manaporkun manaporkun force-pushed the feat/elevenlabs-transcriber-and-voice branch from 9cff4e4 to 1451bb1 Compare March 23, 2026 11:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


Comment thread pkg/config/config.go
Comment on lines 911 to 915
  type VoiceConfig struct {
- 	ModelName string `json:"model_name,omitempty" env:"PICOCLAW_VOICE_MODEL_NAME"`
- 	EchoTranscription bool `json:"echo_transcription" env:"PICOCLAW_VOICE_ECHO_TRANSCRIPTION"`
+ 	ModelName         string `json:"model_name,omitempty" env:"PICOCLAW_VOICE_MODEL_NAME"`
+ 	EchoTranscription bool   `json:"echo_transcription" env:"PICOCLAW_VOICE_ECHO_TRANSCRIPTION"`
+ 	ElevenLabsAPIKey  string `json:"elevenlabs_api_key,omitempty" env:"PICOCLAW_VOICE_ELEVENLABS_API_KEY"`
  }
Copilot AI Mar 23, 2026

The PR description/config snippet refer to providers.elevenlabs.api_key, but the implementation adds voice.elevenlabs_api_key (and DetectTranscriber reads cfg.Voice.ElevenLabsAPIKey). This mismatch will cause users following the documented JSON to get no ElevenLabs transcriber. Either wire ElevenLabs through providers.elevenlabs as described, or update the docs/PR description and any config examples to match the voice section.

Contributor Author

Good catch! The original implementation used providers.elevenlabs, but during rebase onto latest main, the ProvidersConfig struct was removed (provider refactoring to model_list). I moved the ElevenLabs API key to voice.elevenlabs_api_key instead, which fits better since it's specifically for voice transcription. Updated the PR description and config examples to match the actual implementation.

Collaborator

@huaaudio huaaudio left a comment

Hi @manaporkun, thanks for the PR! The config logic looks good. Could you run make fmt && make lint locally to fix the lint issue before we proceed with the merge? Thanks

manaporkun and others added 4 commits March 23, 2026 21:55
…pport

Add ElevenLabsTranscriber as an alternative speech-to-text provider using
the ElevenLabs Scribe API (scribe_v1). This enables voice message
transcription for users who already have an ElevenLabs API key, without
requiring a separate Groq account.

Changes:
- Add ElevenLabsTranscriber implementing the Transcriber interface
- Update DetectTranscriber to check providers.elevenlabs.api_key first,
  falling back to Groq for backward compatibility
- Add ElevenLabs to ProvidersConfig
- Add "voice" media type for OGG files with "voice" in filename
- Add SendVoice support in Telegram channel for voice bubble messages
- Add comprehensive tests for ElevenLabs transcriber

Configuration:
  "providers": {
    "elevenlabs": {
      "api_key": "sk_your_key_here"
    }
  }

Closes sipeed#1503 (partial)
…ssion in other channels

Address review feedback: keep inferMediaType returning "audio" for all
OGG files. Voice-bubble detection (SendVoice vs SendAudio) is now done
inside the Telegram channel based on filename, so other channels that
map "audio" explicitly are unaffected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 23, 2026 20:56
@manaporkun manaporkun force-pushed the feat/elevenlabs-transcriber-and-voice branch from 8fd098f to 8ab96fd Compare March 23, 2026 20:56
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


@huaaudio huaaudio merged commit dd9adf8 into sipeed:main Mar 23, 2026
7 of 8 checks passed
@manaporkun manaporkun deleted the feat/elevenlabs-transcriber-and-voice branch March 23, 2026 21:12
@Orgmar
Contributor

Orgmar commented Mar 24, 2026

@manaporkun Nice contribution! The ElevenLabs Scribe transcriber gives users another solid STT option, and the priority chain (voice model > ElevenLabs > Groq) keeps backward compatibility clean. The Telegram SendVoice bubble support is a great UX improvement too, way better than raw audio attachments. Thorough test coverage across the board.

We're running a PicoClaw Dev Group on Discord for contributors to chat and collaborate. If you're interested, email support@sipeed.com with subject [Join PicoClaw Dev Group] manaporkun and we'll get you the invite!

manaporkun added a commit to manaporkun/talking-flower that referenced this pull request Mar 24, 2026
- Convert HEIC build photos to JPEG in docs/images/ with descriptive names
- Rewrite README with hero image, build story, and architecture overview
- Rename picoclaw/ to character/ (persona files, not the tool itself)
- Update hardware.md with full audio config: MAX98357A dtoverlay, ALSA
  dmix+softvol, Pi Zero 2W over-amplification fix, USB mic tuning
- Mention upstream ElevenLabs TTS contribution (sipeed/picoclaw#1905)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JerryWang-xmu pushed a commit to JerryWang-xmu/picoclaw that referenced this pull request Mar 24, 2026
…pport (sipeed#1905)

* feat: add ElevenLabs Scribe STT transcriber and Telegram SendVoice support

Add ElevenLabsTranscriber as an alternative speech-to-text provider using
the ElevenLabs Scribe API (scribe_v1). This enables voice message
transcription for users who already have an ElevenLabs API key, without
requiring a separate Groq account.

Changes:
- Add ElevenLabsTranscriber implementing the Transcriber interface
- Update DetectTranscriber to check providers.elevenlabs.api_key first,
  falling back to Groq for backward compatibility
- Add ElevenLabs to ProvidersConfig
- Add "voice" media type for OGG files with "voice" in filename
- Add SendVoice support in Telegram channel for voice bubble messages
- Add comprehensive tests for ElevenLabs transcriber

Configuration:
  "providers": {
    "elevenlabs": {
      "api_key": "sk_your_key_here"
    }
  }

Closes sipeed#1503 (partial)

* fix: move voice-bubble detection into Telegram channel to avoid regression in other channels

Address review feedback: keep inferMediaType returning "audio" for all
OGG files. Voice-bubble detection (SendVoice vs SendAudio) is now done
inside the Telegram channel based on filename, so other channels that
map "audio" explicitly are unaffected.

* fix: align VoiceConfig struct tags to pass golines formatter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(agent): use ModelName in loop test added by upstream

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
andressg79 pushed a commit to andressg79/picoclaw that referenced this pull request Mar 30, 2026
ra1phdd pushed a commit to ra1phdd/picoclaw-pkg that referenced this pull request Apr 12, 2026

Labels

domain: channel, domain: provider, go, type: enhancement


6 participants