Speed Control, Energy Ramp-up, and Voice Gender Inconsistency when dealing #303

@4N1Z

First of all, thank you for the model! I really appreciate it for what it is. It is awesome to see it perform so well in real-time scenarios, and the Arabic dialect works like a charm.

I am currently using the TTS model via Groq and have identified one feature request and two specific behaviors that I would like to report.

Feature Request: Speed Adjustment

While the neural output is high quality, having the ability to manually adjust the playback speed directly within the TTS engine would be a significant addition. Direct manipulation at the model level offers better consistency for real-time applications than post-processing.
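To illustrate why post-processing is a weak substitute, here is a minimal sketch of the naive approach: resampling the generated samples by linear interpolation. The function name `change_speed_naive` is hypothetical and not part of any Groq or model API. The duration changes as intended, but the pitch is scaled by the same factor, so the voice itself sounds different at each speed setting.

```python
def change_speed_naive(samples, factor):
    """Resample so playback is `factor` times faster (hypothetical helper).

    Uses linear interpolation. Pitch shifts by the same factor, which is
    the main drawback of doing speed control as post-processing.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor                      # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

Pitch-preserving time stretching exists as a post-processing step too, but it adds latency and can introduce artifacts, which is why a native speed parameter would be preferable for real-time calls.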

Bug Reports

  1. Performance/Energy "Warm-up" Period
    I've noticed a consistent but strange behavior when using the Arabic model. The initial speech output is often slow and lacks energy. However, once the generation reaches the 80–90 second mark, the model seems to "find its stride": its energy increases significantly and the output reaches peak quality.
  • Observed Behavior: Slow, low-energy start, followed by a noticeable boost in energy and pace after ~1.5 minutes.
  • Request: Any insights into why the model requires this "warm-up" period or if this is a known state-initialization issue?
  2. Random Voice Gender Switching (Female Voice)
    When the Female voice is selected, the initial stream occasionally outputs a male voice, seemingly at random. After the first stream segment, the output reverts to the correctly selected female voice.
  • Context: This happens in consistent environments during real-time calls.
  • Frequency: Random, but specifically isolated to the female voice setting. The male voice does not seem to experience similar switching.
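To put a number on the warm-up described in the first report, one option is to compare the loudness of the audio before and after the ~90 second mark. The sketch below is only an assumption about how this could be measured: it takes already-decoded mono samples as a list of floats, and the helper names `rms_per_window` and `late_to_early_ratio` are hypothetical.

```python
import math

def rms_per_window(samples, sample_rate, window_s=1.0):
    """Return the RMS energy of each consecutive, non-overlapping window."""
    size = max(1, int(sample_rate * window_s))
    rms = []
    for start in range(0, len(samples) - size + 1, size):
        chunk = samples[start:start + size]
        rms.append(math.sqrt(sum(x * x for x in chunk) / size))
    return rms

def late_to_early_ratio(samples, sample_rate, split_s, window_s=1.0):
    """Mean RMS after `split_s` seconds divided by mean RMS before it."""
    rms = rms_per_window(samples, sample_rate, window_s)
    k = int(split_s / window_s)               # number of "early" windows
    early, late = rms[:k], rms[k:]
    if not early or not late:
        raise ValueError("need audio on both sides of split_s")
    early_mean = sum(early) / len(early)
    if early_mean == 0:
        raise ValueError("early part of the audio is silent")
    return (sum(late) / len(late)) / early_mean
```

A ratio well above 1 with `split_s=90` across several generations would support the report. RMS only captures loudness, so the slow pacing would need a separate measure such as words per second.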

Technical Questions & Debugging

What do you think could be causing these inconsistencies? I would appreciate any insights into:

  • Likely root causes (e.g., KV caching issues, seed/initialization variance, or specific architectural quirks).
  • Suggested debugging or training strategies to stabilize the initial output energy and voice consistency.
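For the voice switching, a rough way to catch affected calls automatically would be to compare the pitch of the first stream segment with a later one. This is a sketch under assumptions: it uses a plain autocorrelation pitch estimate on a short voiced frame of mono float samples, the names `estimate_f0` and `first_segment_is_lower` are hypothetical, and the 0.75 threshold is an arbitrary starting point, not a calibrated value.

```python
import math

def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak.

    Returns 0.0 if the frame is too short or has no positive correlation.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation; it slightly favors shorter lags,
        # which reduces octave-down errors on clean periodic input.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

def first_segment_is_lower(first_frame, later_frame, sample_rate, max_drop=0.75):
    """Flag a call whose first segment is pitched well below a later one."""
    f0_first = estimate_f0(first_frame, sample_rate)
    f0_later = estimate_f0(later_frame, sample_rate)
    if f0_first == 0.0 or f0_later == 0.0:
        return False                          # no usable pitch estimate
    return f0_first < max_drop * f0_later
```

Real speech is noisier than a test tone, so the frames should be voiced (for example, a vowel) and the result averaged over several frames before a call is flagged.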

Environment:

  • Provider: Groq
  • Language: Arabic
  • Use Case: Real-time voice agents/calls
