First of all, thank you for the model! I really appreciate it for what it is. It is great to see it perform so well in real-time scenarios, and the Arabic dialect works like a charm.
I am currently using the TTS model via Groq and have identified one feature request and two specific behaviors that I would like to report.
Feature Request: Speed Adjustment
While the output quality is high, the ability to adjust the speaking speed directly within the TTS engine would be a significant addition. Controlling speed at the model level should give more consistent results for real-time applications than post-processing the audio.
Bug Reports
- Performance/Energy "Warm-up" Period
I've noticed a consistent but strange behavior when using the Arabic model. The initial speech output is often slow and low in energy. However, once generation passes the 80–90 second mark, the model seems to "find its stride": the speech becomes noticeably more energetic and stays that way.
- Observed Behavior: Slow/low energy start, followed by a noticeable performance boost after ~1.5 minutes.
- Request: Is there any insight into why the model needs this "warm-up" period, or is this a known state-initialization issue?
- Random Voice Gender Switching (Female Voice)
When the female voice is selected, the first stream segment occasionally comes out in a male voice, seemingly at random. After that first segment, the output reverts to the selected female voice.
- Context: This happens in consistent environments during real-time calls.
- Frequency: Intermittent, and so far only observed with the female voice setting. The male voice does not seem to switch in the same way.
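In case it helps with reproducing the voice switch, here is a minimal sketch of how the first stream segment could be flagged automatically. It estimates the fundamental frequency (pitch) of a short voiced frame by autocorrelation, on the assumption that a male voice will generally show a lower pitch than the selected female voice. The names `estimate_f0` and `tone`, the 60–400 Hz search range, and the synthetic test tones are all my own illustrative choices, not part of the model or the Groq API.

```python
import math

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame.

    Picks the lag with the largest (unnormalized) autocorrelation inside
    the lag range that corresponds to f_min..f_max.
    Returns None if no positive correlation peak is found.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

def tone(freq, sample_rate, n):
    """Synthetic sine wave, used here only as a stand-in for real audio."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

sample_rate = 16000
low = estimate_f0(tone(120.0, sample_rate, 2048), sample_rate)   # lower-pitched stand-in
high = estimate_f0(tone(220.0, sample_rate, 2048), sample_rate)  # higher-pitched stand-in
print(f"low tone: {low:.1f} Hz, high tone: {high:.1f} Hz")
```

With real recordings, the idea would be to run `estimate_f0` on a voiced frame from the first stream segment and on one from a later segment, then compare the two values. Real speech is much noisier than a sine wave, so this is only a rough check, not a reliable gender classifier.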
Technical Questions & Debugging
What do you think could be causing these inconsistencies? I would appreciate any insights into:
- Likely root causes (e.g., KV caching issues, seed/initialization variance, or specific architectural quirks).
- Suggested debugging or training strategies to stabilize the initial output energy and voice consistency.
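To put numbers on the "warm-up" behavior rather than relying on my ear, one option is to track the RMS energy of the output in fixed windows and see whether it jumps around the 80–90 second mark. Below is a minimal sketch under my own assumptions: the function name `window_rms`, the one-second window, and the synthetic signal are illustrative, and the samples are assumed to be mono floats already decoded from the stream.

```python
import math

def window_rms(samples, sample_rate, window_seconds=1.0):
    """Return one RMS value per non-overlapping window of the signal.

    A trailing partial window is dropped.
    """
    size = max(1, int(sample_rate * window_seconds))
    values = []
    for start in range(0, len(samples) - size + 1, size):
        window = samples[start:start + size]
        values.append(math.sqrt(sum(x * x for x in window) / size))
    return values

# Synthetic stand-in for a recording: a quiet first half, a louder second half.
sample_rate = 8000
samples = [0.1] * (2 * sample_rate) + [0.5] * (2 * sample_rate)
for second, rms in enumerate(window_rms(samples, sample_rate)):
    print(f"t={second}s rms={rms:.3f}")
```

On a real recording, a clear step in these values near 80–90 seconds would back up the behavior described above. Speaking rate could be tracked the same way, for example by counting words per window from the input text timing.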
Environment:
- Provider: Groq
- Language: Arabic
- Use Case: Real-time voice agents/calls