Hello.
At the start of generation, we observe a non-negligible period of background noise (400 ms, for example) before the audio payload.
This is not true silence (PCM = 0) but low-amplitude PCM values (background noise).
Example with the following setup:
- Ubuntu 24.04
- NVIDIA RTX 5090 (32 GB)
- Orpheus-TTS commit e64661f (latest)
- Model: canopylabs/orpheus-3b-0.1-ft
- vLLM 0.15.1
Other combinations show the same problem, and changing client parameters such as "temperature" or "repetition_penalty" makes no difference.
To show the issue, we generate a simple text ("Hello") and read the custom tokens along with their timestamps. Once all the custom tokens have been read, we use SNAC to generate the PCM, then compute an RMS per frame to detect the useful audio.
The "time" field indicates when the tokens were received. The "realtime" field indicates when the audio has to be played.
| frame | time (ms) | realtime (ms) | RMS × 1000 | status |
|---|---|---|---|---|
| 0 | 130 | 130 | 2.886 | silent |
| 1 | 165 | 215 | 2.886 | silent |
| 2 | 201 | 301 | 3.117 | silent |
| 3 | 237 | 386 | 2.621 | silent |
| 4 | 273 | 471 | 1.755 | silent |
| 5 | 309 | 557 | 1.378 | silent |
| 6 | 345 | 642 | 5.958 | silent |
| 7 | 381 | 727 | 62.029 | NON-SILENT |
| 8 | 417 | 813 | 90.580 | NON-SILENT |
| 9 | 453 | 898 | 91.245 | NON-SILENT |
| 10 | 489 | 983 | 87.851 | NON-SILENT |
| 11 | 526 | 1069 | 70.061 | NON-SILENT |
| 12 | 562 | 1154 | 37.333 | NON-SILENT |
| 13 | 598 | 1239 | 8.673 | silent |
| 14 | 634 | 1325 | 1.586 | silent |
| 15 | 670 | 1410 | 1.128 | silent |
| 16 | 706 | 1495 | 1.206 | silent |
| 17 | 743 | 1581 | 0.806 | silent |
| 18 | 779 | 1666 | 0.667 | silent |
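For anyone who wants to reproduce the measurement, here is a minimal sketch of the per-frame RMS classification. It is not our exact script. It assumes the decoded PCM is a list of floats normalised to [-1.0, 1.0], that each frame holds 2048 samples (inferred from the ~85.3 ms "realtime" step at 24 kHz), and that the silence threshold is 0.02. The threshold and the names `frame_rms` and `classify_frames` are illustrative choices, not part of Orpheus-TTS or SNAC.

```python
import math

FRAME_SAMPLES = 2048      # assumption: samples per decoded frame (~85.3 ms at 24 kHz)
SILENCE_THRESHOLD = 0.02  # assumption: RMS below this value is treated as "silent"


def frame_rms(samples):
    """Root mean square of one frame of PCM samples normalised to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_frames(pcm, frame_samples=FRAME_SAMPLES, threshold=SILENCE_THRESHOLD):
    """Split PCM into fixed-size frames and return (index, rms, status) per frame."""
    results = []
    for start in range(0, len(pcm), frame_samples):
        frame = pcm[start:start + frame_samples]
        rms = frame_rms(frame)
        status = "NON-SILENT" if rms >= threshold else "silent"
        results.append((start // frame_samples, rms, status))
    return results


if __name__ == "__main__":
    # Synthetic input: one frame of low-level noise followed by one louder frame.
    pcm = [0.001] * FRAME_SAMPLES + [0.1] * FRAME_SAMPLES
    for index, rms, status in classify_frames(pcm):
        print(f"{index} {rms * 1000:.3f} {status}")
```

Any threshold between the highest "silent" value and the lowest "NON-SILENT" value in the table would give the same classification for this run.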
In this example, the Time To First Byte (TTFB) is 130 ms, but the time to the first useful byte is 381 ms. Even if we drop the first seven frames, we cannot get below 381 ms, because that is when the first non-silent frame (frame 7) is received.
It seems that someone else has already reported this problem in issue #228.
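To make the two latencies explicit, the small sketch below derives them from the arrival times and statuses of the first frames in the measurement above. The helper name `first_useful_arrival` is hypothetical and only illustrates the arithmetic.

```python
def first_useful_arrival(frames):
    """Given (arrival_ms, status) pairs in arrival order, return
    (ttfb_ms, first_useful_ms). first_useful_ms is None if no frame is non-silent."""
    if not frames:
        return None, None
    ttfb_ms = frames[0][0]  # arrival time of the very first frame
    for arrival_ms, status in frames:
        if status == "NON-SILENT":
            return ttfb_ms, arrival_ms
    return ttfb_ms, None


# Arrival times and statuses of frames 0 to 8 from the measurement above.
frames = [
    (130, "silent"), (165, "silent"), (201, "silent"), (237, "silent"),
    (273, "silent"), (309, "silent"), (345, "silent"),
    (381, "NON-SILENT"), (417, "NON-SILENT"),
]

ttfb_ms, useful_ms = first_useful_arrival(frames)
print(ttfb_ms, useful_ms)  # prints: 130 381
```

Dropping the leading silent frames on the client side changes what is played, but not when frame 7 arrives, so the 381 ms figure is a lower bound for this run.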
We are integrating Orpheus TTS into an ultra-low-latency system and, because of this problem, we cannot reach the announced TTFB.
How is it possible to reach the announced TTFB without background noise at the beginning?
Is there a way to remove the problem with LoRA fine-tuning (and if so, with what LoRA configuration)? We are able to run the fine-tuning ourselves.
Many thanks for your help.