background noise at the start of generation #299

@thierryatboxco

Hello.

At the start of generation, we observe a non-negligible period of background noise (400 ms, for example) before the actual audio payload.
This is not true silence (PCM = 0) but low-amplitude PCM values (background noise).

Example with the following setup:

  • Ubuntu 24.04
  • NVIDIA RTX 5090 (32 GB)
  • Orpheus-TTS commit e64661f (latest)
  • model is canopylabs/orpheus-3b-0.1-ft
  • vLLM 0.15.1

Other combinations show the same problem, as do different values of client parameters such as "temperature", "repetition_penalty", etc.

To show the issue, we generate a short text ("Hello") and record each custom token with its arrival timestamp. Once all the custom tokens have been received, we use SNAC to decode them to PCM, then compute the RMS of each frame to detect the useful audio.
The "time" field indicates when the tokens were received. The "realtime" field indicates when the audio has to be played.


  frame  time (ms)  realtime (ms)  RMS*1000  status
      0        130            130     2.886  silent
      1        165            215     2.886  silent
      2        201            301     3.117  silent
      3        237            386     2.621  silent
      4        273            471     1.755  silent
      5        309            557     1.378  silent
      6        345            642     5.958  silent
      7        381            727    62.029  NON-SILENT
      8        417            813    90.580  NON-SILENT
      9        453            898    91.245  NON-SILENT
     10        489            983    87.851  NON-SILENT
     11        526           1069    70.061  NON-SILENT
     12        562           1154    37.333  NON-SILENT
     13        598           1239     8.673  silent
     14        634           1325     1.586  silent
     15        670           1410     1.128  silent
     16        706           1495     1.206  silent
     17        743           1581     0.806  silent
     18        779           1666     0.667  silent
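
The RMS and status columns above can be reproduced along these lines. This is only a minimal sketch of the measurement, not our exact script: it assumes float PCM samples in [-1.0, 1.0], a hypothetical frame length of 2048 samples, and an illustrative threshold of 0.020, picked only because it separates the "silent" and "NON-SILENT" rows of the table. It runs on synthetic data rather than on real SNAC output.

```python
import math
from typing import List, Sequence

# Assumptions (not from the Orpheus or SNAC code): frame length and threshold
# are illustrative values chosen for this sketch.
FRAME_LEN = 2048
SILENCE_THRESHOLD = 0.020


def frame_rms(pcm: Sequence[float], frame_len: int) -> List[float]:
    """Return the RMS of each full frame; a trailing partial frame is ignored."""
    values = []
    for start in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[start:start + frame_len]
        values.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return values


def classify(rms_values: Sequence[float],
             threshold: float = SILENCE_THRESHOLD) -> List[str]:
    """Label each frame as useful audio or not, based on its RMS."""
    return ["NON-SILENT" if r >= threshold else "silent" for r in rms_values]


# Synthetic input: two frames of low-level noise, then one frame of a tone.
noise = [0.002 if n % 2 == 0 else -0.002 for n in range(2 * FRAME_LEN)]
tone = [0.1 * math.sin(2 * math.pi * 8 * n / FRAME_LEN) for n in range(FRAME_LEN)]

rms = frame_rms(noise + tone, FRAME_LEN)
status = classify(rms)

for i, (r, s) in enumerate(zip(rms, status)):
    print(f"{i:5d}  {r * 1000:8.3f}  {s}")
```

The important point is that the leading frames are not zero: their RMS is small but clearly above 0, so a check for exact silence (PCM = 0) would not catch them.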

In this example, the Time To First Byte (TTFB) is 130 ms, but the Time To First Useful Byte is 381 ms. Even if we drop the first seven frames, we cannot get a delay lower than 381 ms.
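
For reference, the two delays can be derived from the table as follows. This is a small sketch: the rows are copied from the table above, while the threshold (20, in the same RMS*1000 units) is an assumption chosen only to match the status column.

```python
from typing import List, Optional, Tuple

# (frame, time_ms, realtime_ms, rms_x1000), copied from the table above.
Row = Tuple[int, int, int, float]
ROWS: List[Row] = [
    (0, 130, 130, 2.886), (1, 165, 215, 2.886), (2, 201, 301, 3.117),
    (3, 237, 386, 2.621), (4, 273, 471, 1.755), (5, 309, 557, 1.378),
    (6, 345, 642, 5.958), (7, 381, 727, 62.029), (8, 417, 813, 90.580),
    (9, 453, 898, 91.245), (10, 489, 983, 87.851), (11, 526, 1069, 70.061),
    (12, 562, 1154, 37.333), (13, 598, 1239, 8.673), (14, 634, 1325, 1.586),
    (15, 670, 1410, 1.128), (16, 706, 1495, 1.206), (17, 743, 1581, 0.806),
    (18, 779, 1666, 0.667),
]

# Assumption: illustrative threshold, in RMS*1000 units.
THRESHOLD_X1000 = 20.0


def first_useful(rows: List[Row], threshold: float) -> Optional[Row]:
    """Return the first row whose RMS reaches the threshold, or None."""
    for row in rows:
        if row[3] >= threshold:
            return row
    return None


ttfb_ms = ROWS[0][1]                      # arrival time of the very first frame
useful = first_useful(ROWS, THRESHOLD_X1000)
assert useful is not None

print(f"TTFB: {ttfb_ms} ms")
print(f"First useful frame: {useful[0]}, received at {useful[1]} ms")
print(f"Extra delay caused by the leading noise: {useful[1] - ttfb_ms} ms")
```

Dropping the leading frames on the client side only removes the noise from playback; the first useful frame is still received at 381 ms, because the model has to generate the noisy frames first.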

It seems that someone else has already reported this problem in issue #228.

We are integrating Orpheus TTS into an ultra-low-latency system and, because of this problem, we cannot reach the announced TTFB.

How is it possible to reach the announced TTFB without background noise at the beginning?

Is there a way to remove the problem with LoRA fine-tuning (and if so, with what LoRA configuration)? We are able to run the fine-tuning ourselves.

Many thanks for your help.
