
Not currently realistic

#6
by ncsta - opened

I applaud you for what you're doing here, but I think it's a little misleading. "Inference Speed: Real-time generation on mid-range devices". Even with streaming, I don't see this getting to real time. I'm getting ~15x RTF on CPU. Your GGUF will compress the model, but it's not going to speed it up.

One-Time Setup Costs (do once):

  • Model loading: 174.93s (~3 minutes)
  • Voice encoding: 74.56s (~1.2 minutes)
  • Total setup: ~4 minutes

Runtime Inference Speed (what you care about):

After everything is loaded, the system generates speech at:

Average Real-Time Factor: 14.66x

This means:

  • To generate 1 second of audio → takes ~14.66 seconds of CPU processing
  • To generate 10 seconds of audio → takes ~2.5 minutes

Individual Test Results (RTF = generation time / audio duration; see the timing sketch below):

  1. Sentence 1 (2.82s audio): 46.28s → 16.41x RTF
  2. Sentence 2 (4.32s audio): 59.91s → 13.87x RTF
  3. Sentence 3 (3.48s audio): 49.75s → 14.30x RTF
  4. Sentence 4 (2.68s audio): 39.03s → 14.56x RTF
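For anyone wanting to reproduce these numbers: RTF here is just wall-clock generation time divided by the duration of the audio produced. Here's a rough sketch of how I time it; the `tts_generate` callable is a placeholder for whatever inference call you're benchmarking, not this model's actual API:

```python
import time

import soundfile as sf  # any library that can read the generated WAV works


def measure_rtf(tts_generate, text, out_path="out.wav"):
    """Time one generation and report the real-time factor (RTF).

    `tts_generate` is a placeholder: swap in the actual call that
    synthesizes `text` and writes a WAV file to `out_path`.
    """
    start = time.perf_counter()
    tts_generate(text, out_path)            # generation under test
    elapsed = time.perf_counter() - start

    audio, sr = sf.read(out_path)
    duration = len(audio) / sr              # seconds of audio produced
    rtf = elapsed / duration                # >1.0 means slower than real time
    print(f"{duration:.2f}s audio in {elapsed:.2f}s -> {rtf:.2f}x RTF")
    return rtf
```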

Please don't take this as discouragement. I hope you can get it working, and I think it's great that someone's working on this, because everyone else is working on GPUs and most people don't have GPUs. So I look forward to seeing the real-time speed when it arrives. I'm not sure if you can convert it to ONNX using Optimum or something like that; maybe that'll get you closer to real time. (Although I've had trouble with Optimum's inputs and outputs for text-to-speech specifically, I have seen it done before for certain TTS models.)
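If anyone wants to try the Optimum route, something like the sketch below is where I'd start. The model id is a placeholder, and whether this architecture is actually supported by the exporter (and under which task name) is exactly the part I've had trouble with, so treat it as a starting point rather than a recipe:

```python
# Rough sketch of an ONNX export via Hugging Face Optimum.
# "<model-id>" is a placeholder; the task is auto-detected by default,
# and for TTS it may need to be set explicitly (or may not be supported).
from optimum.exporters.onnx import main_export

main_export(
    "<model-id>",          # placeholder for the Hub repo id
    output="onnx_export",  # directory for the exported ONNX files
)

# CLI equivalent:
#   optimum-cli export onnx --model <model-id> onnx_export/
```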

Good luck with it and I hope you work it out.

Neuphonic org

Hey there - thanks for this. We have this running in real time on a few different devices: mine's a MacBook Pro M3 running a CPU-only setup (so 1 second of compute generates 2-3 seconds of audio).

Can you send your hardware specs to give us a better understanding of what you're running this on? We'll share code and a video to replicate the results.

Appreciate the polite message - you are free to email me (sohaib@neuphonic.com) if you want to collaborate on making a demo for your use case!

Neuphonic org

Heya! Yeah, we've been getting our numbers primarily using llama.cpp for the backbone and ONNX for the decoder. We're currently trying to run it on a Raspberry Pi 5.

Nonetheless, those numbers do seem quite high... I'd definitely avoid vanilla Hugging Face Transformers if you're after real time. The GGUF versions are also quantized, so with acceleration you can get very good performance out of them as well. I think the issue we ran into with ONNX is that the models don't quantize easily (and the different runtimes can be difficult to set up).
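To give a sense of what that backbone/decoder split looks like, here's a very rough sketch; the file names, sampling parameters, and the decoder's input name are placeholders, not our actual pipeline:

```python
# Rough sketch: quantized GGUF backbone via llama.cpp, waveform decoder via ONNX Runtime.
# File names, sampling parameters, and the "codes" input name are placeholders.
import numpy as np
import onnxruntime as ort
from llama_cpp import Llama

# 1) Quantized GGUF backbone: text in, audio-code tokens out.
backbone = Llama(model_path="backbone-q4_k_m.gguf", n_ctx=2048, n_threads=8)
result = backbone("Text to speak goes here.", max_tokens=512)
generated = result["choices"][0]["text"]

# 2) Map the generated output to audio codes (model-specific; stubbed here).
audio_codes = np.zeros((1, 64), dtype=np.int64)  # stand-in for the real codes

# 3) ONNX decoder: audio codes in, waveform out.
decoder = ort.InferenceSession("decoder.onnx", providers=["CPUExecutionProvider"])
(waveform,) = decoder.run(None, {"codes": audio_codes})
```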

What's your setup atm? We don't have a huge range of hardware available in the office (mostly Apple Silicon and ARM chips) but hopefully others in the community can help out as well.
