
A big caveat here is that the author is only looking at a ranking of open models. It's buried in one little sentence in there, but it makes a big difference to model quality. Kokoro is only #15 in the overall rankings, so if #15 is what you consider the "best model", you need to be cognizant that you are leaving performance on the table.

I've heard a lot of Substacks voiced by Eleven Labs models and they seem fine (with the occasional weirdness around a proper noun). Not a bad article, but I think more examples of TTS usage would make it more useful.

I guess the takeaway is that open-weight TTS models are only okay and could be a lot better?



Yeah, from my experience the more helpful conclusion is "TTS is not commoditized yet". At some point in the next 5 years, convincing TTS will be table stakes. But for now, paying for TTS gets you better results.


The paid models are still too expensive for personal long-form use cases. For example, if I want to generate an audiobook from a web novel, the price can go as high as thousands of dollars. If I'm just a regular reader (not the author), that's prohibitively expensive for someone who just wants to enjoy the story in a different medium.


I listen to web novels with the ElevenLabs reader all the time (the $11-a-month unlimited plan). I love it.

When it's a foreign web novel with no English translation, I first translate the web novel with Claude Sonnet.


Despite ElevenLabs API usage being expensive, ElevenReader is $11 a month for unlimited personal long-form content.

Even with a local model and hardware you already own, you're not beating that on electricity costs.


I dunno about the electricity claim for practical purposes. Where I live, $11 would buy roughly 128 hours at 600W. I suppose the real question is: would it take 128 hours ($11) of power to generate 720 hours of TTS content, assuming a good enough model was available?
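To make that break-even concrete, here is a rough back-of-the-envelope sketch using only the numbers above ($11, 600W, 128 hours, 720 hours of audio). The function name and the assumption that the machine draws a constant 600W while generating are mine, purely for illustration; real draw and real model speed will differ.

```python
# Figures taken from the comment above; everything else is derived.
BUDGET_USD = 11.0        # monthly subscription price being compared against
POWER_W = 600.0          # assumed constant power draw while generating
HOURS_AT_BUDGET = 128.0  # hours of 600W that $11 buys at the commenter's rate
AUDIO_HOURS = 720.0      # 24 hours x 30 days of listening

# Energy that $11 buys: 0.6 kW * 128 h = 76.8 kWh
energy_kwh = POWER_W / 1000.0 * HOURS_AT_BUDGET

# Electricity price implied by those figures: about $0.143 per kWh
implied_price_per_kwh = BUDGET_USD / energy_kwh

# How much faster than real time the model must run to break even: 5.625x
min_speedup = AUDIO_HOURS / HOURS_AT_BUDGET


def electricity_cost(audio_hours, speedup,
                     power_w=POWER_W,
                     price_per_kwh=implied_price_per_kwh):
    """Hypothetical helper: power cost of generating `audio_hours` of speech
    with a model that runs `speedup` times faster than real time."""
    compute_hours = audio_hours / speedup
    return compute_hours * power_w / 1000.0 * price_per_kwh


if __name__ == "__main__":
    print(f"energy for $11: {energy_kwh:.1f} kWh")
    print(f"implied price: ${implied_price_per_kwh:.3f}/kWh")
    print(f"break-even speedup: {min_speedup:.3f}x real time")
    print(f"cost at 20x real time: ${electricity_cost(AUDIO_HOURS, 20.0):.2f}")
```

So under these assumptions, a local model only loses on electricity if it generates slower than about 5.6x real time, and only if you really do listen 24/7.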

On Android I’d imagine (big emphasis on imagine, since I don’t use it) you could probably script something up and use a phone with an audio jack to record it. In theory that hits the maximum of 720 hours of content per month, but I’d imagine at some point they’d find it peculiar that you’re listening to content 24/7.


Kokoro is available as a system TTS for Android via an OSS project called "sherpa": https://k2-fsa.github.io/sherpa/onnx/android/index.html

I believe its power usage is negligible in comparison to, for example, the screen or maybe even Bluetooth audio.


Yup, ElevenLabs still pretty much rules this space. Especially if you're looking for non-English models, it's really hard to find anything good, although the latest Chatterbox[1] now supports 23 languages.

[1]: https://github.com/resemble-ai/chatterbox



