TTS, or text-to-speech, has been around for a while, but only recently have its quality and accuracy approached real human speech. ElevenLabs, an online TTS service, has become famous for being used to replicate the voices of popular figures with frightening accuracy. In this article, we’ll pit it against two popular self-hosted TTS programs, Applio and Alltalk.
⚠️Warning⚠️: The example book I use in this article, The Great Gatsby, contains misogyny, racism, domestic abuse, and other controversial topics. I do not support or condone any of them, and simply chose the book because it’s the only book in my library that’s in the public domain.
To run the self-hosted TTS apps, I’ll be using my laptop’s RTX 4060 with 8GB of VRAM.
ElevenLabs
We’ll start with ElevenLabs to set the bar for what’s considered high quality. For setup, all I needed to do was sign up. Once my account was created, I was given 10,000 characters of trial credit, and after a bit of testing I settled on the Rachel voice with 50% stability, 75% similarity, 20% style exaggeration, and no speaker boost.
There weren’t enough credits to process the entire book, so I settled on just part of the first chapter (about 5,000 characters). The processing was fast, about 5–7 seconds, and the voice sounded like it was recorded on a mid-tier microphone.
ElevenLabs-The-Great-Gatsby.mp3
Applio
Although we are using it just for TTS in this article, Applio can do much more than that. It also supports changing the voice in prerecorded audio and training models on a new voice.
The setup for Applio took longer because it runs locally. After I ran the install script, startup took about 30 seconds before the web interface opened in my browser. I needed to download a model before I could use the TTS. Once I chose one of the pre-trained models on the Applio website, downloading it was as simple as pasting the link into the interface. On the TTS page I was given two voice options: the first was the RVC model I had just downloaded, and the second was the base TTS voice to use. I believe what’s happening is that Applio first generates a voice recording using the base TTS and then runs it through the RVC model to improve the quality.
I then spent about 30 minutes tweaking the advanced settings until I got a result I was satisfied with. However, when I gave it the full text of The Great Gatsby, the program crashed. To fix this, I had to split the text into chunks of about 15,000 characters and send each one for processing separately.
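The splitting step can be sketched in a few lines of Python. This is only an illustration, not the exact method I used: the function name `split_into_chunks` and the choice to break at sentence boundaries are assumptions, made so that no chunk ends in the middle of a sentence.

```python
import re


def split_into_chunks(text: str, max_chars: int = 15000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: breaks at sentence boundaries where possible and
    hard-splits only a sentence that is itself longer than max_chars.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single oversized sentence is cut into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current chunk is full; start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the TTS page one at a time, and the resulting audio files joined afterwards.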
Applio took 1,798 seconds, just under 30 minutes, to process all of the chunks (about 260,000 characters in total). The sound quality was alright, almost like it was recorded through a laptop microphone.
Alltalk
Setup for Alltalk was similar to Applio. I ran the script, and a web interface opened after a bit of waiting. Alltalk is designed to be used as an API rather than a standalone app and includes an option to integrate with Text Generation WebUI. The web interface is mainly for changing settings, but it also includes a section for testing. I enabled DeepSpeed and disabled low VRAM mode to give it every advantage.
Alltalk took 5,719 seconds to process the text, about 95 minutes. Thankfully, the app automatically split the book into chunks, so I didn’t hit any memory limits. Audio quality was noticeably worse than the others, sounding like it was recorded on the built-in mic of a cheap pair of headphones. This is likely due to Alltalk’s lack of RVC support, which is coming in version 2.
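To put the timings on a common scale, here is a rough throughput calculation in Python. It rests on two assumptions: that Alltalk processed the same roughly 260,000 characters as Applio, and that 6 seconds (the midpoint of the 5–7 second range) is a fair figure for ElevenLabs. The ElevenLabs number also covers a much shorter text on cloud hardware, so treat it as a ballpark only.

```python
# (characters processed, seconds taken), using the figures from this article.
# The ElevenLabs time is an assumed 6 s midpoint of the reported 5-7 s range.
# The Alltalk character count is assumed to match the Applio run.
runs = {
    "ElevenLabs": (5_000, 6.0),
    "Applio": (260_000, 1798.0),
    "Alltalk": (260_000, 5719.0),
}


def chars_per_second(chars: int, seconds: float) -> float:
    """Return throughput in characters per second."""
    return chars / seconds


for name, (chars, seconds) in runs.items():
    print(f"{name}: {chars_per_second(chars, seconds):.0f} chars/s")
```

Under those assumptions, Applio works out to roughly 145 characters per second and Alltalk to roughly 45, so Applio was a little over three times faster on my hardware.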
Conclusion
TTS programs have come a long way since the days of robotic screen readers. If you need TTS for an offline project or don’t want to pay for a cloud solution, I recommend going with Applio: its quality was the better of the two self-hosted options, and it was about three times faster than Alltalk. Otherwise, ElevenLabs is still the king when it comes to realism and speed.