LuxTTS is a lightweight, ZipVoice-based text-to-speech model designed for high-quality voice cloning and realistic generation at speeds exceeding 150x real time.
LuxTTS_demo.mp4
- Voice cloning: state-of-the-art voice cloning on par with models 10x larger.
- Clarity: clear 48 kHz speech generation, unlike most TTS models, which are limited to 24 kHz.
- Speed: reaches 150x real time on a single GPU, and faster than real time on CPUs as well.
- Efficiency: fits within 1 GB of VRAM, so it runs on virtually any local GPU.
You can try it locally, on Colab, or on Hugging Face Spaces. To run locally:
```bash
git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt
```
```python
from zipvoice.luxvoice import LuxTTS

## change device to 'cpu' for CPU-only usage
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda', threads=2)
```
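For CPU-only machines, the same constructor works with `device='cpu'`; a minimal sketch (the `threads` value here is an illustrative assumption, tune it to your core count):

```python
## hypothetical CPU setup; threads=4 is an example value, not a recommendation
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=4)
```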
Basic usage:

```python
import soundfile as sf
from IPython.display import Audio
text = "Hey, what's up? I'm feeling really great if you ask me honestly!"
## change this to your reference file path; wav and mp3 both work
prompt_audio = 'audio_file.wav'
## encode the reference audio (the first call takes ~10 s because librosa initializes)
encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)
## generate speech
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)  ## model is distilled to 4 steps
## save audio
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)
## display speech
display(Audio(final_wav, rate=48000))
```
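To sanity-check the speed claim on your own hardware, you can time generation and compare it to the duration of audio produced; a minimal sketch, assuming the objects from the basic example above:

```python
import time

start = time.time()
wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)
elapsed = time.time() - start

## realtime factor = seconds of audio produced per second of compute
audio_seconds = wav.numpy().squeeze().shape[-1] / 48000
print(f"Realtime factor: {audio_seconds / elapsed:.1f}x")
```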
A fuller example with all sampling parameters:

```python
import soundfile as sf
from IPython.display import Audio
text = "Hey, what's up? I'm feeling really great if you ask me honestly!"
## change this to your reference file path; wav and mp3 both work
prompt_audio = 'audio_file.wav'
rms = 0.01  ## higher values sound louder (around 0.01 recommended)
t_shift = 0.9  ## sampling param; higher can sound better but worsens WER
num_steps = 4  ## sampling param; higher sounds better but takes longer (3-4 is the efficiency sweet spot)
speed = 1.0  ## sampling param; controls audio speed (lower = slower)
return_smooth = False  ## sampling param; may sound smoother but less clean
ref_duration = 5  ## lower values speed up inference; set to 1000 if you hear artifacts
## encode the reference audio (the first call takes ~10 s because librosa initializes)
encoded_prompt = lux_tts.encode_prompt(prompt_audio, duration=ref_duration, rms=rms)
## generate speech
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=num_steps, t_shift=t_shift, speed=speed, return_smooth=return_smooth)
## save audio
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)
## display speech
display(Audio(final_wav, rate=48000))
```

Tips:
- Use a reference audio file of at least 3 seconds for voice cloning.
- Set return_smooth=True if you hear metallic artifacts (see the sketch after this list).
- Lower t_shift reduces the chance of pronunciation errors at the cost of quality, and vice versa.
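A minimal sketch combining the first two tips; `check_prompt_length` is a hypothetical helper, not part of the LuxTTS API:

```python
import soundfile as sf

def check_prompt_length(path, min_seconds=3.0):
    ## reject reference clips shorter than the recommended 3 seconds
    info = sf.info(path)
    duration = info.frames / info.samplerate
    if duration < min_seconds:
        raise ValueError(f"Reference audio is {duration:.1f}s; use at least {min_seconds:.0f}s.")

check_prompt_length(prompt_audio)
## if the default output sounds metallic, regenerate with smoothing enabled
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4, return_smooth=True)
```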
FAQ:

Q: How is this different from ZipVoice?
A: LuxTTS uses the same architecture, but distilled to 4 steps with an improved sampling technique. It also uses a custom 48 kHz vocoder instead of the default 24 kHz version.
Q: Can it be even faster?
A: Yes. It currently runs in float32; float16 should be significantly faster (almost 2x).
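Official float16 code is still on the roadmap below; if the pipeline is plain PyTorch, autocast may approximate it in the meantime (an untested sketch, not the project's API):

```python
import torch

## speculative: run generation under automatic mixed precision on GPU
with torch.autocast(device_type='cuda', dtype=torch.float16):
    final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)
```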
Roadmap:
- Release model and code
- Huggingface spaces demo
- Release MPS support
- Release code for float16 inference
The model and code are licensed under the Apache-2.0 license. See LICENSE for details.
Stars/Likes would be appreciated, thank you.
Email: yatharthsharma350@gmail.com