[Demo video: soprano-demo.mp4]
Soprano is an ultra-lightweight, open-source text-to-speech (TTS) model designed for real-time, high-fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy, using under 1 GB of VRAM.
With only 80M parameters, Soprano achieves a real-time factor (RTF) of ~2000x, generating 10 hours of audio in under 20 seconds. Its seamless streaming technique enables true real-time synthesis, with the first audio arriving in under 15 ms, multiple orders of magnitude faster than existing TTS pipelines.
Requirements: Linux or Windows with a CUDA-enabled GPU (CPU support coming soon!).
Install from PyPI:

```bash
pip install soprano-tts
pip uninstall -y torch
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
```

Or install from source:

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
pip uninstall -y torch
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
```

Note: Soprano uses LMDeploy to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace transformers backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model.
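For example (a minimal sketch; the `device` argument mirrors the quickstart below):

```python
from soprano import SopranoTTS

# Fall back to the HuggingFace transformers backend instead of LMDeploy
model = SopranoTTS(backend='transformers', device='cuda')
```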
```python
from soprano import SopranoTTS

model = SopranoTTS(backend='auto', device='cuda', cache_size_mb=10, decoder_batch_size=1)
```

Tip: You can increase `cache_size_mb` and `decoder_batch_size` to increase inference speed at the cost of higher memory usage.
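For instance (a sketch; the values below are illustrative, not tuned recommendations):

```python
# Trade memory for speed with a larger cache and decoder batch size
model = SopranoTTS(
    backend='auto',
    device='cuda',
    cache_size_mb=100,     # illustrative value
    decoder_batch_size=8,  # illustrative value
)
```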
```python
# Basic inference; can achieve 2000x real-time with sufficiently long input!
out = model.infer("Soprano is an extremely lightweight text to speech model.")

# Write the output directly to a file
out = model.infer("Soprano is an extremely lightweight text to speech model.", "out.wav")

# Custom sampling settings
out = model.infer(
    "Soprano is an extremely lightweight text to speech model.",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)

# Batched inference; can achieve 2000x real-time with sufficiently large input size!
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10)

# Write batched outputs to a directory
out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10, "/dir")
```

Streaming inference:

```python
import torch

stream = model.infer_stream("Soprano is an extremely lightweight text to speech model.", chunk_size=1)
# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk)  # first chunk arrives in <15 ms!
out = torch.cat(chunks)
```

Tips for best results:

- Soprano works best when each sentence is between 2 and 15 seconds long.
- Although Soprano recognizes numbers and some special characters, it occasionally mispronounces them. For best results, convert them to their spoken form (e.g., "1+1" → "one plus one").
- If a generation is unsatisfactory, simply regenerate; sampling is stochastic, so a new attempt may sound better. You can also adjust the sampling settings for more varied results.
- Avoid improper grammar and formatting, such as unnaturally avoiding contractions or using multiple consecutive spaces.
Soprano synthesizes speech at 32 kHz, delivering quality that is perceptually indistinguishable from 44.1/48 kHz audio and significantly sharper and clearer than the 24 kHz output used by many existing TTS models.
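For example, one way to save the returned audio at the correct rate (a sketch assuming `infer` returns a 1-D float waveform tensor; the `soundfile` dependency is my choice, not a stated requirement):

```python
import soundfile as sf

audio = model.infer("Soprano is an extremely lightweight text to speech model.")
# Soprano synthesizes at 32 kHz, so write the WAV at that sample rate.
sf.write("soprano_out.wav", audio.cpu().numpy(), samplerate=32000)
```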
Instead of slow diffusion decoders, Soprano uses a vocoder-based decoder built on the Vocos architecture, enabling orders-of-magnitude faster waveform generation while maintaining comparable perceptual quality.
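Not Soprano's actual decoder code, but a minimal sketch of the Vocos idea (all shapes and sizes here are illustrative): the network predicts a full spectrogram in one forward pass, and a single inverse STFT, rather than an iterative diffusion loop, produces the waveform.

```python
import torch

def vocos_style_decode(features, n_fft=1024, hop_length=256):
    # features: (batch, frames, n_fft + 2) from some backbone network
    log_mag, phase = features.split(n_fft // 2 + 1, dim=-1)
    # Build the complex spectrogram from predicted magnitude and phase
    spec = torch.polar(torch.exp(log_mag), phase)
    # One inverse STFT yields the waveform; no iterative sampling needed
    return torch.istft(spec.transpose(1, 2), n_fft=n_fft,
                       hop_length=hop_length,
                       window=torch.hann_window(n_fft))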
Soprano leverages the decoder’s finite receptive field to losslessly stream audio with ultra‑low latency. The streamed output is acoustically identical to offline synthesis, and streaming can begin after generating just 5 audio tokens, enabling <15 ms latency.
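Not the library's implementation, but a sketch of the underlying idea: with a finite receptive field, a token's audio stops changing once enough right-context exists, so streamed output is bit-identical to offline decoding. The `decode` helper and `lookahead` value are hypothetical; note that `lookahead=4` matches "streaming can begin after 5 tokens" (1 current + 4 future).

```python
def stream_decode(decode, token_stream, samples_per_token, lookahead=4):
    # decode(tokens) -> waveform is a hypothetical stand-in for the decoder.
    buf, emitted = [], 0
    for tok in token_stream:
        buf.append(tok)
        # A token's samples are final once `lookahead` future tokens exist,
        # because the decoder's receptive field is finite.
        while len(buf) - emitted > lookahead:
            audio = decode(buf)  # re-decodes the whole buffer for clarity
            start = emitted * samples_per_token
            yield audio[start:start + samples_per_token]
            emitted += 1
    # Flush the remaining tokens once the stream ends.
    if len(buf) > emitted:
        audio = decode(buf)
        yield audio[emitted * samples_per_token:]
```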
Speech is represented using a neural codec that compresses audio to ~15 tokens/sec at just 0.2 kbps, allowing extremely fast generation and efficient memory usage without sacrificing quality.
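To put those numbers in perspective (back-of-envelope arithmetic; the codebook size is an inference on my part, not a documented figure):

```python
bits_per_second = 200        # 0.2 kbps
tokens_per_second = 15
bits_per_token = bits_per_second / tokens_per_second  # ~13.3 bits/token
# 2**13 = 8192, so a single codebook of roughly 8k entries would be
# consistent with the stated bitrate.
samples_per_token = 32000 // tokens_per_second        # ~2133 samples at 32 kHz
```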
Each sentence is generated independently, enabling effectively unlimited output length while maintaining stability and real-time performance for long-form synthesis.
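A sketch of what this buys at the API level. The naive regex splitter and the assumption that `infer_batch` accepts a list of sentences and returns concatenable waveform tensors are mine; Soprano presumably handles the splitting internally.

```python
import re
import torch

long_text = "First sentence. Second sentence. Third sentence."
sentences = re.split(r'(?<=[.!?])\s+', long_text)

# Arbitrarily long documents reduce to a batch of bounded-length
# generations, so memory use and stability stay constant with length.
outs = model.infer_batch(sentences)
out = torch.cat(outs)
```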
I’m a second-year undergrad who’s just started working on TTS models, so I wanted to start small. Soprano was only pretrained on 1000 hours of audio (~100x less than other TTS models), so its stability and quality will improve tremendously as I train it on more data. Also, I optimized Soprano purely for speed, which is why it lacks bells and whistles like voice cloning, style control, and multilingual support. Now that I have experience creating TTS models, I have a lot of ideas for how to make Soprano even better in the future, so stay tuned for those!
Roadmap:

- Add model and inference code
- Seamless streaming
- Batched inference
- Command-line interface (CLI)
- Server / API inference
- Additional LLM backends
- CPU support
- Voice cloning
- Multilingual support
Soprano uses and/or is inspired by the following projects:
This project is licensed under the Apache-2.0 license. See LICENSE for details.