Soprano: Instant, Ultra‑Realistic Text‑to‑Speech

Demo video: soprano-demo.mp4

Overview

Soprano is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy, using under 1 GB of VRAM.

With only 80M parameters, Soprano achieves a real‑time factor (RTF) of ~2000×, generating 10 hours of audio in under 20 seconds. Its seamless streaming technique enables true real‑time synthesis with a first‑chunk latency under 15 ms, multiple orders of magnitude faster than existing TTS pipelines.
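
As a quick sanity check on those numbers (plain arithmetic, not code from the repo):

audio_seconds = 10 * 60 * 60   # 10 hours of audio = 36,000 seconds
rtf = 2000                     # claimed real-time factor
print(audio_seconds / rtf)     # 18.0 s of wall-clock time, consistent with "under 20 seconds"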


Installation

Requirements: Linux or Windows, CUDA‑enabled GPU required (CPU support coming soon!).

Install with wheel

pip install soprano-tts
pip uninstall -y torch
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

Install from source

git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
pip uninstall -y torch
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

Note: Soprano uses LMDeploy to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace transformers backend (with slower performance). To enable this, pass backend='transformers' when creating the TTS model.
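
For example, a minimal sketch of the fallback (constructor arguments other than backend mirror the Usage section below):

from soprano import SopranoTTS

# Use the HuggingFace transformers backend when LMDeploy is unavailable (slower)
model = SopranoTTS(backend='transformers', device='cuda')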


Usage

from soprano import SopranoTTS

model = SopranoTTS(backend='auto', device='cuda', cache_size_mb=10, decoder_batch_size=1)

Tip: You can increase cache_size_mb and decoder_batch_size to speed up inference at the cost of higher memory usage.
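
For instance (the values below are arbitrary examples, not tuned recommendations):

# Trade memory for speed: larger cache and decoder batch size
model = SopranoTTS(backend='auto', device='cuda', cache_size_mb=100, decoder_batch_size=8)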

Basic inference

out = model.infer("Soprano is an extremely lightweight text to speech model.") # can achieve 2000x real-time with sufficiently long input!

Save output to a file

out = model.infer("Soprano is an extremely lightweight text to speech model.", "out.wav")

Custom sampling parameters

out = model.infer(
    "Soprano is an extremely lightweight text to speech model.",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)

Batched inference

out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10) # can achieve 2000x real-time with sufficiently large input size!

Save batch outputs to a directory

out = model.infer_batch(["Soprano is an extremely lightweight text to speech model."] * 10, "/dir")

Streaming inference

import torch

stream = model.infer_stream("Soprano is an extremely lightweight text to speech model.", chunk_size=1)

# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk) # first chunk arrives in <15 ms!

out = torch.cat(chunks)
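
To play audio as it is generated rather than collecting chunks first, a sketch like this can work; it assumes each chunk is a 1-D float32 tensor at 32 kHz and uses the third-party sounddevice library, which is not part of Soprano:

import sounddevice as sd

# Write each chunk to the speakers as soon as it arrives
with sd.OutputStream(samplerate=32000, channels=1, dtype='float32') as speaker:
    for chunk in model.infer_stream("Streaming playback demo.", chunk_size=1):
        speaker.write(chunk.cpu().numpy().reshape(-1, 1))  # shape: (frames, channels)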

Usage tips:

  • Soprano works best when each sentence is between 2 and 15 seconds long.
  • Although Soprano recognizes numbers and some special characters, it occasionally mispronounces them. For best results, convert them to their spoken form (e.g., 1+1 -> one plus one); see the normalization sketch after this list.
  • If Soprano produces an unsatisfactory result, simply regenerate for a new, potentially better output. You can also adjust the sampling settings for more varied results.
  • Avoid improper grammar, such as omitting contractions or using multiple spaces.
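
A minimal text-normalization sketch for the numbers tip above, using the third-party num2words package; the symbol handling is illustrative, not exhaustive:

import re
from num2words import num2words

def normalize(text: str) -> str:
    # Spell out a few symbols Soprano may mispronounce (extend as needed)
    text = text.replace("+", " plus ").replace("=", " equals ").replace("%", " percent")
    # Spell out integers, e.g. "1+1" -> "one plus one"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("1+1"))  # -> "one plus one"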

Key Features

1. High‑fidelity 32 kHz audio

Soprano synthesizes speech at 32 kHz, delivering quality that is perceptually indistinguishable from 44.1/48 kHz audio and significantly sharper and clearer than the 24 kHz output used by many existing TTS models.

2. Vocoder‑based neural decoder

Instead of slow diffusion decoders, Soprano uses a vocoder‑based decoder with a Vocos architecture, enabling orders‑of‑magnitude faster waveform generation while maintaining comparable perceptual quality.

3. Seamless Streaming

Soprano leverages the decoder’s finite receptive field to losslessly stream audio with ultra‑low latency. The streamed output is acoustically identical to offline synthesis, and streaming can begin after generating just 5 audio tokens, enabling <15 ms latency.

4. State‑of‑the‑art neural audio codec

Speech is represented using a neural codec that compresses audio to ~15 tokens/sec at just 0.2 kbps, allowing extremely fast generation and efficient memory usage without sacrificing quality.
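
Those two figures imply roughly how much information each token carries (simple arithmetic, not a claim about the actual codebook design):

bits_per_token = 200 / 15   # 0.2 kbps / 15 tokens/sec ≈ 13.3 bits per token
# ~13 bits/token corresponds to a vocabulary on the order of 2**13 = 8192 entries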

5. Sentence‑level streaming for infinite context

Each sentence is generated independently, enabling effectively unbounded output length while maintaining stability and real‑time performance for long‑form synthesis.


Limitations

I’m a second-year undergrad who’s just started working on TTS models, so I wanted to start small. Soprano was only pretrained on 1000 hours of audio (~100× less than other TTS models), so its stability and quality will improve tremendously as I train it on more data. Also, I optimized Soprano purely for speed, which is why it lacks bells and whistles like voice cloning, style control, and multilingual support. Now that I have experience creating TTS models, I have a lot of ideas for how to make Soprano even better in the future, so stay tuned for those!


Roadmap

  • Add model and inference code
  • Seamless streaming
  • Batched inference
  • Command-line interface (CLI)
  • Server / API inference
  • Additional LLM backends
  • CPU support
  • Voice cloning
  • Multilingual support

Acknowledgements

Soprano uses and/or is inspired by the following projects:


License

This project is licensed under the Apache-2.0 license. See LICENSE for details.
