STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
An ultra-low-bandwidth voice communication system achieving 70–80 bps using Speech-to-Text, intelligent compression, and Text-to-Speech.
Update (2025-12-03): the paper preprint is available at https://arxiv.org/abs/2512.00451.
To showcase our STCTS system, here is an audio clip from LibriSpeech processed through Opus, EnCodec, and STCTS:
| Method | Bitrate | Audio |
|---|---|---|
| Original | 256 kbps | Listen |
| EnCodec | 6 kbps | Listen |
| Opus | 1 kbps | Listen |
| STCTS (Ours) | 90.4 bps* | Listen |
*w/o timbre
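As a rough sanity check on where a bitrate of this order comes from (every number below is an illustrative assumption, not a measurement from the paper): conversational English runs around 150 words per minute at roughly five characters per word, a strong text compressor gets English down to a couple of bits per character, and the remaining budget goes to sparsely sampled prosody metadata.

```python
# Back-of-envelope estimate with assumed (not measured) values.
words_per_minute = 150   # typical conversational English speaking rate
chars_per_word = 5       # rough average
bits_per_char = 2.0      # order of magnitude for strong text compression

text_bps = words_per_minute / 60 * chars_per_word * bits_per_char
prosody_bps = 50         # assumed budget for coarse pitch/energy samples

print(f"text = {text_bps:.0f} bps, text + prosody = {text_bps + prosody_bps:.0f} bps")
# -> text = 25 bps, text + prosody = 75 bps: the same order as the figures above.
```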
In some regions of the world, as well as at sea and in the air, network bandwidth is extremely scarce and expensive. These scenarios call for an end-to-end, two-way voice call system that consumes as little bandwidth as possible, making voice calls convenient and cheap.
- STT to convert voice into text while also extracting prosody and timbre
- Compression to minimize text and metadata (prosody, timbre)
- TTS to synthesize voice on the receiving end
Refer to the paper for more details.
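To make the decomposition concrete, here is a minimal sketch of what a per-utterance payload could look like. The field names and types are illustrative only, not the actual wire format (see src/compression/ and the paper for that):

```python
from dataclasses import dataclass, field

@dataclass
class ProsodyFrame:
    """One coarse prosody sample (illustrative fields)."""
    t_ms: int         # timestamp within the utterance
    f0_hz: float      # pitch; 0.0 where unvoiced
    energy_db: float  # loudness

@dataclass
class UtterancePayload:
    """One unit of transmission: text plus lightweight metadata."""
    text: str                                                   # from the STT stage
    prosody: list[ProsodyFrame] = field(default_factory=list)   # sparse samples
    speaker_id: int = 0   # index into a cached timbre embedding; the embedding
                          # itself would only need to be sent once, out of band

payload = UtterancePayload(
    text="hello world",
    prosody=[ProsodyFrame(0, 210.0, -23.5), ProsodyFrame(500, 180.0, -25.0)],
)
print(payload)
```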
```
stt-compress-tts/
├── src/
│   ├── stt/             # Speech-to-Text
│   ├── prosody/         # Prosody extraction
│   ├── speaker/         # Timbre/speaker identification
│   ├── compression/     # Multi-stage compression
│   ├── tts/             # Text-to-Speech
│   ├── network/         # WebRTC & packet handling
│   ├── pipeline/        # Sender/receiver pipelines
│   ├── util/            # Utility libraries
│   ├── signaling/       # The signaling server
│   ├── audio/           # Audio transmission
│   └── app/             # CLI application
├── frontend/            # React web interface
├── tests/               # Comprehensive test suite
├── benchmarks/          # Performance benchmarks
├── benchmarks_results/  # The benchmark results reported in the paper
├── configs/             # Quality mode configurations
├── plot_configs/        # Prosody Sampling Rate Analysis configurations
├── plots/               # Prosody Sampling Rate Analysis result plots
├── paper/               # Paper TeX sources
└── weights/             # NISQA model weights
```
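As an illustration of the timbre side (src/speaker/), an off-the-shelf speaker encoder can produce the compact embedding that serves as the timbre reference. Below is a minimal sketch using SpeechBrain's pretrained ECAPA-TDNN; the model choice, paths, and wiring here are illustrative, not necessarily what STCTS ships with:

```python
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain >= 1.0

# Load a pretrained ECAPA-TDNN speaker encoder (assumed model choice;
# weights are downloaded on first use).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="weights/spkrec-ecapa-voxceleb",
)

# The encoder expects 16 kHz mono audio; resample if needed.
signal, sr = torchaudio.load("reference.wav")
if sr != 16000:
    signal = torchaudio.functional.resample(signal, sr, 16000)

# One fixed-size vector per utterance: this is the "timbre" metadata.
embedding = encoder.encode_batch(signal)  # shape: (batch, 1, 192)
print(embedding.squeeze().shape)
```

If such an embedding is sent only once per call, it adds essentially nothing to the steady-state bitrate, which is consistent with the table above reporting bitrate without timbre.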
Make sure you have uv installed first.
Note that getting this project up and running is somewhat involved: you may need to manually edit the source code of some libraries to remove false-positive exceptions they raise. The dependency-hell problem has already been resolved: some dependencies are pinned to my custom forks hosted on my own Forgejo instance; feel free to contact me if you cannot access them.
```bash
# Create venv
uv venv

# Run the unit tests to auto-download the dependencies and models.
# Ideally, after manually resolving the library issues mentioned above, none should fail.
uv run pytest

# See the help screen of the benchmarks module
uv run python -m benchmarks.run_all --help

# Run the benchmark once to verify that everything works
uv run python -m benchmarks.run_all --audio <path/to/input/audio.wav> --reference <path/to/timbre/reference/audio.wav> --config balanced_mode

# Run the Prosody Sampling Rate Analysis
uv run python -m benchmarks.run_all --plot --librispeech 10000
uv run python -m benchmarks.run_all --plot-using-json benchmark_results.json

# Run the actual benchmark
uv run python -m benchmarks.run_all --librispeech 10000 --noise high_quality --output test_librispeech.json
uv run python -m benchmarks.run_all --interpret test_librispeech.json
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please read CONTRIBUTING.md for details.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) with additional restrictive terms.
For commercial licensing inquiries, please contact the project maintainers.
See the LICENSE file for complete terms.
- Faster-Whisper - Fast STT
- Coqui TTS - XTTS-v2 voice cloning
- SpeechBrain - Timbre/Speaker embeddings
- Parselmouth - Prosody extraction
- aiortc - WebRTC for Python
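For a taste of the prosody extraction that Parselmouth enables, here is a minimal standalone sketch that samples a pitch contour and an intensity track. The time step and feature set are illustrative assumptions; the paper's Prosody Sampling Rate Analysis studies this trade-off:

```python
import numpy as np
import parselmouth

# Load audio and sample coarse prosody features (time step is an assumed value).
snd = parselmouth.Sound("input.wav")
pitch = snd.to_pitch(time_step=0.05)          # F0 estimate every 50 ms
f0 = pitch.selected_array["frequency"]        # Hz; 0.0 where unvoiced
intensity = snd.to_intensity(time_step=0.05)  # loudness track in dB

voiced = f0[f0 > 0]
print(f"{len(f0)} pitch frames, {voiced.size} voiced, "
      f"median F0 = {np.median(voiced):.1f} Hz, "
      f"mean intensity = {intensity.values.mean():.1f} dB")
```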