STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
An ultra-low-bandwidth voice communication system achieving 70–80 bps using Speech-to-Text, intelligent compression, and Text-to-Speech.
Update (2025-12-03): the paper preprint is available at https://arxiv.org/abs/2512.00451.
To showcase our STCTS system, here is an audio clip from LibriSpeech processed through Opus, EnCodec, and STCTS:
| Method | Bitrate | Audio |
|---|---|---|
| Original | 256 kbps | Listen |
| EnCodec | 6 kbps | Listen |
| Opus | 1 kbps | Listen |
| STCTS (Ours) | 90.4 bps* | Listen |
*w/o timbre
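As a rough sanity check on where a bitrate of this order comes from (every number below is an illustrative assumption, not a measurement from the paper): conversational English runs around 150 words per minute at roughly five characters per word, a strong text compressor gets English down to a couple of bits per character, and the remaining budget goes to sparsely sampled prosody metadata.

```python
# Back-of-envelope estimate with assumed (not measured) values.
words_per_minute = 150   # typical conversational English speaking rate
chars_per_word = 5       # rough average
bits_per_char = 2.0      # order of magnitude for strong text compression

text_bps = words_per_minute / 60 * chars_per_word * bits_per_char
prosody_bps = 50         # assumed budget for coarse pitch/energy samples

print(f"text = {text_bps:.0f} bps, text + prosody = {text_bps + prosody_bps:.0f} bps")
# -> text = 25 bps, text + prosody = 75 bps: the same order as the figures above.
```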
In some regions of the world, as well as at sea and in the air, network bandwidth is extremely scarce and expensive. These scenarios call for an end-to-end, two-way voice call system that consumes as little bandwidth as possible, making voice calls convenient and cheap.
- STT to convert voice into text while also extracting prosody and timbre
- Compression to minimize text and metadata (prosody, timbre)
- TTS to synthesize voice on the receiving end
Refer to the paper for more details.
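To make the decomposition concrete, here is a minimal sketch of what a per-utterance payload could look like. The field names and types are illustrative only, not the actual wire format (see src/compression/ and the paper for that):

```python
from dataclasses import dataclass, field

@dataclass
class ProsodyFrame:
    """One coarse prosody sample (illustrative fields)."""
    t_ms: int         # timestamp within the utterance
    f0_hz: float      # pitch; 0.0 where unvoiced
    energy_db: float  # loudness

@dataclass
class UtterancePayload:
    """One unit of transmission: text plus lightweight metadata."""
    text: str                                                   # from the STT stage
    prosody: list[ProsodyFrame] = field(default_factory=list)   # sparse samples
    speaker_id: int = 0   # index into a cached timbre embedding; the embedding
                          # itself would only need to be sent once, out of band

payload = UtterancePayload(
    text="hello world",
    prosody=[ProsodyFrame(0, 210.0, -23.5), ProsodyFrame(500, 180.0, -25.0)],
)
print(payload)
```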
```
stt-compress-tts/
├── src/
│   ├── stt/             # Speech-to-Text
│   ├── prosody/         # Prosody extraction
│   ├── speaker/         # Timbre/speaker identification
│   ├── compression/     # Multi-stage compression
│   ├── tts/             # Text-to-Speech
│   ├── network/         # WebRTC & packet handling
│   ├── pipeline/        # Sender/receiver pipelines
│   ├── util/            # Utility libraries
│   ├── signaling/       # The signaling server
│   ├── audio/           # Audio transmission
│   └── app/             # CLI application
├── frontend/            # React web interface
├── tests/               # Comprehensive test suite
├── benchmarks/          # Performance benchmarks
├── benchmarks_results/  # The benchmark results reported in the paper
├── configs/             # Quality mode configurations
├── plot_configs/        # Prosody Sampling Rate Analysis configurations
├── plots/               # Prosody Sampling Rate Analysis result plots
├── paper/               # Paper TeX sources
└── weights/             # NISQA model weights
```
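As an illustration of the timbre side (src/speaker/), an off-the-shelf speaker encoder can produce the compact embedding that serves as the timbre reference. Below is a minimal sketch using SpeechBrain's pretrained ECAPA-TDNN; the model choice, paths, and wiring here are illustrative, not necessarily what STCTS ships with:

```python
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain >= 1.0

# Load a pretrained ECAPA-TDNN speaker encoder (assumed model choice;
# weights are downloaded on first use).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="weights/spkrec-ecapa-voxceleb",
)

# The encoder expects 16 kHz mono audio; resample if needed.
signal, sr = torchaudio.load("reference.wav")
if sr != 16000:
    signal = torchaudio.functional.resample(signal, sr, 16000)

# One fixed-size vector per utterance: this is the "timbre" metadata.
embedding = encoder.encode_batch(signal)  # shape: (batch, 1, 192)
print(embedding.squeeze().shape)
```

If such an embedding is sent only once per call, it adds essentially nothing to the steady-state bitrate, which is consistent with the table above reporting bitrate without timbre.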
Make sure you have uv installed first.
Note that getting this project up and running is somewhat involved: you may need to manually edit the source code of some libraries to remove false-positive exceptions they raise. The dependency-hell problem has already been resolved: some dependencies are pinned to my custom forks hosted on my own Forgejo instance; feel free to contact me if you cannot access them.
```bash
# Create venv
uv venv

# Run the unit tests to auto-download the dependencies and models.
# Ideally, after manually resolving the library issues mentioned above, none should fail.
uv run pytest

# See the help screen of the benchmarks module
uv run python -m benchmarks.run_all --help

# Run the benchmark once to verify that everything works
uv run python -m benchmarks.run_all --audio <path/to/input/audio.wav> --reference <path/to/timbre/reference/audio.wav> --config balanced_mode

# Run the Prosody Sampling Rate Analysis
uv run python -m benchmarks.run_all --plot --librispeech 10000
uv run python -m benchmarks.run_all --plot-using-json benchmark_results.json

# Run the actual benchmark
uv run python -m benchmarks.run_all --librispeech 10000 --noise high_quality --output test_librispeech.json
uv run python -m benchmarks.run_all --interpret test_librispeech.json
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please read CONTRIBUTING.md for details.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) with additional restrictive terms.
For commercial licensing inquiries, please contact the project maintainers.
See the LICENSE file for complete terms.
- Faster-Whisper - Fast STT
- Coqui TTS - XTTS-v2 voice cloning
- SpeechBrain - Timbre/Speaker embeddings
- Parselmouth - Prosody extraction
- aiortc - WebRTC for Python
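For a taste of the prosody extraction that Parselmouth enables, here is a minimal standalone sketch that samples a pitch contour and an intensity track. The time step and feature set are illustrative assumptions; the paper's Prosody Sampling Rate Analysis studies this trade-off:

```python
import numpy as np
import parselmouth

# Load audio and sample coarse prosody features (time step is an assumed value).
snd = parselmouth.Sound("input.wav")
pitch = snd.to_pitch(time_step=0.05)          # F0 estimate every 50 ms
f0 = pitch.selected_array["frequency"]        # Hz; 0.0 where unvoiced
intensity = snd.to_intensity(time_step=0.05)  # loudness track in dB

voiced = f0[f0 > 0]
print(f"{len(f0)} pitch frames, {voiced.size} voiced, "
      f"median F0 = {np.median(voiced):.1f} Hz, "
      f"mean intensity = {intensity.values.mean():.1f} dB")
```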