This is an unofficial implementation of the Spectrogram VQ module from the DCTTS (Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation) paper. DCTTS proposes a novel TTS approach that leverages discrete diffusion models and contrastive learning to generate high-quality speech. This repository specifically implements the Spectrogram VQ component described in Section 2.1 of the paper, which quantizes mel-spectrograms into discrete representations.
Figure 1. Spectrogram VQ architecture from the DCTTS paper
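To illustrate what "quantizing mel-spectrograms into discrete representations" means in practice, here is a minimal sketch of a VQ codebook with a straight-through gradient estimator. This is not the code from this repository; the class name `Codebook` and the default `num_codes` / `latent_dim` / `beta` values are placeholders chosen for the example.

```python
import torch.nn as nn
import torch.nn.functional as F

class Codebook(nn.Module):
    """Minimal VQ codebook: snaps encoder outputs to their nearest code vectors."""

    def __init__(self, num_codes=1024, latent_dim=256, beta=0.25):
        super().__init__()
        self.beta = beta  # commitment loss weight
        self.embedding = nn.Embedding(num_codes, latent_dim)
        self.embedding.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (B, C, T) latent sequence from the spectrogram encoder
        z_flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])  # (B*T, C)
        # squared L2 distance from each latent vector to every codebook entry
        dist = (z_flat.pow(2).sum(1, keepdim=True)
                - 2 * z_flat @ self.embedding.weight.t()
                + self.embedding.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)  # discrete token ids
        z_q = self.embedding(indices).view(z.shape[0], z.shape[2], -1).permute(0, 2, 1)
        # codebook loss + commitment loss (stop-gradient via .detach())
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z_q.detach(), z)
        # straight-through estimator: copy decoder gradients to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[0], z.shape[2]), vq_loss
```

The straight-through line (`z + (z_q - z).detach()`) passes decoder gradients around the non-differentiable nearest-neighbour lookup, which is the standard VQ-VAE/VQGAN trick that a codebook like this relies on.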
- Docker image: `pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel`
- GPU: NVIDIA RTX 4060 (8GB VRAM)
- Clone this repository and install the Python requirements: `pip install -r requirements.txt`
- Download the LJSpeech dataset and place it under `/workspace/data/` (recommended)
- Prepare mel-spectrogram features in `.npy` format: `python preprocess.py` (a hypothetical preprocessing sketch is shown after this list)
- Train the Spectrogram VQ model: `python train_vqgan.py`
- Run inference: open the `inference.ipynb` notebook for inference examples and usage.
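For reference, below is a hypothetical sketch of the mel-spectrogram preprocessing step, not the repository's actual `preprocess.py`. The input/output paths are placeholders, and the mel parameters (22050 Hz sampling rate, 1024-point FFT, hop length 256, 80 mel bins) are the common LJSpeech/HiFi-GAN defaults rather than values confirmed from this repo's config.

```python
# Hypothetical preprocessing sketch: paths and mel parameters are assumptions.
import glob
import os

import librosa
import numpy as np

def wav_to_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load a waveform and return a log-mel spectrogram of shape (n_mels, frames)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # clamp before the log to avoid -inf on silent frames
    return np.log(np.clip(mel, a_min=1e-5, a_max=None)).astype(np.float32)

if __name__ == "__main__":
    in_dir = "/workspace/data/LJSpeech-1.1/wavs"   # placeholder input path
    out_dir = "/workspace/data/mels"               # placeholder output path
    os.makedirs(out_dir, exist_ok=True)
    for wav_path in glob.glob(os.path.join(in_dir, "*.wav")):
        mel = wav_to_mel(wav_path)
        name = os.path.splitext(os.path.basename(wav_path))[0]
        np.save(os.path.join(out_dir, f"{name}.npy"), mel)
```

Each `.npy` file then holds a float32 array of shape `(n_mels, frames)` that a dataset loader can read directly during training.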
Audio samples can be found in the `sample/` directory. Most hyperparameters follow the VQGAN implementation from dome272/VQGAN-pytorch.
This implementation builds upon several excellent works:
- DCTTS Paper: Wu, Zhichao, et al. "DCTTS: Discrete diffusion model with contrastive learning for text-to-speech generation." ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
- VQGAN Implementation: dome272/VQGAN-pytorch for the VQGAN architecture reference.
- HiFi-GAN: keonlee9420's HiFi-GAN implementation for the vocoder parameters.
- Original HiFi-GAN: jik876/hifi-gan for the original HiFi-GAN model.
I am grateful to all the authors for making their work publicly available. 👏
