Unofficial implementation of Spectrogram VQ from DCTTS paper - Vector quantization of mel-spectrograms for discrete speech representation

Orca0917/Spectrogram-VQ

Spectrogram Vector Quantization

This is an unofficial implementation of the Spectrogram VQ module from the DCTTS (Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation) paper. DCTTS proposes a novel TTS approach that leverages discrete diffusion models and contrastive learning to generate high-quality speech. This repository specifically implements the Spectrogram VQ component described in Section 2.1 of the paper, which quantizes mel-spectrograms into discrete representations.
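As a rough illustration of what "quantizing into discrete representations" means, the following NumPy sketch performs the nearest-codebook lookup common to VQ-VAE/VQGAN-style models. The function name `quantize`, the toy codebook, and the shapes are illustrative assumptions, not values taken from this repository or the paper; the actual module applies this step to learned encoder outputs and trains the codebook jointly.

```python
import numpy as np


def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Map each latent vector to its nearest codebook entry.

    latents:  (N, D) array of encoder outputs (hypothetical shape).
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance via ||z||^2 - 2 z.e + ||e||^2, shape (N, K).
    dists = (
        np.sum(latents ** 2, axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + np.sum(codebook ** 2, axis=1)
    )
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Toy codebook: code k is the vector [k, k, k, k] (K=16, D=4, illustrative only).
codebook = np.arange(16, dtype=np.float64)[:, None] * np.ones((1, 4))
# Latents that sit slightly off codes 3, 7, 7 and 0.
latents = codebook[[3, 7, 7, 0]] + 0.1
indices, quantized = quantize(latents, codebook)
print(indices)  # [3 7 7 0]
```

The integer `indices` are the discrete representation; `quantized` is what a decoder would receive in place of the continuous latents.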

Figure 1. Spectrogram VQ architecture from the DCTTS paper

Environment

  • Docker image: pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
  • GPU: NVIDIA RTX 4060 (8GB VRAM)

Setup

  1. Clone this repository and install Python requirements:

    pip install -r requirements.txt
  2. Download the LJSpeech dataset and place it under /workspace/data/ (recommended)
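Before preprocessing, it can help to confirm that the dataset was extracted correctly. The sketch below checks for the standard LJSpeech-1.1 layout (a `metadata.csv` file next to a `wavs/` folder). The helper name `list_ljspeech_wavs` is hypothetical, and the exact subdirectory that `preprocess.py` expects under /workspace/data/ is an assumption; adjust the path to match the script.

```python
from pathlib import Path


def list_ljspeech_wavs(root):
    """Return sorted .wav paths under <root>/wavs, or raise if the layout is off."""
    root = Path(root)
    wav_dir = root / "wavs"
    # The LJSpeech-1.1 archive ships metadata.csv alongside a wavs/ directory.
    if not (root / "metadata.csv").is_file() or not wav_dir.is_dir():
        raise FileNotFoundError(f"{root} does not look like an LJSpeech folder")
    return sorted(wav_dir.glob("*.wav"))
```

For example, `list_ljspeech_wavs("/workspace/data/LJSpeech-1.1")` should return 13,100 paths for the full dataset, assuming that is where the archive was extracted.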

How to Run

  1. Prepare mel-spectrogram features in .npy format:

    python preprocess.py
  2. Train the Spectrogram VQ model:

    python train_vqgan.py
  3. Run inference by opening the inference.ipynb notebook, which contains inference examples and usage.
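To give a sense of what the feature-preparation step produces, here is a NumPy-only sketch of a log-mel spectrogram computation. It is not the code in `preprocess.py`. The constants are the ones from the HiFi-GAN LJSpeech config (22,050 Hz, 1024-point FFT, hop 256, 80 mel bands, 0 to 8000 Hz), and whether this repository uses exactly these values is an assumption. The filterbank uses the simple HTK mel formula without normalization, so the values will not match a librosa-based pipeline numerically.

```python
import numpy as np

# Assumed values (HiFi-GAN LJSpeech config); check preprocess.py for the real ones.
SR, N_FFT, HOP, N_MELS, FMIN, FMAX = 22050, 1024, 256, 80, 0.0, 8000.0


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    """Triangular filters on the HTK mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i : i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel(wav):
    """Return a log-mel spectrogram of shape (N_MELS, len(wav) // HOP)."""
    # Reflect-pad so that the frame count equals len(wav) // HOP.
    pad = (N_FFT - HOP) // 2
    wav = np.pad(wav, (pad, pad), mode="reflect")
    n_frames = 1 + (len(wav) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    window = np.hanning(N_FFT + 1)[:-1]                  # periodic Hann window
    spec = np.abs(np.fft.rfft(wav[idx] * window, axis=1))  # (T, N_FFT // 2 + 1)
    mel = mel_filterbank(SR, N_FFT, N_MELS, FMIN, FMAX) @ spec.T
    return np.log(np.clip(mel, 1e-5, None))              # floor avoids log(0)


# One second of a synthetic 440 Hz tone stands in for an LJSpeech clip.
wav = np.sin(2 * np.pi * 440.0 * np.arange(SR) / SR)
mel = log_mel(wav)
print(mel.shape)  # (80, 86)
# np.save("example_mel.npy", mel)  # hypothetical output name for the .npy feature
```

Each saved array is then a (mel bands, frames) matrix that the VQ model can load with `np.load`.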

Results

Audio samples can be found in the sample/ directory. Most hyperparameters follow the VQGAN implementation from dome272/VQGAN-pytorch.


Figure 2. Spectrogram indices visualization

Acknowledgements

This implementation builds upon several excellent works:

  • DCTTS Paper: Wu, Zhichao, et al. "DCTTS: Discrete diffusion model with contrastive learning for text-to-speech generation." ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
  • VQGAN Implementation: dome272/VQGAN-pytorch for the VQGAN architecture reference.
  • HiFi-GAN: keonlee9420's HiFi-GAN implementation for the vocoder parameters.
  • Original HiFi-GAN: jik876/hifi-gan for the original HiFi-GAN model.

I am grateful to all the authors for making their work publicly available. 👏
