(简体中文|English)

GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

arXiv | Paper HTML

Method Overview

GLAD (Global-Local Aware Dynamic Mixture-of-Experts) is a framework designed to address the challenges of multi-talker automatic speech recognition (MTASR). The motivation behind our design is as follows:

  • The Mixture-of-Experts (MoE) paradigm is well-suited for MTASR, as it can naturally handle varying numbers of speakers and different levels of speech overlap.
  • Speaker information is critical for MTASR, as it enables the model to understand who is speaking, which in turn helps determine what is being said.

Based on these insights, we propose the framework illustrated below:

Figure: Overview of the GLAD-SOT architecture, which applies the proposed GLAD to SOT. (a) GLAD-SOT leverages original speech features to generate global expert weights for encoder layers. (b) Each MoLE layer combines global and local weights to coordinate LoRA experts. (c) The global-local aware dynamic fusion module adaptively fuses global and local weights to guide expert collaboration.

Our main contributions are as follows:

  • To the best of our knowledge, we are the first to apply MoE architectures to end-to-end MTASR, showing significant improvements over conventional methods.
  • We introduce GLAD, which dynamically combines speaker-aware global context with fine-grained local acoustic features. This enables experts to focus on target speakers while capturing fine-grained speech patterns in multi-talker scenarios.
  • Our ablation studies investigate the role of global acoustic features and their support for speaker-aware expert routing in MTASR. For more details, please refer to our paper.

Training Data

Step 1: Navigate to the traindata directory and run run.sh to extract the data. This will generate two folders: generate and traindata.
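In shell terms, Step 1 is just the following (a minimal sketch, assuming run.sh takes no arguments and is invoked with bash):

```bash
# Enter the data directory and run the extraction script;
# this produces the generate/ and traindata/ folders described below.
cd traindata
bash run.sh
```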

Step 2:

  • The generate folder contains two annotation files:
    • train-960-1mix.jsonl: Single-talker speech from LibriSpeech-train-960.
    • train-960-2mix.jsonl: Two-talker speech created by mixing audio from two speakers from LibriSpeech-train-960.
  • Use the LibrispeechMix toolkit to generate the mixed audio. For each sample, the transcript is represented as "text1" (single-talker) or "text1 $ text2" (two-talker), where $ indicates a speaker change.
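To sanity-check the annotations, something like the following can be used (a sketch; it requires jq, and the .text field name is our assumption about the JSONL schema):

```bash
# Print the first few two-talker transcripts; a "$" between segments
# marks a speaker change, i.e. "text1 $ text2".
# NOTE: the .text field name is an assumption about the JSONL schema.
jq -r '.text' generate/train-960-2mix.jsonl | head -n 3
```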

Step 3:

  • The traindata directory includes:
    • wav.scp: An index file processed by ESPnet with speed perturbations (0.9x, 1.0x, 1.1x). This file illustrates the naming convention we used.
    • wavlist: A list of audio IDs used as training data in our experiments.
  • Filter the audio generated in Step 2 using wavlist to obtain the training data used in our paper.
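The filtering can be done with a standard awk join on utterance IDs (a sketch, assuming wavlist holds one audio ID per line and wav.scp uses the ID as its first field):

```bash
# While reading the first file (wavlist), NR==FNR holds: record its IDs.
# Then print only wav.scp lines whose first field is a recorded ID.
awk 'NR==FNR { ids[$1]; next } $1 in ids' traindata/wavlist traindata/wav.scp > wav_filtered.scp
```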

Using GLAD

This project is developed on top of the ESPnet framework.

GLAD-specific configuration files can be found here.

Step 1:

Replace the espnet, espnet2, and egs2 directories in your local ESPnet repository with the corresponding folders from this repo. Then, update the configuration files (e.g., data paths) according to your setup.
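The replacement might look like this (a sketch; GLAD_DIR and ESPNET_DIR are placeholder paths for this repository and your ESPnet checkout):

```bash
GLAD_DIR=/path/to/GLAD      # placeholder: where this repo is cloned
ESPNET_DIR=/path/to/espnet  # placeholder: your local ESPnet checkout

# Replace each directory in the ESPnet checkout with the GLAD version.
for d in espnet espnet2 egs2; do
  rsync -a --delete "$GLAD_DIR/$d/" "$ESPNET_DIR/$d/"
done
```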

Step 2:

Prepare the data and run run.sh.

First, execute the initial stages for data preparation, and then run stages 10 through 13 for training.
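With the standard ESPnet recipe options, that corresponds to something like the following (a sketch; the exact stop stage of the preparation phase is our assumption):

```bash
# Data preparation (initial stages of the recipe).
./run.sh --stage 1 --stop_stage 9
# Training: stages 10 through 13, as described above.
./run.sh --stage 10 --stop_stage 13
```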

Step 3: Use run_pi_scoring.sh to evaluate the model.
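A minimal invocation would be the following (a sketch; we assume the script picks up model and data paths from the recipe configuration):

```bash
# Score the trained model; the "pi" in the name suggests
# permutation-invariant scoring across speakers.
bash run_pi_scoring.sh
```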

The evaluation code is adapted from Speaker-Aware-CTC, and we appreciate their open-source contributions.

Contact

If you have any questions or are interested in collaboration, feel free to contact us via email:

[email protected]

Citation

If you find our work or code helpful, please consider citing our paper and giving this repository a ⭐.

```bibtex
@misc{guo2025gladgloballocalawaredynamic,
      title={GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR},
      author={Yujie Guo and Jiaming Zhou and Yuhang Jia and Shiwan Zhao and Yong Qin},
      year={2025},
      eprint={2509.13093},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.13093},
}
```

Acknowledgements

This project is built upon the ESPnet framework.

We would also like to thank the open-source projects that inspired and supported parts of our implementation: LibrispeechMix and Speaker-Aware-CTC.
