Add ESPnet OWSM-CTC Models #105

pyf98 · 2025-10-27T03:55:28Z

This PR adds OWSM-CTC models from ESPnet to the HF OpenASR Leaderboard. (cc: Shinji @sw005320)

These models correspond to Table 7 in our paper, OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning (INTERSPEECH 2025 Best Student Paper 🏆).

All results were obtained on an NVIDIA H100 GPU. The script requires flash_attn.
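Since the script depends on flash_attn, it may be worth confirming the package is importable before launching a long evaluation run. The following is a minimal sketch, not part of this PR: the helper name `missing_packages` is hypothetical, and it only checks that a module can be located, not that it works with the installed CUDA or PyTorch build.

```python
import importlib.util


def missing_packages(names):
    """Return the entries of `names` that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "flash_attn" is the requirement named in the PR description; add other
# module names to this list if your setup needs them (assumption).
missing = missing_packages(["flash_attn"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages found.")
```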

Results on paper

Results reproduced on Oct 26

********************************************************************************
Results per dataset:
********************************************************************************
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 13.09 %, RTFx = 278.49
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.90 %, RTFx = 764.45
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.83 %, RTFx = 688.73
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.56 %, RTFx = 777.28
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 4.86 %, RTFx = 686.31
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.56 %, RTFx = 961.04
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 4.40 %, RTFx = 864.91
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 7.34 %, RTFx = 999.36

********************************************************************************
Composite Results:
********************************************************************************
espnet/owsm_ctc_v4_1B: WER = 7.44 %
espnet/owsm_ctc_v4_1B: RTFx = 776.52
********************************************************************************

********************************************************************************
Results per dataset:
********************************************************************************
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 15.66 %, RTFx = 303.25
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.74 %, RTFx = 807.36
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 11.89 %, RTFx = 732.42
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.36 %, RTFx = 841.03
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.12 %, RTFx = 742.38
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.87 %, RTFx = 1065.78
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 4.97 %, RTFx = 957.90
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 8.37 %, RTFx = 1083.02

********************************************************************************
Composite Results:
********************************************************************************
espnet/owsm_ctc_v3.1_1B: WER = 8.12 %
espnet/owsm_ctc_v3.1_1B: RTFx = 846.99
********************************************************************************

********************************************************************************
Results per dataset:
********************************************************************************
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 16.71 %, RTFx = 314.28
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.51 %, RTFx = 812.86
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 11.78 %, RTFx = 746.60
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.61 %, RTFx = 840.09
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.32 %, RTFx = 767.39
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.73 %, RTFx = 1074.06
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 5.35 %, RTFx = 930.88
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 7.95 %, RTFx = 1156.67

********************************************************************************
Composite Results:
********************************************************************************
espnet/owsm_ctc_v3.2_ft_1B: WER = 8.24 %
espnet/owsm_ctc_v3.2_ft_1B: RTFx = 860.52
********************************************************************************
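As a sanity check on the logs above, the composite WER appears to be the unweighted mean of the eight per-dataset WERs. The sketch below parses the espnet/owsm_ctc_v4_1B lines and recomputes that mean. The regular expression and the helper names (`parse_results`, `mean_wer`) are illustrative assumptions, not code from the leaderboard scripts.

```python
import re

# Per-dataset lines copied verbatim from the espnet/owsm_ctc_v4_1B log above.
LOG = """\
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 13.09 %, RTFx = 278.49
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.90 %, RTFx = 764.45
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.83 %, RTFx = 688.73
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.56 %, RTFx = 777.28
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 4.86 %, RTFx = 686.31
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.56 %, RTFx = 961.04
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 4.40 %, RTFx = 864.91
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 7.34 %, RTFx = 999.36
"""

# Pattern for one "model | dataset: WER = x %, RTFx = y" line (assumed format,
# inferred from the log text rather than from the scoring script).
LINE_RE = re.compile(
    r"^(?P<model>\S+) \| (?P<dataset>\S+): "
    r"WER = (?P<wer>[\d.]+) %, RTFx = (?P<rtfx>[\d.]+)$"
)


def parse_results(log):
    """Return a list of (model, dataset, wer, rtfx) tuples found in `log`."""
    rows = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append(
                (
                    match["model"],
                    match["dataset"],
                    float(match["wer"]),
                    float(match["rtfx"]),
                )
            )
    return rows


def mean_wer(rows):
    """Unweighted mean of the per-dataset WER values."""
    return sum(row[2] for row in rows) / len(rows)


rows = parse_results(LOG)
print(f"mean WER = {mean_wer(rows):.2f} %")  # prints "mean WER = 7.44 %"
```

The recomputed value matches the reported composite WER of 7.44 %. The composite RTFx (776.52) does not equal the plain mean of the per-dataset RTFx values (about 752.57). That would be consistent with it being computed from total audio duration over total processing time, although this is only an inference from the numbers.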

Background

Open Whisper-style Speech Model (OWSM) is the first fully open Whisper-style speech foundation model. It reproduces and advances OpenAI's Whisper-style training using publicly available data and open-source toolkits. The code, pre-trained model weights, and training logs are publicly released to promote open science in speech foundation models.

Models

Deep-unlearning · 2025-11-06T12:22:41Z

Hi @pyf98!

Thanks for your contribution!
I will run the script and let you know the results!

pyf98 added 3 commits October 26, 2025 18:59

add espnet

6019b91

add espnet owsmctc scripts

e0996d6

update espnet

4e277da
