@pyf98 commented on Oct 27, 2025

This PR adds OWSM-CTC models from ESPnet to the HF OpenASR Leaderboard. (cc: Shinji @sw005320)

These models correspond to Table 7 in our paper, OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning (INTERSPEECH 2025 Best Student Paper 🏆).

Evaluation was run on an NVIDIA H100 GPU. The script requires flash_attn.

Results reported in the paper

(screenshot of Table 7 from the paper)

Results reproduced on Oct 26

********************************************************************************
Results per dataset:
********************************************************************************
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 13.09 %, RTFx = 278.49
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.90 %, RTFx = 764.45
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.83 %, RTFx = 688.73
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.56 %, RTFx = 777.28
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 4.86 %, RTFx = 686.31
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.56 %, RTFx = 961.04
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 4.40 %, RTFx = 864.91
espnet/owsm_ctc_v4_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 7.34 %, RTFx = 999.36

********************************************************************************
Composite Results:
********************************************************************************
espnet/owsm_ctc_v4_1B: WER = 7.44 %
espnet/owsm_ctc_v4_1B: RTFx = 776.52
********************************************************************************

********************************************************************************
Results per dataset:
********************************************************************************
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 15.66 %, RTFx = 303.25
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.74 %, RTFx = 807.36
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 11.89 %, RTFx = 732.42
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.36 %, RTFx = 841.03
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.12 %, RTFx = 742.38
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.87 %, RTFx = 1065.78
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 4.97 %, RTFx = 957.90
espnet/owsm_ctc_v3.1_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 8.37 %, RTFx = 1083.02

********************************************************************************
Composite Results:
********************************************************************************
espnet/owsm_ctc_v3.1_1B: WER = 8.12 %
espnet/owsm_ctc_v3.1_1B: RTFx = 846.99
********************************************************************************

********************************************************************************
Results per dataset:
********************************************************************************
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 16.71 %, RTFx = 314.28
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 13.51 %, RTFx = 812.86
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 11.78 %, RTFx = 746.60
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.61 %, RTFx = 840.09
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.32 %, RTFx = 767.39
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.73 %, RTFx = 1074.06
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 5.35 %, RTFx = 930.88
espnet/owsm_ctc_v3.2_ft_1B | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 7.95 %, RTFx = 1156.67

********************************************************************************
Composite Results:
********************************************************************************
espnet/owsm_ctc_v3.2_ft_1B: WER = 8.24 %
espnet/owsm_ctc_v3.2_ft_1B: RTFx = 860.52
********************************************************************************
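As a sanity check on the logs above, the composite WER appears to be the unweighted (macro) average of the eight per-dataset WERs. A minimal sketch, assuming that averaging rule (it is not taken from the leaderboard code):

```python
# Reproduced per-dataset WERs (%), copied from the logs above.
wers_v4 = [13.09, 13.90, 10.83, 2.56, 4.86, 2.56, 4.40, 7.34]   # owsm_ctc_v4_1B
wers_v31 = [15.66, 13.74, 11.89, 2.36, 5.12, 2.87, 4.97, 8.37]  # owsm_ctc_v3.1_1B

def composite_wer(wers):
    # Unweighted mean over datasets, assuming that is the leaderboard's rule.
    return sum(wers) / len(wers)

print(round(composite_wer(wers_v4), 2))   # matches the reported 7.44
print(round(composite_wer(wers_v31), 2))  # matches the reported 8.12
```

The composite RTFx is not reproduced here: it does not match a simple mean of the per-dataset values, so it is presumably computed from total audio duration and total decoding time.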

Background

Open Whisper-style Speech Model (OWSM) is the first fully open Whisper-style speech foundation model. It reproduces and advances OpenAI's Whisper-style training using publicly available data and open-source toolkits. The code, pre-trained model weights, and training logs are publicly released to promote open science in speech foundation models.

Models
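For anyone trying these checkpoints outside the leaderboard harness, a minimal inference sketch is below. It assumes the espnet2 `Speech2TextGreedySearch` interface used on the OWSM-CTC model cards; the argument names and values (language/task symbols, batch size, context length) are assumptions and should be checked against the model card. This would require `espnet`, `espnet_model_zoo`, `librosa`, and (on GPU) `flash_attn` to be installed.

```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

# Load an OWSM-CTC checkpoint from the Hugging Face Hub.
# use_flash_attn=True matches the H100 setup described above;
# set it to False on hardware without flash_attn.
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v4_1B",
    device="cuda",
    use_flash_attn=True,
    lang_sym="<eng>",   # language symbol (assumed)
    task_sym="<asr>",   # ASR task symbol (assumed)
)

# OWSM models expect 16 kHz audio.
speech, _ = librosa.load("sample.wav", sr=16000)

# Long-form decoding in fixed-size chunks with overlapping context.
text = s2t.batch_decode(speech, batch_size=16, context_len_in_secs=4)
print(text)
```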

@Deep-unlearning (Collaborator) commented:

Hi @pyf98 !

Thanks for your contribution!
I will run the script and let you know the results!
