Conversation

p88h commented Feb 15, 2025

Runs vLLM in greedy decoding mode with high batch parallelism. Tested up to batch size 128 on an RTX 4080.
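
For reference, a minimal sketch of what this setup looks like through vLLM's Whisper support. The prompt string and `multi_modal_data` layout follow vLLM's audio examples and may differ between versions; `max_num_seqs`, `max_tokens`, and the placeholder waveforms here are illustrative assumptions, not the exact values used in this PR:

```python
# Sketch: greedy Whisper decoding with wide batching in vLLM.
# Assumes a vLLM build with Whisper (encoder-decoder) support.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="openai/whisper-large-v3", max_num_seqs=128)  # allow large batches
greedy = SamplingParams(temperature=0.0, max_tokens=200)      # greedy decoding

# Placeholder audio: four 5-second clips of silence at 16 kHz.
waveforms = [np.zeros(16_000 * 5, dtype=np.float32) for _ in range(4)]

prompts = [
    {
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {"audio": (w, 16_000)},
    }
    for w in waveforms
]

# vLLM schedules and batches the requests internally, up to max_num_seqs at a time.
outputs = llm.generate(prompts, greedy)
texts = [o.outputs[0].text for o in outputs]
```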

For the AMI dataset with the large-v3 model, this configuration achieves:
WER: 16.0 %, RTFx: 63.56

It seems a bit faster than the transformers backend, mostly thanks to the wider possible batch size (transformers maxes out at batch 32 on the same GPU, achieving an RTFx of 53.76).

It should scale proportionally on better hardware and allow even wider batch sizes with more GPU memory.

It achieves a slightly higher WER at the moment (16.0 vs. 15.94), at least for this model.

For distil-large-v2, the WER results were almost identical, again with some performance advantage.
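
For context on the metric: RTFx is the inverse real-time factor, i.e. total audio duration divided by total wall-clock transcription time, so higher is faster (an RTFx of 63.56 means roughly 63 seconds of audio transcribed per second). A rough sketch of the measurement, with `transcribe_batch` standing in for whichever backend is being timed:

```python
import time

def measure_rtfx(transcribe_batch, waveforms, sample_rate=16_000):
    """Inverse real-time factor: audio seconds / wall-clock seconds.

    `transcribe_batch` is a placeholder for the backend under test
    (vLLM or transformers); RTFx > 1 means faster than real time.
    """
    audio_seconds = sum(len(w) for w in waveforms) / sample_rate
    start = time.perf_counter()
    transcribe_batch(waveforms)
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```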

By the way, this particular dataset is likely not very representative of AMI in general, and the (very low) WER results don't translate well to the whole original dataset (which has very long recordings). When testing on 30-second chunks, most models perform at ~25 WER rather than ~15.

titu1994 and others added 30 commits July 25, 2023 13:14
* Refactor data and normalizer

* Update transformers

* Update requirements

* Update requirements

* revert datasets for HF
* Update eval script for Fast Conformer NeMo models to support write and post-scoring

* Add evaluate helper

* Alias manifest utils in data utils

* Update eval script for HF models to support write and post-scoring

* Add comments

Signed-off-by: smajumdar <[email protected]>

* Fix detection of dataset id

Signed-off-by: smajumdar <[email protected]>

* Add checks for empty string in model filtering for eval script

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
* Add XL and XXL RNNT and CTC models

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* update max samples

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* use single batch size

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
* speechbrain initial get_model fn

* wav2vec / run_eval.py working

* conformer.sh

* add .sh

* remove pycache

* fix batch size

* docstring

* docstring

* updt

* speechbrain requirements

* speechbrain requirements

* fix wer?

* manifest

* gitignore / remove savedir arg

* remove speechbrain/ path

* gitignore

* update wav2vec

* cv

* update scripts

* fix issue composite wer
…ers_models

inference: Loop over transformers models
p88h (author) commented Mar 20, 2025

Hey, any updates here? @Vaibhavs10 @Deep-unlearning

Nithin Rao Koluguri and others added 4 commits April 29, 2025 17:16
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
p88h (author) commented May 15, 2025

ping?

nithinraok (collaborator) commented:

@Deep-unlearning

Deep-unlearning (collaborator) commented May 16, 2025

Hi @p88h and @nithinraok, thanks for the ping! I'll be running the vLLM benchmarks on the datasets shortly and will share the verified WER / RTFx numbers here. Once that's done, I'll go ahead and add the results to the Whisper leaderboard.

Deep-unlearning (collaborator) commented:

I ran the evaluation with openai/whisper-large-v3 and got the following results:

Composite Results:
openai/whisper-large-v3: WER = 7.45 %
openai/whisper-large-v3: RTFx = 129.66

Breakdown per dataset:

  • AMI: WER = 16.01 %, RTFx = 67.41
  • Earnings22: WER = 11.29 %, RTFx = 132.12
  • GigaSpeech: WER = 10.08 %, RTFx = 110.92
  • LibriSpeech (clean): WER = 2.04 %, RTFx = 135.80
  • LibriSpeech (other): WER = 3.89 %, RTFx = 124.96
  • SPGISpeech: WER = 2.95 %, RTFx = 148.46
  • TEDLIUM: WER = 3.84 %, RTFx = 131.54
  • VoxPopuli: WER = 9.52 %, RTFx = 163.54

The WERs are in line with the transformers backend, but I would have expected a higher RTFx (faster inference) from the vLLM backend. Are we sure the backend switch is properly impacting decoding speed?

I also tested openai/whisper-tiny.en, and the results seem unusually bad:

Composite Results:
openai/whisper-tiny.en: WER = 91.78 %
openai/whisper-tiny.en: RTFx = 584.87

Breakdown per dataset:

  • AMI: WER = 99.10 %, RTFx = 241.97
  • Earnings22: WER = 93.63 %, RTFx = 476.19
  • GigaSpeech: WER = 87.53 %, RTFx = 507.63
  • LibriSpeech (clean): WER = 80.68 %, RTFx = 537.27
  • LibriSpeech (other): WER = 88.43 %, RTFx = 516.17
  • SPGISpeech: WER = 96.19 %, RTFx = 727.25
  • TEDLIUM: WER = 96.50 %, RTFx = 527.08
  • VoxPopuli: WER = 92.22 %, RTFx = 669.52

This looks like something went wrong, possibly a decoding issue or a mismatch in the config. Could you help investigate?

p88h (author) commented May 21, 2025

Interesting. I haven't played much with the tiny model, but it does look like there's some issue.
I'll have a look over the weekend.
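
One thing worth checking, purely as a guess: English-only checkpoints like whisper-tiny.en use a different decoder prompt than the multilingual ones (no `<|en|><|transcribe|>` tokens), so forcing a multilingual-style prompt on them can derail decoding entirely, which would explain WERs in the 90s. The difference is easy to inspect with transformers:

```python
from transformers import WhisperTokenizer

# Multilingual checkpoint: the forced decoder prompt carries language/task tokens.
tok_ml = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")
ids_ml = [tid for _, tid in tok_ml.get_decoder_prompt_ids(language="en", task="transcribe")]
print(tok_ml.decode(ids_ml))  # includes <|en|><|transcribe|>

# English-only checkpoint: no language/task tokens at all.
tok_en = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
ids_en = [tid for _, tid in tok_en.get_decoder_prompt_ids()]
print(tok_en.decode(ids_en))  # only the timestamp-suppression token
```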
