Conversation

p88h commented Feb 15, 2025

Runs vLLM in greedy decoding mode with high batch parallelism. Tested up to batch size 128 on an RTX 4080.
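
For reference, a minimal sketch of what this setup looks like through vLLM's Whisper support. The prompt string and `multi_modal_data` layout follow vLLM's audio examples and may differ between versions; `max_num_seqs`, `max_tokens`, and the placeholder waveforms here are illustrative assumptions, not the exact values used in this PR:

```python
# Sketch: greedy Whisper decoding with wide batching in vLLM.
# Assumes a vLLM build with Whisper (encoder-decoder) support.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="openai/whisper-large-v3", max_num_seqs=128)  # allow large batches
greedy = SamplingParams(temperature=0.0, max_tokens=200)      # greedy decoding

# Placeholder audio: four 5-second clips of silence at 16 kHz.
waveforms = [np.zeros(16_000 * 5, dtype=np.float32) for _ in range(4)]

prompts = [
    {
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {"audio": (w, 16_000)},
    }
    for w in waveforms
]

# vLLM schedules and batches the requests internally, up to max_num_seqs at a time.
outputs = llm.generate(prompts, greedy)
texts = [o.outputs[0].text for o in outputs]
```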

For the AMI dataset with the large-v3 model, this configuration achieves:
WER: 16.0 %, RTFx: 63.56

It seems a bit faster than the transformers backend, mostly thanks to the wider possible batch size (transformers maxes out at batch 32 on the same GPU, achieving an RTFx of 53.76).

It should scale proportionally on better hardware and allow even wider batch sizes with more GPU memory.

It achieves a slightly higher WER at the moment (16.0 vs. 15.94), at least for this model.

For distil-large-v2, the WER results were almost identical, again with some performance advantage.
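
For context on the metric: RTFx is the inverse real-time factor, i.e. total audio duration divided by total wall-clock transcription time, so higher is faster (an RTFx of 63.56 means roughly 63 seconds of audio transcribed per second). A rough sketch of the measurement, with `transcribe_batch` standing in for whichever backend is being timed:

```python
import time

def measure_rtfx(transcribe_batch, waveforms, sample_rate=16_000):
    """Inverse real-time factor: audio seconds / wall-clock seconds.

    `transcribe_batch` is a placeholder for the backend under test
    (vLLM or transformers); RTFx > 1 means faster than real time.
    """
    audio_seconds = sum(len(w) for w in waveforms) / sample_rate
    start = time.perf_counter()
    transcribe_batch(waveforms)
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```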

By the way, this particular dataset is likely not very representative of AMI in general, and the (very low) WER results don't translate well to the whole original dataset (which has very long recordings). When testing on 30-second chunks, most models perform at ~25 WER rather than ~15.

titu1994 and others added 30 commits July 25, 2023 13:14
* Refactor data and normalizer

* Update transformers

* Update requirements

* Update requirements

* revert datasets for HF
* Update eval script for Fast Conformer NeMo models to support write and post-scoring

* Add evaluate helper

* Alias manifest utils in data utils

* Update eval script for HF models to support write and post-scoring

* Add comments

Signed-off-by: smajumdar <[email protected]>

* Fix detection of dataset id

Signed-off-by: smajumdar <[email protected]>

* Add checks for empty string in model filtering for eval script

Signed-off-by: smajumdar <[email protected]>

---------

Signed-off-by: smajumdar <[email protected]>
* Add XL and XXL RNNT and CTC models

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* update max samples

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* use single batch size

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
* speechbrain initial get_model fn

* wav2vec / run_eval.py working

* conformer.sh

* add .sh

* remove pycache

* fix batch size

* docstring

* docstring

* updt

* speechbrain requirements

* speechbrain requirements

* fix wer?

* manifest

* gitignore / remove savedir arg

* remove speechbrain/ path

* gitignore

* update wav2vec

* cv

* update scripts

* fix issue composite wer
…ers_models

inference: Loop over transformers models
p88h (author) commented Mar 20, 2025

Hey, any updates here? @Vaibhavs10 @Deep-unlearning

Nithin Rao Koluguri and others added 4 commits April 29, 2025 17:16
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
p88h (author) commented May 15, 2025

ping?

nithinraok (collaborator) commented:

@Deep-unlearning

Deep-unlearning (collaborator) commented May 16, 2025

Hi @p88h and @nithinraok, thanks for the ping! I'll be running the vLLM benchmarks on the datasets shortly and will share the verified WER / RTFx numbers here. Once that's done, I'll go ahead and add the results to the Whisper leaderboard.

Deep-unlearning (collaborator) commented:

I ran the evaluation with openai/whisper-large-v3 and got the following results:

Composite Results:
openai/whisper-large-v3: WER = 7.45 %
openai/whisper-large-v3: RTFx = 129.66

Breakdown per dataset:

  • AMI: WER = 16.01 %, RTFx = 67.41
  • Earnings22: WER = 11.29 %, RTFx = 132.12
  • GigaSpeech: WER = 10.08 %, RTFx = 110.92
  • LibriSpeech (clean): WER = 2.04 %, RTFx = 135.80
  • LibriSpeech (other): WER = 3.89 %, RTFx = 124.96
  • SPGISpeech: WER = 2.95 %, RTFx = 148.46
  • TEDLIUM: WER = 3.84 %, RTFx = 131.54
  • VoxPopuli: WER = 9.52 %, RTFx = 163.54

The WERs are in line with the transformers backend, but I would have expected a higher RTFx (faster inference) from the vLLM backend. Are we sure the backend switch is properly impacting decoding speed?

I also tested openai/whisper-tiny.en, and the results seem unusually bad:

Composite Results:
openai/whisper-tiny.en: WER = 91.78 %
openai/whisper-tiny.en: RTFx = 584.87

Breakdown per dataset:

  • AMI: WER = 99.10 %, RTFx = 241.97
  • Earnings22: WER = 93.63 %, RTFx = 476.19
  • GigaSpeech: WER = 87.53 %, RTFx = 507.63
  • LibriSpeech (clean): WER = 80.68 %, RTFx = 537.27
  • LibriSpeech (other): WER = 88.43 %, RTFx = 516.17
  • SPGISpeech: WER = 96.19 %, RTFx = 727.25
  • TEDLIUM: WER = 96.50 %, RTFx = 527.08
  • VoxPopuli: WER = 92.22 %, RTFx = 669.52

This looks like something went wrong, possibly a decoding issue or a mismatch in the config. Could you help investigate?

p88h (author) commented May 21, 2025

Interesting. I haven't played much with the tiny model, but it does look like there's some issue.
I'll have a look over the weekend.
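
One thing worth checking, purely as a guess: English-only checkpoints like whisper-tiny.en use a different decoder prompt than the multilingual ones (no `<|en|><|transcribe|>` tokens), so forcing a multilingual-style prompt on them can derail decoding entirely, which would explain WERs in the 90s. The difference is easy to inspect with transformers:

```python
from transformers import WhisperTokenizer

# Multilingual checkpoint: the forced decoder prompt carries language/task tokens.
tok_ml = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")
ids_ml = [tid for _, tid in tok_ml.get_decoder_prompt_ids(language="en", task="transcribe")]
print(tok_ml.decode(ids_ml))  # includes <|en|><|transcribe|>

# English-only checkpoint: no language/task tokens at all.
tok_en = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")
ids_en = [tid for _, tid in tok_en.get_decoder_prompt_ids()]
print(tok_en.decode(ids_en))  # only the timestamp-suppression token
```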
