Skip to content

Conversation

@louisjoecodes
Copy link

@louisjoecodes louisjoecodes commented Oct 28, 2025

Fixes hardcoded language_code="eng" in ElevenLabs transcription that was causing poor performance on multilingual ASR benchmarks.

Changes

  • Added extract_language_code() helper to parse language from dataset names (e.g., fleurs_frfr)
  • Updated transcribe_with_retry() to accept and use dataset parameter
  • Replaced hardcoded "eng" with dynamic language extraction

Results (on small sample size)

Before fix (hardcoded "eng"):

  • French FLEURS: 26.34% WER (leaderboard: 19.75%)
  • Portuguese FLEURS: 35.98% WER (leaderboard: 22.8%)

After fix (dynamic language code):

  • French FLEURS: 3.99% WER (85% improvement)
  • Portuguese FLEURS: 4.55% WER (87% improvement)

Previously, language_code was hardcoded to "eng" for all ElevenLabs
transcriptions, causing poor WER on multilingual benchmarks (e.g.,
French: 26.34% WER, Portuguese: 35.98% WER).

This fix:
- Extracts language code from dataset name (e.g., "fleurs_fr" → "fr")
- Dynamically sets language_code parameter based on dataset
- Defaults to "en" for English-only datasets (ami, librispeech, etc.)

Test results:
- French: 26.34% → 3.99% WER (85% improvement)
- Portuguese: 35.98% → 4.55% WER (87% improvement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant