srt2audiotrack builds polished, multilingual voice-over tracks from subtitle files while keeping the original mix intact. The tooling now combines text normalisation, speaker-aware F5-TTS synthesis, Whisper-based validation, Demucs source separation, and FFmpeg mastering in a resumable pipeline that can fan out across multiple workers.
- **End-to-end pipeline**: rewrites subtitles, enriches CSV metadata, synthesises aligned narration, balances the mix, and renders a muxed video output. Every stage runs only when its artefact is missing, so interrupted jobs pick up where they left off.
- **Speaker-aware synthesis**: per-speaker reference audio, transcripts, and speed curves drive F5-TTS segment generation; any missing `speeds.csv` files are generated automatically.
- **Automatic quality checks**: generated speech is round-tripped through Whisper to confirm it matches the subtitle text. Mismatches are logged with similarity scores for manual review.
- **Job manifests & cooperative locking**: manifests expand into ordered subtitle queues, and per-job lock files prevent duplicate processing across workers, with automatic stale-lock recovery.
- **Subtitle normalisation**: applies vocabulary substitutions and writes `_0_mod.srt`.
- **CSV enrichment & speakers**: converts SRT to CSV, injects speaker columns, and assigns TTS speeds from speaker metadata.
- **Segment synthesis & validation**: F5-TTS renders per-line audio, regenerating segments that are too short and flagging Whisper mismatches for audit spreadsheets.
- **Timing correction & stitching**: fixes CSV end times from the generated waveforms and concatenates the mono narration into a full FLAC track before upmixing to stereo.
- **Source separation & mixing**: extracts the original soundtrack, prepares a normalised accompaniment, applies interval-based gain curves, and muxes everything back together with FFmpeg.
```
┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────────┐
│ Subtitle             │   │ CSV & speaker        │   │ Timing correction &      │
│ normalisation        │──▶│ enrichment           │──▶│ narration stitching      │
└──────────────────────┘   └──────────────────────┘   └──────────────────────────┘
           │                          │                            │
           ▼                          ▼                            ▼
   Vocabulary rules           Speaker metadata              FLAC narration
           │                          │                            │
           └─────────────┬────────────┴───────────────┬────────────┘
                         ▼                            ▼
             ┌──────────────────────┐     ┌──────────────────────┐
             │ Source audio prep    │     │ Dynamic mixing &     │
             │ (demucs, loudness)   │────▶│ final muxing         │
             └──────────────────────┘     └──────────────────────┘
```
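The resume behaviour noted above (each stage runs only when its artefact is missing) boils down to a skip-if-present check around every step. Here is a minimal sketch of the pattern; `run_stage` and the stage callables are illustrative names, not the project's actual API:

```python
from pathlib import Path
from typing import Callable

def run_stage(artefact: Path, build: Callable[[Path], None]) -> Path:
    """Run ``build`` only when its output artefact is missing.

    Interrupted jobs therefore re-run only the stages whose outputs
    are absent; completed artefacts are reused as-is.
    """
    if not artefact.exists():
        build(artefact)
    return artefact

# Hypothetical usage: each stage writes exactly one artefact.
# normalised = run_stage(out_dir / "example_0_mod.srt", write_normalised_srt)
```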
- Create and activate an environment (example using conda):

  ```bash
  conda create -n srt2audio python=3.10
  conda activate srt2audio
  ```

- Install the core runtime dependencies:

  ```bash
  pip install f5-tts demucs librosa soundfile numpy ffmpeg-python
  ```

- Install the project requirements (includes Whisper, Demucs, F5-TTS, FastAPI, etc.):

  ```bash
  pip install -r requirements.txt
  ```

- Ensure `ffmpeg` is available on your `PATH` for muxing and loudness normalisation.
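A quick sanity check before the first run; this only confirms that the `ffmpeg` binary and the core Python packages are importable, nothing project-specific:

```bash
ffmpeg -version
python -c "import demucs, librosa, soundfile, numpy"
```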
Each subtitle/video set should contain a neighbouring `VOICE/` directory with:

- Reference `.wav` files for each speaker (the first one becomes the default).
- Matching `.txt` transcripts so synthesis can validate reference text.
- Optional `speeds.csv` envelopes per speaker; missing files are generated automatically using the F5-TTS helper.
- A shared `vocabular.txt` file; it is created on demand if absent.

See `tests/one_voice` for a minimal layout.
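A hypothetical layout (the file names are illustrative; the pairing of `.wav`, `.txt`, and speed files follows from the description above):

```
records/
├── episode.mp4
├── episode.srt
└── VOICE/
    ├── alice.wav      # reference audio; the first speaker becomes the default
    ├── alice.txt      # transcript used to validate the reference
    ├── speeds.csv     # optional speed envelope (generated when missing)
    └── vocabular.txt  # shared substitutions (created on demand)
```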
Process every `.srt` file in a directory, producing muxed videos beside the originals:

```bash
python -m srt2audiotrack --subtitle path/to/records
```

Process a single subtitle/video pair:

```bash
python -m srt2audiotrack --subtitle path/to/video.srt
```

- Use `--job-manifest-dir` to point at newline-delimited job files; relative paths are resolved next to the manifest and duplicates are automatically removed (see the sample manifest below).
- Provide `--worker-id` (or rely on the hostname) so lock files record who owns a job. Locks refresh on a heartbeat and are reclaimed when stale, enabling safe restarts across machines.
For a subtitle named `example.srt`, intermediate files live under `OUTPUT/example/` while the final muxed video is written beside the subtitle (or into `--output_folder`). The pipeline checks for each artefact before running a step, so reruns process only the missing stages.
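Illustratively, for `example.srt` processed in place; apart from `_0_mod.srt`, the lock file, and the output suffix, the artefact names below are assumptions:

```
path/to/
├── example.srt
├── example_out_mix.mp4        # final muxed video (--outfileending suffix)
└── OUTPUT/
    └── example/
        ├── example_0_mod.srt  # normalised subtitles
        ├── example.lock       # cooperative lock file
        └── ...                # per-line segments, narration FLAC, stems
```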
| Option | Description | Default |
|---|---|---|
| `--subtitle` | Path to subtitle file or directory to scan | `records` |
| `--speeds` | Path to default speeds table | `speeds.csv` |
| `--delay` | Minimum gap used when collapsing subtitle lines | `0.00001` |
| `--voice` | Reference voice file used for diagnostics | `basic_ref_en.wav` |
| `--text` | Reference text used with `--voice` | "some call me nature, others call me mother nature." |
| `--videoext` | Expected video extension when scanning folders | `.mp4` |
| `--srtext` | Subtitle extension when scanning folders | `.srt` |
| `--outfileending` | Suffix for rendered video names | `_out_mix.mp4` |
| `--vocabular` | Override path to vocabulary file | `vocabular.txt` |
| `--config` / `-c` | Optional TOML configuration file | `basic.toml` |
| `--acomponiment_coef` | Mix level for the background accompaniment | `0.2` |
| `--voice_coef` | Mix level for generated voice | `0.2` |
| `--output_folder` | Custom directory for pipeline artefacts and final video | same as subtitle parent |
| `--job-manifest-dir` | Folder containing job manifest files | (empty) |
| `--worker-id` | Identifier recorded in lock files | hostname or `PIPELINE_WORKER_ID` |
| `--lock-timeout` | Seconds before a lock is considered stale | `1800.0` |
| `--lock-heartbeat` | Seconds between lock refreshes | `60.0` |

(See `python -m srt2audiotrack --help` for the authoritative list.)
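For example, to process a folder of `.mkv` videos with a louder narration mix and a custom output directory:

```bash
python -m srt2audiotrack \
    --subtitle path/to/records \
    --videoext .mkv \
    --voice_coef 0.3 \
    --acomponiment_coef 0.15 \
    --output_folder out
```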
- Run the Python unit tests:

  ```bash
  pytest tests/unit
  ```

- Sample subtitle fixtures live in `tests/one_voice` and `tests/multi_voice`.
- The `tests/test_whisper_metrics.py` script exercises the Whisper validation pipeline.
The repository ships with a microservice-oriented demo web application under `srt2audiotrack-docker/`. The compose stack splits responsibilities across four containers:
| Service | Role | Exposed port | Notes |
|---|---|---|---|
| `tts_service` | Generates mock narration audio from plain text. | `8001` | Coordinates synthesis with a `.lock` file. |
| `demucs_service` | Performs a lightweight Demucs-style source separation. | `8002` | Uses a `.lock` for its workspace. |
| `subtitles_service` | Stores subtitle vocabulary in a SQLite database. | `8003` | Uses a `.lock` for SQLite writes. |
| `orchestrator` | Web UI that orchestrates the three backend services. | `8000` | HTTP frontend for the services. |
- Docker Engine 20.10+
- Docker Compose v2 (`docker compose` CLI)
```bash
cd srt2audiotrack-docker
docker compose build
# or refresh the base images first
docker compose build --pull
```

```bash
cd srt2audiotrack-docker
docker compose up
# Run in detached mode once you are happy with the logs:
docker compose up -d
```

The orchestrator UI becomes available at http://localhost:8000. Submit text (and optional subtitle snippets) to exercise the round-trip across the TTS, Demucs, and subtitle vocabulary services. Named volumes persist generated audio and the SQLite database between runs.
To stop the stack and remove containers, run:
```bash
docker compose down
```

All services expose a FastAPI app on the ports listed above. Once built, you can reuse the images in other compose files or orchestration platforms. For example:

```bash
docker run --rm -p 9000:8001 tts_service
docker run --rm -p 9001:8002 demucs_service
docker run --rm -p 9002:8003 subtitles_service
docker run --rm -p 9003:8000 \
  -e TTS_URL=http://host.docker.internal:9000 \
  -e DEMUCS_URL=http://host.docker.internal:9001 \
  -e SUBTITLES_URL=http://host.docker.internal:9002 \
  orchestrator
```

With the containers running, the CLI module remains available in parallel for offline batch conversion.
- **Inspection**: lock files live beside the subtitle output directory (e.g. `OUTPUT/example/example.lock`). They are plain text and record the current worker ID, timestamps, and heartbeat interval.
- **Refreshing**: active workers refresh their lock on a background heartbeat. If a worker stops unexpectedly, the lock becomes stale after `--lock-timeout` seconds and other workers automatically reclaim the job.
- **Manual recovery**: when coordinating manually, you can delete or rename a stale lock file if you are sure no other worker is operating on the job. On the next manifest scan, an available worker obtains a fresh lock and resumes from cached artefacts.
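For illustration, a lock file's contents might look like the following. The exact field names and layout here are assumptions; only the recorded data (worker ID, timestamps, heartbeat interval) comes from the description above:

```
worker_id: render-node-03
acquired_at: 2024-05-01T12:00:00Z
last_heartbeat: 2024-05-01T12:04:00Z
heartbeat_interval: 60.0
```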
Use the high-level helper to invoke the pipeline programmatically:
```python
from pathlib import Path

from srt2audiotrack import SubtitlePipeline

SubtitlePipeline.create_video_with_english_audio(
    video_path="video.mp4",
    subtitle=Path("subtitles.srt"),
    speakers=speakers_config,             # speaker metadata built from the VOICE/ assets
    default_speaker=speakers_config["en"],
    vocabular=Path("VOICE/vocabular.txt"),
    acomponiment_coef=0.3,
    voice_coef=0.2,
    output_folder=Path("out"),
)
```

This wrapper wires up the same pipeline used by the CLI while allowing advanced dependency injection for testing.
- Verify the external CLIs are available:

  ```bash
  python -m whisper --help
  python -m demucs.separate --help
  python -m f5_tts.cli --help
  ```

- If a job is skipped with a lock warning, inspect the `.lock` file inside the subtitle output folder to confirm the active worker ID, or delete stale locks after the timeout has elapsed.
Happy dubbing!