[tiktok] extract subtitles and all cover types #8805
+203
−36
Description
This PR adds the ability to extract subtitles for TikTok video posts. Additionally, all available video covers can now be extracted by setting `extractor.tiktok.covers` to `"all"`. Previously, only a single cover, selected from a priority list, could be extracted.
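As a rough illustration of the two options named above, the sketch below builds the corresponding configuration fragment as a Python dict and prints it as JSON, the format gallery-dl config files use. The option names and values come from this description; the Python wrapper is only there to make the fragment easy to inspect and is not part of the PR.

```python
import json

# Sketch of a gallery-dl configuration fragment using the options
# described in this PR. Values are illustrative.
config = {
    "extractor": {
        "tiktok": {
            # true -> extract only the ASR subtitles (default: no subtitles)
            "subtitles": True,
            # "all" -> extract every available cover instead of a single one
            "covers": "all",
        }
    }
}

# Print the fragment in the JSON form a config file would contain.
print(json.dumps(config, indent=4))
```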
Implementation
Subtitles
By default, no subtitles are grabbed, but setting `extractor.tiktok.subtitles` to `true` extracts the ASR (automatic speech recognition) subtitles. To get all available subtitles, the option can be set to `"all"`, and specific languages can be selected with a comma-separated list. Filtering by subtitle source (ASR, machine translation, creator caption) is also possible with the same option. Example configurations were added to `configuration.rst`.

Since the subtitles do not have a single unique ID, I did not add a new `subtitle_id` field, but rather made the provided keys available:

- `subtitle_lang_id`: Seems to be an internal numeric ID for the language.
- `subtitle_lang_code`: A non-standard language identifier like `eng-US` or `cmn-Hans-CN`. This key is used for the language filter in the extractor config.
- `subtitle_format`: Content format of the subtitle.
- `subtitle_version`: Version of the subtitle format, sometimes followed by a translation engine, as in `4:agent_deepSeek`.
- `subtitle_source`: Short descriptor of how the subtitles were generated, like `ASR`, `MT` or `LC`. This key is used for the source filter in the extractor config.

To prevent duplicate archive entries, all keys except `subtitle_format` have been added to the default `archive_fmt`. The new keys are only present when the current URL is for a subtitle, and the default `archive_fmt` was modified to include the subtitle-specific keys only when they are relevant, to keep existing archives compatible.

The automatic subtitle selection was chosen to behave similarly to yt-dlp's `--write-subs` setting, although yt-dlp has a more sophisticated pipeline for prioritizing and filtering subtitles. For TikTok, it seems reasonable to get the auto-recognized (ASR) subtitles, which should always be in the original audio language.

Covers
Instead of only extracting the first cover from a hardcoded priority list, all covers can be extracted by setting `extractor.tiktok.covers` to `"all"`. Further filtering should be possible with `extractor.*.image-filter` if one wishes to do so.

The type of the cover is stored in the key `cover_id`, as was already the case before. To prevent duplicate archive entries, this key has been added to the default `archive_fmt`. Similar to the subtitle keys, `cover_id` is only added to the archive if it is actually set, to keep backwards compatibility.

Tests
An additional test for cover extraction with `"#count": 3` was added.

I did not add tests for subtitles yet, as I noticed that the last PR #8715 has broken many tests (the existing extractors now return intermediate URLs instead of the final resources). Apparently the `results` tests are excluded from `run_tests.py`, so this change was probably overlooked.

Opinion
I was not sure if it makes sense to generate a `subtitle_id` that can be used for ensuring unique filenames and archive entries, or if we should just let users decide for themselves which keys are necessary for uniqueness. Depending on the configuration, the language alone may be enough, or a combination of multiple fields may be necessary.

In the end, the fields I added to `archive_fmt` more or less represent an ID, but doing it this way may be a bit cumbersome.

`extractor.*.image-filter`, although I have not tested that.

What I imagine may be interesting is the ASR subtitles in the original language plus MT in the user's local language, if available. Right now, this case is not possible to configure using `extractor.tiktok.subtitles` alone.
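To make the "ASR plus local MT" case concrete, here is a minimal sketch of the selection logic it would need. It is not code from this PR: the function name `pick_asr_plus_local_mt` and the sample data are hypothetical, and it assumes each subtitle is a dict carrying the `subtitle_source` and `subtitle_lang_code` keys described above.

```python
def pick_asr_plus_local_mt(subtitles, local_lang):
    """Keep every ASR track plus MT tracks in the user's language.

    Hypothetical helper: assumes each entry is a dict with the
    'subtitle_source' and 'subtitle_lang_code' keys from this PR.
    """
    selected = []
    for sub in subtitles:
        source = sub.get("subtitle_source")
        lang = sub.get("subtitle_lang_code")
        # ASR is kept regardless of language (original audio language);
        # MT is kept only when it matches the requested language.
        if source == "ASR" or (source == "MT" and lang == local_lang):
            selected.append(sub)
    return selected


# Made-up sample metadata; "spa-ES" is an invented code for illustration.
subs = [
    {"subtitle_source": "ASR", "subtitle_lang_code": "cmn-Hans-CN"},
    {"subtitle_source": "MT", "subtitle_lang_code": "eng-US"},
    {"subtitle_source": "MT", "subtitle_lang_code": "spa-ES"},
    {"subtitle_source": "LC", "subtitle_lang_code": "cmn-Hans-CN"},
]

chosen = pick_asr_plus_local_mt(subs, "eng-US")
for sub in chosen:
    print(sub["subtitle_source"], sub["subtitle_lang_code"])
```

The point of the sketch is that the condition mixes a source-only rule with a source-and-language rule, which is why a single flat filter list may not be enough to express it.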