@bassberry bassberry commented Jan 1, 2026

Description

This PR adds the ability to extract subtitles for TikTok video posts.

Additionally, all available video covers can now be extracted by setting extractor.tiktok.covers to "all".
Previously, only a single cover, chosen from a priority list, could be extracted.

Implementation

Subtitles

By default, no subtitles are downloaded. Setting extractor.tiktok.subtitles to true extracts the ASR (automatic speech recognition) subtitles, and setting it to "all" extracts every available subtitle. Specific languages can be selected with a comma-separated list, and the same option also filters by subtitle source (ASR, machine translation, creator caption). Example configurations were added to configuration.rst.
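
As a rough illustration of the selection rules described above, the following sketch filters a list of subtitle entries by the option value. The function name `select_subtitles`, the dict layout, and the matching details (each token is treated as either a source descriptor or a language code, and source names are matched case-insensitively) are assumptions for illustration and are not taken from the code in this PR.

```python
# Source descriptors mentioned in this PR (ASR, MT, LC).
SOURCES = {"ASR", "MT", "LC"}


def select_subtitles(option, subtitles):
    """Return the subtitle entries that match a 'subtitles' option value.

    Hypothetical helper: 'subtitles' is a list of dicts that carry the
    keys 'subtitle_lang_code' and 'subtitle_source'.
    """
    if not option:
        # Default: no subtitles.
        return []
    if option is True:
        # true selects only the auto-recognized (ASR) subtitles.
        return [s for s in subtitles if s["subtitle_source"] == "ASR"]
    if option == "all":
        return list(subtitles)

    # Comma-separated list: split into source filters and language filters.
    tokens = [t.strip() for t in option.split(",") if t.strip()]
    sources = {t.upper() for t in tokens if t.upper() in SOURCES}
    langs = {t for t in tokens if t.upper() not in SOURCES}

    # An empty filter set means "no restriction" for that dimension.
    return [
        s for s in subtitles
        if (not sources or s["subtitle_source"] in sources)
        and (not langs or s["subtitle_lang_code"] in langs)
    ]
```

Under these assumptions, a value such as "ASR, eng-US" keeps only entries that are both ASR-generated and tagged eng-US, while "eng-US" alone keeps every eng-US entry regardless of source.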

Since the subtitles do not have a single unique ID, I did not add a new subtitle_id field, but rather made the provided keys available:

  • subtitle_lang_id: Appears to be an internal numeric ID for the language.
  • subtitle_lang_code: A non-standard language identifier like eng-US or cmn-Hans-CN. This key is used for the language filter in the extractor config.
  • subtitle_format: Content format of the subtitle.
  • subtitle_version: Version of the subtitle format, sometimes followed by a translation engine like 4:agent_deepSeek.
  • subtitle_source: Short descriptor of how the subtitles were generated, such as ASR, MT or LC. This key is used for the source filter in the extractor config.

To prevent duplicate archive entries, all keys except subtitle_format have been added to the default archive_fmt. The new keys are only present when the current URL points to a subtitle, and the default archive_fmt includes the subtitle-specific keys only when they are relevant, so existing archives stay compatible.
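
The compatibility idea can be sketched in plain Python: subtitle keys are appended to an archive entry only when they are set, so entries for non-subtitle files are unchanged. The helper `archive_key`, the `base` argument, and the underscore separator are hypothetical and do not reproduce the actual archive_fmt string used in this PR.

```python
# Keys added to the archive entry; subtitle_format is deliberately left out.
SUBTITLE_KEYS = (
    "subtitle_lang_id",
    "subtitle_lang_code",
    "subtitle_version",
    "subtitle_source",
)


def archive_key(base, metadata):
    """Build an archive entry from a base entry and optional subtitle keys.

    Hypothetical helper: keys that are missing or empty are skipped, so
    the result equals 'base' for files that are not subtitles.
    """
    parts = [base]
    for key in SUBTITLE_KEYS:
        value = metadata.get(key)
        if value:
            parts.append(str(value))
    return "_".join(parts)
```

With this approach, an entry written before the change and an entry written afterwards for the same video file are identical, and only subtitle files get the longer form.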

The automatic subtitle selection was chosen to behave similarly to yt-dlp's --write-subs setting, although yt-dlp has a more sophisticated pipeline for prioritizing and filtering subtitles. For TikTok, it seems reasonable to get the auto-recognized subtitles, which should always be in the original audio language.

Covers

Instead of only extracting the first cover from a hardcoded priority list, all covers can be extracted by setting extractor.tiktok.covers to "all". Further filtering should be possible with extractor.*.image-filter.
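
A minimal sketch of the two modes, assuming the available covers are given as a mapping from cover type to URL. The function `select_covers` and its parameters are hypothetical, and the priority list is passed in rather than hardcoded because the actual list is not reproduced here.

```python
def select_covers(option, available, priority):
    """Return (cover_id, url) pairs according to a 'covers' option value.

    Hypothetical helper: 'available' maps cover types to URLs and
    'priority' is an ordered sequence of cover types.
    """
    if not option:
        return []
    if option == "all":
        # Every cover that has a URL, in the order the mapping provides.
        return [(cid, url) for cid, url in available.items() if url]
    # Previous behavior: the first cover found in the priority list.
    for cid in priority:
        url = available.get(cid)
        if url:
            return [(cid, url)]
    return []
```

Each returned pair carries the cover type, which corresponds to the role of cover_id in keeping the resulting files distinguishable.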

The type of the cover is stored in the key cover_id, as was already the case before. To prevent duplicate archive entries, this key has been added to the default archive_fmt. As with the subtitle keys, cover_id is only added to the archive entry if it is actually set, to keep backwards compatibility.

Tests

An additional test for cover extraction with "#count": 3 was added.
I have not added tests for subtitles yet, because I noticed that the last PR, #8715, broke many tests: the existing extractors now return intermediate URLs instead of the final resources. The results tests are apparently excluded from run_tests.py, so this change was probably overlooked.

Opinion

  • TikTok does not provide filenames for the subtitles in the URLs, and no IDs are available either.
    I was not sure whether it makes sense to generate a subtitle_id that ensures unique filenames and archive entries, or whether users should decide for themselves which keys are necessary for uniqueness. Depending on the configuration, the language alone may be enough, or a combination of multiple fields may be necessary.
    In the end, the fields I added to archive_fmt more or less represent an ID, but doing it this way may be a bit cumbersome.
  • I am not sure whether the way subtitles can be filtered with the new configuration makes sense. It is easy to get either all subtitles or only the auto-generated ones. The latter probably suits most users who want subtitles, as it yields subtitles that match the original language and are at least close to the spoken text. More advanced filtering should also be possible with extractor.*.image-filter, although I have not tested that.
    One combination that may be interesting is the ASR subtitles in the original language plus MT subtitles in the user's local language, if available. Right now, this case cannot be configured using extractor.tiktok.subtitles alone.

…available.

The values are set to '' where they are not applicable.
Having `img_id` is necessary for the default `archive_fmt`; the other fields are handled for consistency.
The previous behavior is kept as-is, but setting the "covers" option to "all" now grabs all available covers.
Allows filtering subtitles by source type (ASR, MT) and language.
Although TikTok may serve the covers with JPEG content, the file extension can be `.image`.
The test before 0c14b16 failed because the asserted URL did not match all cover types, but the pattern now used requires the mentioned file extension.
These subtitles have the key "Format" set to "creator_caption" and the key "Source" set to "LC".