@mikf
Owner

@mikf mikf commented Oct 25, 2025

In an attempt to reduce resource usage when running gallery-dl and searching for a suitable extractor for a URL, I've moved several static "resources" (mostly GraphQL queries) and functions & classes (mostly API interfaces) out of their main extractor modules into separate /extractor/utils modules, so they don't get needlessly loaded when importing extractor modules and matching their RegExp patterns.

This currently amounts to 7k of 40k lines of code, or 270 kB of Python bytecode, that no longer gets loaded when searching for an extractor class, but only when needed.
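
A minimal sketch of the general lazy-loading idea described above, not the code from this PR. The `LazyModule` helper is hypothetical, and the standard-library `json` module is used only as a stand-in for a large utils module so that the example runs on its own.

```python
import importlib


class LazyModule:
    """Proxy that imports a module on first attribute access (hypothetical helper)."""

    def __init__(self, name):
        self._name = name
        self._module = None  # nothing imported yet

    def __getattr__(self, attr):
        # __getattr__ only runs for names not found on the proxy itself,
        # so _name and _module are resolved normally and never recurse.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Stand-in for a heavy utils module; "json" is used only so the sketch runs.
utils = LazyModule("json")
loaded_before = utils._module is not None
payload = utils.dumps({"query": "..."})  # first access triggers the import
loaded_after = utils._module is not None
print(loaded_before, loaded_after, payload)
```

Creating the proxy costs almost nothing at import time. The real import happens once, on first use, which is why pattern matching alone would never pay for it.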

Let me know what you think of this idea, what else could/should be exported, and if this whole ordeal broke anything.

@thatfuckingbird
Contributor

Interesting, where did this idea come from? I never noticed gallery-dl being slow. I do use extractor.find in hydownloader to provide some info about whether a URL is supported and by what extractor, but since that's done in a long-running daemon process on request, the load time reduction would only apply at the very first call if I understand correctly.

Do you have benchmarks before vs. after timings? Personally I would only consider doing something like this if there is significant measurable impact (and it actually matters for some use case) or if it actually improves code organization. Can't say I'm a fan of splitting extractors into 2 places but maybe that's just what I'm used to.

@mikf
Owner Author

mikf commented Oct 28, 2025

> where did this idea come from?

I was attempting to fix the joyreactor extractors (#6642) and noticed that it would require several hundred or even thousands of lines of GraphQL queries. This is the idea I came up with, although it spiraled a bit out of control.

> I never noticed gallery-dl being slow

Well, it is not really "slow" considering it's written in Python, but its startup could be faster and it is getting progressively slower as more and more extractor modules are added.

> the load time reduction would only apply at the very first call if I understand correctly.

Exactly, but I assume most users run gallery-dl with only one URL at a time, and making the process of finding a matching pattern fast is somewhat important in that case. At least it is something I like to improve and work on, for what it's worth.

> Do you have benchmarks before vs. after timings?

Well... not really. I can measure an insignificant reduction of ~10-20 ms (460 ms -> 440 ms) on my machine for loading all modules when inputting an unsupported "URL". I had hoped that there would be a more noticeable effect, but oh well. Better than nothing, I guess.
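
For anyone who wants to reproduce this kind of before/after comparison, here is one possible way to time a cold import. It is a sketch only and not necessarily how the numbers above were obtained. The helper names `best_run` and `import_cost` are made up for illustration, and `json` again stands in for the module under test.

```python
import subprocess
import sys
import time


def best_run(code, runs=5):
    """Best wall-clock time in seconds for a fresh interpreter to run `code`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # A new process per run, so nothing is cached in sys.modules.
        subprocess.run([sys.executable, "-c", code], check=True)
        best = min(best, time.perf_counter() - start)
    return best


def import_cost(module, runs=5):
    """Rough import cost: timed import minus bare interpreter startup."""
    baseline = best_run("pass", runs)
    return best_run(f"import {module}", runs) - baseline


if __name__ == "__main__":
    # "json" is a stand-in; substitute the module you want to measure.
    print(f"{import_cost('json') * 1000:.1f} ms")
```

Differences of 10-20 ms are close to the noise floor of this method, so taking the best of several runs matters. Python's built-in `-X importtime` option gives a per-module breakdown if more detail is needed.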

It should at least reduce the amount of memory used by gallery-dl since a lot of static resources are no longer loaded by default unless needed, and the overhead is insignificant once loaded for the first time.

> Can't say I'm a fan of splitting extractors into 2 places

One could argue that #4504 was already the first step in this direction, and nobody "complained" then. Removing tests made the code a lot more readable than, for example, the code of yt-dlp extractors, which are usually 50% test data...

@thatfuckingbird
Contributor

Yeah, based on that ~10-20 ms reduction I would say this change doesn't make sense as an optimization. On the other hand, if you prefer the code organized this way then sure, go for it. Personally I wouldn't do it, but I don't really have any good reasons besides subjective taste. And if we might have hundreds to thousands of lines of GraphQL stuff then at least separating that sounds like a good idea.

@mikf
Owner Author

mikf commented Oct 30, 2025

Guess I'll revert most of the changes and apply this to only GraphQL queries and other larger utility functions like DA Tiptap-to-HTML, Twitter Transaction ID, and Tsumino JSURL, i.e. no API interface code.

@mikf changed the title from "Lazy load extractor resources & utility classes" to "Lazy load extractor resources & utilities" on Oct 30, 2025
