@joeydominic

Firefox‐based HTTP fetching bridge proof-of-concept

This proposal introduces a Firefox‐based HTTP fetching bridge for gallery-dl, leveraging a real browser (Firefox ESR 128+) instead of pure-Python HTTP libraries. By routing requests through a Firefox extension and native messaging host, we can bypass aggressive anti-scraping measures on sites like Fanbox or Patreon, without resorting to brittle TLS‐signature emulation or OAuth workarounds.


Motivation

Many modern websites employ sophisticated bot-detection techniques—ranging from HTTP/2 fingerprinting to dynamic JavaScript challenges—that are extremely difficult to replicate in pure Python. For example, when scraping Fanbox, only the first post downloads successfully; subsequent requests return HTTP 403. Rather than impersonate a browser, we can simply use a real browser engine to perform HTTP fetches, inheriting all native features (TLS stack, HTTP/2 support, cookies, JavaScript execution, etc.).


Architecture Overview

  1. Firefox Extension

    • Listens for fetch commands via Native Messaging.
    • Executes fetch() in the page context, inheriting cookies, JS, and TLS/HTTP2.
    • Streams small responses over the messaging channel; large files are handled through the browser’s Downloads API.
  2. Native Messaging Host (Python)

    • Exposes a local HTTP proxy (127.0.0.1:8888).
    • For each incoming request, forwards method, headers, and URL to the extension, then relays back the status, headers, and body.
  3. HTTPS → HTTP URL Rewriting Trick

    • Why it’s needed: Clients reach HTTPS sites through a proxy with the CONNECT method, which establishes a blind TCP tunnel between client and server. The proxy only relays encrypted bytes, so nothing on the browser side can intercept the request, defeating our goal of having Firefox perform the fetch.
    • Our workaround: In gallery-dl, when the configured proxy is the FF Fetch Bridge (auto-detected), we rewrite https://… to http://… before sending the request. The extension then restores the original scheme inside Firefox and issues the secure fetch.
    • Why not simply redirect the client? Most HTTP clients will reject an unsolicited 3xx redirect or HTTP 405 sent in response to a CONNECT request. Besides, with CONNECT the proxy only sees the target host and port, never the full URL. Our transparent rewrite keeps the proxy interface trivial for gallery-dl.
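To make the link between the extension and the Python host more concrete: Firefox native messaging exchanges JSON messages over the host's stdin/stdout, each one prefixed with a 32-bit length in native byte order. The following is a minimal sketch of that framing, not the actual ipc.py; the message fields (id, method, url, headers) are hypothetical and only illustrate what a fetch command might carry.

```python
import io
import json
import struct


def encode_message(obj):
    """Frame one message: 4-byte native-endian length, then UTF-8 JSON."""
    payload = json.dumps(obj).encode("utf-8")
    return struct.pack("=I", len(payload)) + payload


def read_message(stream):
    """Read one framed message from a binary stream; None on end of stream."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("=I", header)
    return json.loads(stream.read(length).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical fetch command; the real bridge may use different fields.
    command = {"id": 1, "method": "GET", "url": "https://example.com/", "headers": {}}
    frame = encode_message(command)
    # Round-trip through an in-memory stream instead of real stdin/stdout.
    print(read_message(io.BytesIO(frame)))
```

In a real host the stream would be sys.stdin.buffer for reading and sys.stdout.buffer for writing, flushed after every message.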

Installation & Setup (Linux)

  1. Unpack & Inspect the XPI

    Get the bridge here: https://github.com/joeydominic/ff-fetch-bridge

    wget https://github.com/joeydominic/ff-fetch-bridge/raw/refs/heads/main/native-fetch-bridge-0.1.3.xpi
    unzip native-fetch-bridge-0.1.3.xpi -d /tmp/ff-fetch-bridge

    Review manifest.json and the JavaScript sources to verify there’s no malicious code.

  2. Install the Firefox Extension

    • In about:config, temporarily set xpinstall.signatures.required = false (see Mozilla docs); the unsigned XPI will not install otherwise.
    • Open about:addons → “Install Add-on From File…” and select the XPI.
  3. Register the Native Messaging Host

    sudo mkdir -p /opt/ff-fetch-bridge
    sudo cp ipc.py http_proxy.py /opt/ff-fetch-bridge/
    mkdir -p ~/.mozilla/native-messaging-hosts
    cp com.example.fffetchbridgeipc.json ~/.mozilla/native-messaging-hosts/
    • The JSON manifest points Firefox to /opt/ff-fetch-bridge/ipc.py.
  4. Launch Firefox

    • It will expose an HTTP proxy at 127.0.0.1:8888.
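A quick way to confirm that the setup worked is to check whether something accepts TCP connections on the proxy port once Firefox is running. This is a small sketch using only the standard library; the function name is made up for illustration, and the default port is the one stated above (the debug build uses 18888 instead).

```python
import socket


def bridge_is_listening(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "up" if bridge_is_listening() else "not reachable"
    print(f"FF Fetch Bridge proxy on 127.0.0.1:8888 is {state}")
```

A successful connection only shows that the port is open; it does not prove that the extension and the native host are talking to each other.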

Usage in gallery-dl

Add or update your config under the relevant extractor (e.g. fanbox):

"fanbox": {
    "proxy": "http://127.0.0.1:8888",
    "browser": null,
    "timeout": 600
}
  • No extra headers, cookies, or OAuth tokens need to be managed manually, because Firefox handles authentication state naturally. From this step onward, you log in to the site of interest in Firefox as usual, and gallery-dl requests sent through the bridge reuse that session.
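For testing the bridge without gallery-dl, the same HTTPS → HTTP rewrite can be applied by hand in any client that supports a plain HTTP proxy. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the bridge restores https:// for every request it receives, which may not match how the extension actually decides.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit

BRIDGE_PROXY = "http://127.0.0.1:8888"  # proxy address from the setup above


def to_bridge_url(url):
    """Downgrade https:// to http:// so the client sends a plain proxy request."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        parts = parts._replace(scheme="http")
    return urlunsplit(parts)


def fetch_via_bridge(url, timeout=600):
    """Fetch url through the bridge and return (status, body bytes)."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": BRIDGE_PROXY})
    )
    # Assumption: the extension restores the https scheme before fetching.
    with opener.open(to_bridge_url(url), timeout=timeout) as resp:
        return resp.status, resp.read()
```

Calling fetch_via_bridge("https://example.com/") requires Firefox and the bridge to be running; the generous timeout mirrors the "timeout": 600 setting in the config above.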

Developer Notes & Debugging

  1. Extension Debug Console

    • Visit about:debugging#/runtime/this-firefox → “Inspect” on the fetch bridge.
    • Use the Network and Console tabs for live logs.
  2. Enable Verbose Logging

    FFFETCH_BRIDGE_DEBUG=1 firefox
    • Debug version listens at a different port: 127.0.0.1:18888.
    • Extension logs (JS console.log) appear in the extension console.
    • The Python proxy writes debug output to /tmp/dbg.log and to stderr (viewable in the global browser console via Ctrl + Shift + J).
  3. Timeout Considerations

    • The extension buffers the full response before streaming; large responses may trigger read timeouts in gallery-dl.
    • Increase "timeout" as needed (e.g. 600 s).
  4. Large File Downloads

    • Files >15 MB are saved using Firefox’s Downloads API, read by the proxy, then deleted.
    • Occasional hiccups for files >100 MB can occur—use debug mode to trace issues.

Known Limitations

  • In-Memory Buffers
    Both extension and proxy fully buffer each response in RAM. Large files on low-RAM systems may trigger OOM errors.
  • HTTPS CONNECT Bypass
    Pure CONNECT requests can’t be used; the HTTP rewrite hack is mandatory.
  • Cloudflare & JS Challenges
    Sites behind Cloudflare may present interstitial JS challenges that require user interaction or headless JS completion.

TODO

  • Cloudflare Challenge Handling
    Detect Cloudflare blocks, automatically open the challenge page in a new tab, and either:

    1. Wait for the user to solve it and then resume crawling.
    2. Automate resolution via headless execution if possible.

Plea for Core Integration & Community Support

I kindly urge the gallery-dl maintainers to merge FF Fetch Bridge support into the main branch. I bet I implemented the integration poorly, sorry for that. With built-in extension detection and automatic HTTPS→HTTP rewriting, users would enjoy out-of-the-box resilience against modern anti-bot defenses.

Moreover, I invite the community to adopt and maintain the Firefox extension, improving cross-platform support and Cloudflare challenge handling. I'm a newbie to extensions and wrote this using ChatGPT :)

Thank you for considering this enhancement! I’m eager to collaborate on refinement, address questions, or help integrate it into gallery-dl’s codebase.
