@foxt451 foxt451 commented Dec 18, 2025

The framework was mostly copied from cheerio-scraper (trimmed in many places), and the request handler was inspired by the sitemap scraper in WCC.

There has been discussion about how to avoid duplicating code between WCC and this repo, and some advised extracting the sitemap scraper into a package. However, when I checked the sitemap crawler code in WCC, it turned out to be tightly coupled to WCC and quite short, so I just copied it over with modifications.

BUT, the one thing I copied without any changes is the `discoverValidSitemaps` util from WCC. I'd like to extract it somewhere, e.g. into scraper-tools, because it seems like quite a generic function.

Tested locally. For now, the scraper just pushes one dataset item per page, containing the URL and status code.
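
For illustration, the per-page output could look roughly like the sketch below. This is not the code in this PR: the `PageStatusItem` field names, the `PushItem` callback, and `recordPage` are all hypothetical, and an in-memory array stands in for the real dataset.

```typescript
// Hypothetical shape of one dataset item; the field names are assumptions.
interface PageStatusItem {
    url: string;
    statusCode: number;
}

// Stand-in for whatever actually writes to the dataset (hypothetical, not a real API).
type PushItem = (item: PageStatusItem) => void;

// Build and push one item for a fetched page.
function recordPage(url: string, statusCode: number, push: PushItem): void {
    push({ url, statusCode });
}

// Usage with an in-memory array in place of a real dataset.
const items: PageStatusItem[] = [];
recordPage('https://example.com/', 200, (item) => items.push(item));
recordPage('https://example.com/missing', 404, (item) => items.push(item));
```

Keeping the item this small means the dataset acts as a simple availability report, and more fields can be added later without changing the handler's structure.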

Closes apify/apify-sdk-js#486

@foxt451 changed the title from "Feat/sitemap scraper" to "Add sitemap scraper" on Dec 18, 2025
@@ -0,0 +1,126 @@
import type { ProxyConfiguration } from 'crawlee';

This file was copied in full from WCC, but I'd like to move these helpers into a shared lib.

Development

Successfully merging this pull request may close these issues.

Actor to check web page availability

1 participant