Add sitemap scraper #205

foxt451 · 2025-12-18T20:20:24Z

The framework is mostly copied from cheerio-scraper (trimmed in many places), and the request handler is inspired by the sitemap scraper in WCC.

There has been some discussion about how to avoid duplicating code between WCC and this repo, and some suggested extracting the sitemap scraper into a package. However, when I checked the sitemap crawler code in WCC, it turned out to be tightly coupled to WCC and quite short, so I just copied it over with modifications.

However, the one thing I copied without any changes is the `discoverValidSitemaps` util from WCC. I'd like to extract it somewhere, e.g. into scraper-tools, because it seems like quite a generic function.

Tested locally. For now, the scraper just pushes a dataset item with a URL and status code for each page.
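To illustrate the current output shape, here is a minimal, dependency-free sketch. It is not the code from this PR: `PageResult`, `extractSitemapUrls`, and `toPageResult` are hypothetical names, the `statusCode` field name is an assumption, and the regex-based `<loc>` extraction is a naive stand-in for real sitemap parsing.

```typescript
// Hypothetical shape of one dataset item: a URL plus the HTTP status code.
interface PageResult {
    url: string;
    statusCode: number;
}

// Naive extraction of <loc> values from a sitemap XML string.
// Assumption: plain regex matching is enough for illustration; it does not
// decode XML entities or handle CDATA sections.
function extractSitemapUrls(xml: string): string[] {
    const urls: string[] = [];
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
    let match: RegExpExecArray | null = locPattern.exec(xml);
    while (match !== null) {
        urls.push(match[1]);
        match = locPattern.exec(xml);
    }
    return urls;
}

// Build the item that would be pushed to the dataset for one page.
function toPageResult(url: string, statusCode: number): PageResult {
    return { url, statusCode };
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// The status code is hard-coded here; in the actor it would come from the response.
const items: PageResult[] = extractSitemapUrls(sampleSitemap).map((url) => toPageResult(url, 200));
console.log(items);
```

In the actor itself, each such item would be pushed to the dataset from the request handler rather than collected into an array.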

Closes apify/apify-sdk-js#486

foxt451 · 2025-12-18T20:21:22Z

packages/actor-scraper/sitemap-scraper/src/internals/tools.ts

@@ -0,0 +1,126 @@
+import type { ProxyConfiguration } from 'crawlee';


This file was copied in full from WCC, but I'd like to move these helpers into a shared lib.

foxt451 added 4 commits November 14, 2025 14:45

feat: add sitemap-scraper

a8a1aec

chore: remove unused

adeef7d

Merge branch 'master' into feat/sitemap-scraper

2c76f68

fix: remarks

b413241

foxt451 changed the title ~~Feat/sitemap scraper~~ Add sitemap scraper Dec 18, 2025


foxt451 added 5 commits December 18, 2025 22:32

chore: move package

df66d41

fix: build errors

57e5650

chore: delete package-lock.json

7280392

fix: dockerfile

18e10e5

chore: update INPUT_SCHEMA.json

b9c5680
