The framework was mostly copied from cheerio-scraper (but trimmed in a lot of places), and the request handler was inspired by the sitemap scraper in WCC.
There has been discussion of how to avoid duplicating code between WCC and here, and some advised extracting the sitemap scraper into a package. But when I checked the sitemap crawler code in WCC, it turned out to be tightly coupled to WCC and quite short, so I just copied it over with modifications.
BUT, the one thing I copied without any changes is the `discoverValidSitemaps` util from WCC. I'd like to extract it somewhere, e.g. into `scraper-tools`, because it seems like quite a generic function.
Tested locally. For now it will just push dataset items with a URL and status code for each page.
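For reviewers, here is a rough sketch of the item shape described above, one record per page with its URL and status code. This is not the code in this PR: `PageResult`, `toPageResult`, `handlePage`, and the in-memory `pushData` stand-in are hypothetical names used only for illustration, and the real handler would use the crawler's own data-pushing function instead.

```typescript
// Hypothetical shape of one dataset item: a page URL and its HTTP status code.
interface PageResult {
    url: string;
    status: number;
}

// Stand-in for the crawler's data-pushing function (an assumption, not the real API).
type PushData = (item: PageResult) => Promise<void>;

// Build the item that gets stored for a single page.
function toPageResult(url: string, statusCode: number): PageResult {
    return { url, status: statusCode };
}

// Minimal handler sketch: store exactly one item per visited page.
async function handlePage(url: string, statusCode: number, pushData: PushData): Promise<void> {
    await pushData(toPageResult(url, statusCode));
}

// In-memory replacement for a dataset, handy for local experiments.
function createMemoryDataset(): { items: PageResult[]; pushData: PushData } {
    const items: PageResult[] = [];
    return {
        items,
        pushData: async (item: PageResult): Promise<void> => {
            items.push(item);
        },
    };
}
```

Locally, calling `handlePage(url, 200, dataset.pushData)` with a dataset from `createMemoryDataset()` appends one `{ url, status }` record to `dataset.items`.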
Closes apify/apify-sdk-js#486