Add discover valid sitemaps utility (port from JS)
## Summary

Port the `discoverValidSitemaps()` utility from Crawlee JS to Python.
JS source: `packages/utils/src/internals/sitemap.ts` (#3392)
## How it works in JS
```ts
async function* discoverValidSitemaps(
  urls: string[],
  options?: { proxyUrl?: string; httpClient?: BaseHttpClient },
): AsyncIterable<string>
```
- Group input URLs by hostname
- For each domain, discover sitemaps from (in order):
  - `Sitemap:` entries in robots.txt
  - Input URLs that match `/sitemap\.(xml|txt)(\.gz)?$/i`
  - HEAD-request probing of `/sitemap.xml`, `/sitemap.txt`, and `/sitemap_index.xml` (fallback)
- Deduplicate and process domains concurrently
Returns an async iterable yielding sitemap URLs as discovered.
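The grouping and direct-URL detection steps above are easy to express with the standard library alone. A minimal sketch (the function names `group_urls_by_hostname` and `direct_sitemap_urls` are illustrative, not part of Crawlee):

```python
import re
from urllib.parse import urlsplit

# Python equivalent of the JS /sitemap\.(xml|txt)(\.gz)?$/i check
# for input URLs that are themselves sitemaps.
SITEMAP_URL_RE = re.compile(r'/sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)


def group_urls_by_hostname(urls: list[str]) -> dict[str, list[str]]:
    """Group input URLs by hostname so each domain is processed once."""
    groups: dict[str, list[str]] = {}
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname is not None:
            groups.setdefault(hostname, []).append(url)
    return groups


def direct_sitemap_urls(urls: list[str]) -> list[str]:
    """Return the input URLs whose path already looks like a sitemap."""
    return [url for url in urls if SITEMAP_URL_RE.search(urlsplit(url).path)]
```

Anchoring the regex on the path (not the full URL) avoids false positives from query strings or fragments.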
## What Python already has
- `Sitemap.try_common_names()` probes `/sitemap.xml` and `/sitemap.txt` for a single URL (missing: `/sitemap_index.xml`)
- `RobotsTxtFile.find()` + `get_sitemaps()` fetches robots.txt and extracts its `Sitemap:` entries
## What's missing

The orchestrating function that combines these steps: group input URLs by hostname, detect direct sitemap URLs in the input, validate candidates via HEAD requests, and process domains concurrently.
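As a rough shape for that orchestrator, here is a hedged sketch. It takes an injected `probe` callable in place of Crawlee's HTTP client (it should HEAD-request a URL and report whether it looks like a valid sitemap) and omits the robots.txt step, which the real port would delegate to `RobotsTxtFile`. All names are illustrative, not the final API:

```python
import asyncio
import re
from collections.abc import AsyncIterator, Awaitable, Callable
from urllib.parse import urlsplit

_SITEMAP_URL_RE = re.compile(r'/sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)
_FALLBACK_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')


async def discover_valid_sitemaps(
    urls: list[str],
    probe: Callable[[str], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Yield sitemap URLs for the given input URLs as they are discovered."""
    # Group input URLs by hostname so each domain is handled once.
    by_host: dict[str, list[str]] = {}
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname is not None:
            by_host.setdefault(hostname, []).append(url)

    async def check_domain(host: str, domain_urls: list[str]) -> list[str]:
        found: list[str] = []
        # 1. Input URLs that already look like sitemaps.
        for url in domain_urls:
            if _SITEMAP_URL_RE.search(urlsplit(url).path):
                found.append(url)
        # 2. Fall back to probing well-known sitemap locations.
        if not found:
            scheme = urlsplit(domain_urls[0]).scheme or 'https'
            for path in _FALLBACK_PATHS:
                candidate = f'{scheme}://{host}{path}'
                if await probe(candidate):
                    found.append(candidate)
        return found

    # Process domains concurrently, yielding deduplicated results as
    # each domain's check finishes.
    tasks = [asyncio.create_task(check_domain(h, u)) for h, u in by_host.items()]
    seen: set[str] = set()
    for task in asyncio.as_completed(tasks):
        for sitemap_url in await task:
            if sitemap_url not in seen:
                seen.add(sitemap_url)
                yield sitemap_url
```

Injecting `probe` keeps the sketch testable without a network; in the real port the same seam would accept Crawlee's `HttpClient`/`proxy_url` options, mirroring the JS signature.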