Add discover valid sitemaps utility (port from JS)

Summary

Port the discoverValidSitemaps() utility from Crawlee JS to Python.

JS source: packages/utils/src/internals/sitemap.ts#3392

How it works in JS

async function* discoverValidSitemaps(
    urls: string[],
    options?: { proxyUrl?: string; httpClient?: BaseHttpClient }
): AsyncIterable<string>
  1. Group input URLs by hostname
  2. For each domain, discover sitemaps from (in order):
    • Sitemap: entries in robots.txt
    • Input URLs that match /sitemap\.(xml|txt)(\.gz)?$/i
    • HEAD-request probing of /sitemap.xml, /sitemap.txt, /sitemap_index.xml (fallback)
  3. Deduplicate and process domains concurrently

Returns an async iterable yielding sitemap URLs as discovered.
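Step 1 and the direct-URL part of step 2 are plain URL manipulation, so they could be sketched in Python like this (the function names here are illustrative, not an existing API; the regex is the one from the JS source above):

```python
import re
from urllib.parse import urlparse

# Matches URLs that point directly at a sitemap file, e.g.
# https://example.com/sitemap.xml, .../sitemap.txt, .../sitemap.xml.gz
SITEMAP_URL_RE = re.compile(r'sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)


def group_urls_by_hostname(urls: list[str]) -> dict[str, list[str]]:
    """Group input URLs by hostname (step 1 of the JS algorithm)."""
    groups: dict[str, list[str]] = {}
    for url in urls:
        hostname = urlparse(url).hostname
        if hostname is not None:
            groups.setdefault(hostname, []).append(url)
    return groups


def direct_sitemap_urls(urls: list[str]) -> list[str]:
    """Return input URLs that already look like sitemap files (step 2, second bullet)."""
    return [url for url in urls if SITEMAP_URL_RE.search(urlparse(url).path)]
```

URLs whose hostname cannot be parsed are silently skipped here; the real port may want to log or raise instead.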

What Python already has

  • Sitemap.try_common_names() — probes /sitemap.xml and /sitemap.txt for a single URL, but does not probe /sitemap_index.xml
  • RobotsTxtFile.find() + get_sitemaps() — fetches and extracts Sitemap: entries from robots.txt

What's missing is the orchestrating function that ties these building blocks together: grouping input URLs by hostname, detecting direct sitemap URLs in the input, validating candidates via HEAD requests, and processing domains concurrently.
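One possible shape for that orchestrator, sketched with the two discovery sources injected as callables so that the real robots.txt lookup (e.g. RobotsTxtFile.find() + get_sitemaps()) and a real HTTP client can be plugged in later — everything below is an assumption about the eventual design, not the final API:

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable
from urllib.parse import urlparse

COMMON_SITEMAP_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')
SITEMAP_URL_RE = re.compile(r'sitemap\.(xml|txt)(\.gz)?$', re.IGNORECASE)


async def discover_valid_sitemaps(
    urls: list[str],
    *,
    robots_sitemaps: Callable[[str], Awaitable[list[str]]],
    probe: Callable[[str], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Yield sitemap URLs per hostname, deduplicated, as they are discovered.

    `robots_sitemaps(base_url)` should return the Sitemap: entries from robots.txt;
    `probe(url)` should HEAD-request `url` and report whether it exists.
    """
    # Step 1: group input URLs by scheme + hostname.
    groups: dict[str, list[str]] = {}
    for url in urls:
        parsed = urlparse(url)
        if parsed.hostname:
            groups.setdefault(f'{parsed.scheme}://{parsed.hostname}', []).append(url)

    async def discover_for_domain(base: str, domain_urls: list[str]) -> list[str]:
        # Source 1: Sitemap: entries in robots.txt.
        found = list(await robots_sitemaps(base))
        # Source 2: input URLs that already look like sitemap files.
        found += [u for u in domain_urls if SITEMAP_URL_RE.search(urlparse(u).path)]
        # Source 3 (fallback only): probe the common sitemap locations.
        if not found:
            for path in COMMON_SITEMAP_PATHS:
                candidate = base + path
                if await probe(candidate):
                    found.append(candidate)
        return found

    # Step 3: process domains concurrently, deduplicate, yield as completed.
    tasks = [
        asyncio.ensure_future(discover_for_domain(base, domain_urls))
        for base, domain_urls in groups.items()
    ]
    seen: set[str] = set()
    for task in asyncio.as_completed(tasks):
        for sitemap_url in await task:
            if sitemap_url not in seen:
                seen.add(sitemap_url)
                yield sitemap_url
```

Injecting the sources also makes the concurrency and deduplication logic unit-testable with fakes, without any network access.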