feat: add `discover_valid_sitemaps` utility by Mantisus · Pull Request #1777 · apify/crawlee-python
Pull request overview
This PR introduces a Python `discover_valid_sitemaps` helper that discovers sitemap URLs for a set of input URLs (robots.txt sitemap directives, direct sitemap URLs, and common sitemap paths), aligning with issue #1740.
Changes:
- Add `discover_valid_sitemaps()` (plus internal helpers/constants) to orchestrate sitemap discovery per hostname and deduplicate results.
- Extend common sitemap probing to include `/sitemap_index.xml` and add `is_status_code_successful()` for status evaluation.
- Add unit tests covering robots.txt discovery, common-path probing, input URL detection, deduplication, and multi-domain behavior.
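The discovery flow described above can be sketched as a pure function. This is a hedged illustration only, not the PR's implementation: the function name `candidate_sitemap_urls`, its signature, and the exact set of common paths are assumptions for the sake of the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Common sitemap paths probed per hostname (assumed; the PR's actual
# constant may differ).
COMMON_SITEMAP_PATHS = ('/sitemap.xml', '/sitemap_index.xml')


def candidate_sitemap_urls(
    input_urls: list[str],
    robots_txt_by_host: dict[str, str],
) -> list[str]:
    """Collect deduplicated sitemap candidates for a set of input URLs.

    Sources, mirroring the PR description: sitemaps declared in
    robots.txt, input URLs that already point at a sitemap, and common
    sitemap paths on the same host.
    """
    seen: set[str] = set()
    results: list[str] = []

    def add(url: str) -> None:
        if url not in seen:
            seen.add(url)
            results.append(url)

    for url in input_urls:
        parts = urlsplit(url)
        host = parts.hostname or ''

        # 1. An input URL that is itself a sitemap.
        if parts.path.endswith('.xml') and 'sitemap' in parts.path:
            add(url)

        # 2. Sitemap directives from this host's robots.txt.
        for line in robots_txt_by_host.get(host, '').splitlines():
            if line.lower().startswith('sitemap:'):
                add(line.split(':', 1)[1].strip())

        # 3. Common sitemap locations on the same host.
        for path in COMMON_SITEMAP_PATHS:
            add(urlunsplit((parts.scheme, parts.netloc, path, '', '')))

    return results
```

The actual utility additionally probes each candidate over HTTP and keeps only those that respond successfully, which the sketch omits.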
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/crawlee/_utils/sitemap.py` | Implements sitemap discovery orchestration, common-path probing, and async generator merging. |
| `src/crawlee/_utils/web.py` | Adds a helper to classify 2xx/3xx responses as "successful". |
| `tests/unit/_utils/test_sitemap.py` | Adds unit tests for the new sitemap discovery utility with mocked HTTP behavior. |
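The "successful" classification added in `web.py` presumably treats any 2xx or 3xx status as a hit. A minimal sketch of such a helper (the name comes from the PR; the body and its rationale are assumed):

```python
from http import HTTPStatus


def is_status_code_successful(status_code: int) -> bool:
    """Treat 2xx and 3xx responses as successful for sitemap probing.

    Redirects count because a probed common path such as /sitemap.xml
    often redirects to the real sitemap location. (Assumed semantics;
    the actual helper in the PR may differ.)
    """
    return HTTPStatus.OK <= status_code < HTTPStatus.BAD_REQUEST
```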