feat: add `discover_valid_sitemaps` utility by Mantisus · Pull Request #1777 · apify/crawlee-python

Pull request overview

This PR introduces a Python `discover_valid_sitemaps` helper that discovers sitemap URLs for a set of input URLs (sitemaps listed in robots.txt, direct sitemap URLs, and common sitemap paths), addressing issue #1740.

Changes:

  • Add discover_valid_sitemaps() (plus internal helpers/constants) to orchestrate sitemap discovery per-hostname and deduplicate results.
  • Extend common sitemap probing to include /sitemap_index.xml and add is_status_code_successful() for status evaluation.
  • Add unit tests covering robots.txt discovery, common-path probing, input URL detection, deduplication, and multi-domain behavior.
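The per-hostname candidate generation and deduplication described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: `_candidate_sitemap_urls` and `discover_candidates` are hypothetical names, and the exact probe list is assumed (the PR only confirms that `/sitemap_index.xml` was added to the common paths).

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed probe list; the PR confirms /sitemap_index.xml was added to it.
COMMON_SITEMAP_PATHS = ['/sitemap.xml', '/sitemap_index.xml']


def _candidate_sitemap_urls(url: str) -> list[str]:
    """Build candidate sitemap URLs for one input URL: the URL itself if it
    already looks like a sitemap, plus the host's common sitemap paths."""
    parts = urlsplit(url)
    base = urlunsplit((parts.scheme, parts.netloc, '', '', ''))
    candidates = []
    if parts.path.endswith('.xml'):
        candidates.append(url)
    candidates.extend(base + path for path in COMMON_SITEMAP_PATHS)
    return candidates


def discover_candidates(urls: list[str]) -> list[str]:
    """Collect candidates across all inputs, deduplicating results so each
    sitemap URL is probed at most once even across overlapping inputs."""
    seen: set[str] = set()
    result: list[str] = []
    for url in urls:
        for candidate in _candidate_sitemap_urls(url):
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result
```

In the real utility the probing step would fetch each candidate (and parse robots.txt for `Sitemap:` directives) concurrently per hostname; the sketch above only covers candidate generation and deduplication.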

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/crawlee/_utils/sitemap.py Implements sitemap discovery orchestration, common-path probing, and async generator merging.
src/crawlee/_utils/web.py Adds a helper to classify 2xx/3xx responses as “successful”.
tests/unit/_utils/test_sitemap.py Adds unit tests for the new sitemap discovery utility with mocked HTTP behavior.
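The status helper in `web.py` can be sketched as below; the exact signature may differ from the PR, but per the file description it classifies 2xx/3xx responses as successful, so a redirect to an existing sitemap still counts as a hit during common-path probing.

```python
def is_status_code_successful(status_code: int) -> bool:
    """Treat 2xx and 3xx HTTP responses as 'successful' for sitemap probing,
    so redirected sitemap URLs are not discarded."""
    return 200 <= status_code < 400
```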
