fix: Do not raise an error to check 'same-domain' if there is no hostname in the url by Mantisus · Pull Request #1251 · apify/crawlee-python

@Mantisus

Description

  • This PR is needed because the enqueue strategy check is now used in extract_links and can be applied to invalid URLs.


Pull Request Overview

This PR changes the behavior of the same-domain enqueue strategy to return False instead of raising an error when either URL lacks a hostname, preventing exceptions on invalid URLs.

  • Replaces ValueError with a silent skip (return False) when a hostname is missing in same-domain checks.
  • Ensures extract_links can apply the enqueue strategy without erroring on malformed URLs.
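A minimal sketch of the behavioral change described above. Names are hypothetical (the real method is `_check_enqueue_strategy` on `BasicCrawler`), and the comparison is simplified to full hostnames, whereas the actual implementation may compare registered domains:

```python
from urllib.parse import urlparse


def check_same_domain(origin_url: str, target_url: str) -> bool:
    """Simplified stand-in for the 'same-domain' enqueue strategy check.

    Previously, a URL without a hostname raised a ValueError; after this
    change, such URLs are silently skipped by returning False.
    """
    origin = urlparse(origin_url)
    target = urlparse(target_url)
    # Malformed or scheme-only URLs (e.g. "mailto:", "javascript:") have no
    # hostname; skip them instead of raising.
    if origin.hostname is None or target.hostname is None:
        return False
    return origin.hostname == target.hostname
```

With this change, links such as `mailto:` or `javascript:` URLs extracted from a page are simply not enqueued, rather than aborting the extraction with an exception.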
Comments suppressed due to low confidence (2)

src/crawlee/crawlers/_basic/_basic_crawler.py:970

  • Add a unit test for the case where origin_url.hostname or target_url.hostname is None with the same-domain strategy to verify that it returns False and does not enqueue.

src/crawlee/crawlers/_basic/_basic_crawler.py:967

  • Update the method docstring for _check_enqueue_strategy to document that URLs without hostnames will simply return False for the same-domain strategy instead of raising an error.
def _check_enqueue_strategy(
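The unit test suggested in the review comment could look roughly like the sketch below. It tests against a simplified stand-in check (hypothetical name `check_same_domain`) rather than the real `_check_enqueue_strategy`, which takes parsed URL objects on the crawler instance:

```python
from urllib.parse import urlparse


def check_same_domain(origin_url: str, target_url: str) -> bool:
    # Simplified stand-in for the crawler's same-domain check.
    origin, target = urlparse(origin_url), urlparse(target_url)
    if origin.hostname is None or target.hostname is None:
        return False
    return origin.hostname == target.hostname


def test_same_domain_missing_hostname_returns_false() -> None:
    # A target URL without a hostname must be skipped, not raise an error.
    assert check_same_domain("https://example.com/page", "javascript:void(0)") is False
    # An origin without a hostname is likewise skipped.
    assert check_same_domain("file", "https://example.com/page") is False
```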


janbuchar


Seems reasonable 👍


@vdusek vdusek changed the title fix: Do not raise an error to check ‘same-domain’ if there is no hostname in the url fix: Do not raise an error to check 'same-domain' if there is no hostname in the url

Jun 17, 2025

vdusek


LGTM