fix: respect `EnqueueLinksKwargs` for `extract_links` function by Mantisus · Pull Request #1213 · apify/crawlee-python

@Mantisus

@Mantisus

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the support for EnqueueLinksKwargs parameters for the extract_links function in multiple crawler implementations. The key changes include:

  • Adding new kwargs support (e.g., include, exclude, and limit) for link extraction in both Playwright and Abstract HTTP crawlers.
  • Updating related tests in Playwright, Parsel, and BeautifulSoup crawlers to validate the new parameters.
  • Updating import statements to include Glob from crawlee.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/crawlers/_playwright/test_playwright_crawler.py Updated test to use new kwargs parameters and added Glob import.
tests/unit/crawlers/_parsel/test_parsel_crawler.py Updated test to use new kwargs parameters and added Glob import.
tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py Updated test to use new kwargs parameters and added Glob import.
src/crawlee/crawlers/_playwright/_playwright_crawler.py Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs.
src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs.

Pijukatel

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed confusing, that the extract_links has the EnqueueLinksKwargsand not actually using them. I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though. Maybe it would be sufficient to remove EnqueueLinksKwargs from extract_links and keep it only on add_requests?

I think that even the issue reporter could have worked around the issue, by just passing the EnqueueLinksKwargs to the add_requests call even if it contained unfiltered urls.

@Mantisus

I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though.

Yeah, I thought of that too and I don't like it too.

But it could be useful for users. Like in your example, when the user extracts links by applying filters from EnqueueLinksKwargs and then does something else with them.

So it's a choice between user-friendliness and some redundancy in the code

@Pijukatel

I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though.

Yeah, I thought of that too and I don't like it too.

But it could be useful for users. Like in your example, when the user extracts links by applying filters from EnqueueLinksKwargs and then does something else with them.

So it's a choice between user-friendliness and some redundancy in the code

You are right, it is more straight forward and more convenient for the users. I can't really argue against my own example :-)

Pijukatel

@janbuchar

I think this is fine as a fix and we should merge it. As the next step, do you think we could make some shared function for applying EnqueueLinksKwargs to a bunch of requests? I believe it would make the resulting code easier to navigate.

@Mantisus

As the next step, do you think we could make some shared function for applying EnqueueLinksKwargs to a bunch of requests?

I totally agree!