fix: respect `EnqueueLinksKwargs` for `extract_links` function by Mantisus · Pull Request #1213 · apify/crawlee-python
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes the support for EnqueueLinksKwargs parameters for the extract_links function in multiple crawler implementations. The key changes include:
- Adding new kwargs support (e.g., include, exclude, and limit) for link extraction in both Playwright and Abstract HTTP crawlers.
- Updating related tests in Playwright, Parsel, and BeautifulSoup crawlers to validate the new parameters.
- Updating import statements to include Glob from crawlee.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/crawlers/_playwright/test_playwright_crawler.py | Updated test to use new kwargs parameters and added Glob import. |
| tests/unit/crawlers/_parsel/test_parsel_crawler.py | Updated test to use new kwargs parameters and added Glob import. |
| tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py | Updated test to use new kwargs parameters and added Glob import. |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs. |
| src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed confusing, that the extract_links has the EnqueueLinksKwargsand not actually using them. I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though. Maybe it would be sufficient to remove EnqueueLinksKwargs from extract_links and keep it only on add_requests?
I think that even the issue reporter could have worked around the issue, by just passing the EnqueueLinksKwargs to the add_requests call even if it contained unfiltered urls.
I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though.
Yeah, I thought of that too and I don't like it too.
But it could be useful for users. Like in your example, when the user extracts links by applying filters from EnqueueLinksKwargs and then does something else with them.
So it's a choice between user-friendliness and some redundancy in the code
I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though.
Yeah, I thought of that too and I don't like it too.
But it could be useful for users. Like in your example, when the user extracts links by applying filters from
EnqueueLinksKwargsand then does something else with them.So it's a choice between user-friendliness and some redundancy in the code
You are right, it is more straight forward and more convenient for the users. I can't really argue against my own example :-)
I think this is fine as a fix and we should merge it. As the next step, do you think we could make some shared function for applying EnqueueLinksKwargs to a bunch of requests? I believe it would make the resulting code easier to navigate.
As the next step, do you think we could make some shared function for applying
EnqueueLinksKwargsto a bunch of requests?
I totally agree!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters