feat: Add `transform_request_function` parameter for `SitemapRequestLoader` by Mantisus · Pull Request #1525 · apify/crawlee-python
Description
This PR is inspired by this discussion.
Add support for transform_request_function to SitemapRequestLoader. It works the same way as in EnqueueLinksFunction. This can be useful for setting a label for correct routing, or for attaching custom data via user_data.
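As a rough illustration of the description above, here is a minimal sketch of a transform function that sets a label and user_data. To keep it runnable without crawlee installed, RequestOptions and RequestTransformAction are redefined here as simplified stand-ins (in crawlee they are assumed to be importable from the library itself), and the URL patterns and the 'PRODUCT' label are hypothetical.

```python
from typing import Any, Literal, TypedDict, Union


# Simplified stand-ins (assumption): in real code these types would come
# from crawlee rather than being redefined here.
class RequestOptions(TypedDict, total=False):
    url: str
    label: str
    user_data: dict[str, Any]


RequestTransformAction = Literal['skip', 'unchanged']


def transform_request(
    request_options: RequestOptions,
) -> Union[RequestOptions, RequestTransformAction]:
    url = request_options['url']

    # Hypothetical rule: drop PDF links found in the sitemap.
    if url.endswith('.pdf'):
        return 'skip'

    # Hypothetical rule: route product pages to a dedicated handler
    # and attach custom data to the request.
    if '/product/' in url:
        request_options['label'] = 'PRODUCT'
        request_options['user_data'] = {'source': 'sitemap'}
        return request_options

    # Everything else is passed through untouched.
    return 'unchanged'
```

Such a function would then be passed as transform_request_function=transform_request when constructing SitemapRequestLoader.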
    exclude: list[re.Pattern[Any] | Glob] | None = None,
    max_buffer_size: int = 200,
    persist_state_key: str | None = None,
    transform_request_function: Callable[[RequestOptions], RequestOptions | RequestTransformAction] | None = None,
Buuut... shouldn't this also receive the URL of the origin sitemap?
I don't think that makes sense. A sitemap cannot contain links to another domain.
Because of that, users can easily map each link back to its original sitemap inside transform_request_function by comparing hosts. From my point of view, the most valuable thing transform_request_function adds is the ability to set a label so that the request is processed by the appropriate handler.
That makes a lot of sense, thanks. But I'm afraid that this won't "click" for a lot of people. Perhaps we could add an example that showcases this?
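A minimal sketch of what such an example might look like, assuming the host-based mapping discussed above. It uses the standard library's urlsplit instead of a URL class and a plain dict in place of RequestOptions so that it runs on its own; the hosts and labels in LABEL_BY_HOST are hypothetical.

```python
from typing import Any, Union
from urllib.parse import urlsplit

# Hypothetical mapping from the host of each sitemap to a handler label.
# It relies on a sitemap only containing links to its own host.
LABEL_BY_HOST = {
    'shop.example.com': 'SHOP',
    'blog.example.com': 'BLOG',
}


def transform_request(
    request_options: dict[str, Any],
) -> Union[dict[str, Any], str]:
    # Recover the originating sitemap from the host of the link itself.
    request_host = urlsplit(request_options['url']).hostname
    label = LABEL_BY_HOST.get(request_host)

    if label is None:
        # Unknown host: leave the request as it is.
        return 'unchanged'

    request_options['label'] = label
    return request_options
```

With one loader fed several sitemaps, each request would then be routed by its label to the matching handler.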
    def transform_request(
        request_options: RequestOptions,
    ) -> RequestOptions | RequestTransformAction:
        request_host = URL(request_options['url']).host
Maybe we should mention that a sitemap should only contain links to the same host here
Good point. Added.
Mantisus changed the title from "feat: add transform_request_function parameter for SitemapRequestLoader" to "feat: Add transform_request_function parameter for SitemapRequestLoader"