feat: Add `transform_request_function` parameter for `SitemapRequestLoader` by Mantisus · Pull Request #1525 · apify/crawlee-python

@Mantisus

Description

This PR is inspired by this discussion.
Add support for `transform_request_function` to `SitemapRequestLoader`. It works the same way as in `EnqueueLinksFunction`. This is useful for setting a `label` so requests are routed to the correct handler, or for attaching custom data through `user_data`.
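A minimal sketch of such a transform function, for illustration only. To keep it runnable without crawlee installed, `RequestOptions` and `RequestTransformAction` are local stand-ins for the types of the same name that crawlee exports. The URL paths, the `BLOG` label, and the `user_data` contents are hypothetical.

```python
from __future__ import annotations

from typing import Any, Literal, TypedDict
from urllib.parse import urlparse


# Stand-ins for crawlee's types (assumption: only the fields used here).
class RequestOptions(TypedDict, total=False):
    url: str
    label: str
    user_data: dict[str, Any]


RequestTransformAction = Literal['skip', 'unchanged']


def transform_request(
    request_options: RequestOptions,
) -> RequestOptions | RequestTransformAction:
    """Label blog pages, skip PDFs, and leave everything else untouched."""
    path = urlparse(request_options['url']).path

    if path.endswith('.pdf'):
        # Drop the request entirely.
        return 'skip'

    if path.startswith('/blog/'):
        # Route to a handler registered for the hypothetical 'BLOG' label.
        request_options['label'] = 'BLOG'
        request_options['user_data'] = {'source': 'sitemap'}
        return request_options

    # Keep the request exactly as the loader produced it.
    return 'unchanged'
```

With crawlee installed, the function would be passed as `transform_request_function=transform_request` when constructing `SitemapRequestLoader`, using the parameter this PR adds.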

Pijukatel

janbuchar

exclude: list[re.Pattern[Any] | Glob] | None = None,
max_buffer_size: int = 200,
persist_state_key: str | None = None,
transform_request_function: Callable[[RequestOptions], RequestOptions | RequestTransformAction] | None = None,

Buuut... shouldn't this also receive the URL of the origin sitemap?

I don't think that makes sense. A sitemap cannot contain links to another domain, so users can easily build a mapping between the sitemap's own URL and the links it yields, and apply it inside transform_request_function. From my point of view, the most valuable thing transform_request_function adds is the ability to set a label so that the request is processed by the appropriate handler.

That makes a lot of sense, thanks. But I'm afraid that this won't "click" for a lot of people. Perhaps we could add an example that showcases this?
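The mapping described in this thread could look roughly like the sketch below. It relies on the point made above that a sitemap only lists URLs on its own host, so the host of each request identifies the sitemap it came from. The sitemap URLs and labels are hypothetical, plain dictionaries stand in for crawlee's `RequestOptions`, and `urllib.parse` replaces the `yarl` `URL` helper used in the PR's own example so that the snippet has no third-party dependencies.

```python
from __future__ import annotations

from typing import Any
from urllib.parse import urlparse

# Hypothetical sitemaps, each paired with the label of the handler
# that should process the pages it lists.
SITEMAP_LABELS = {
    'https://shop.example.com/sitemap.xml': 'SHOP',
    'https://docs.example.com/sitemap.xml': 'DOCS',
}

# Derived lookup from host to label.
LABEL_BY_HOST = {
    urlparse(sitemap_url).hostname: label
    for sitemap_url, label in SITEMAP_LABELS.items()
}


def transform_request(request_options: dict[str, Any]) -> dict[str, Any] | str:
    """Set the label according to the sitemap host the URL belongs to."""
    request_host = urlparse(request_options['url']).hostname
    label = LABEL_BY_HOST.get(request_host)

    if label is None:
        # Unknown host: leave the request as it is.
        return 'unchanged'

    request_options['label'] = label
    return request_options
```

Because the lookup is derived from the sitemap list, adding another sitemap only requires one more entry in `SITEMAP_LABELS`.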

janbuchar

def transform_request(
    request_options: RequestOptions,
) -> RequestOptions | RequestTransformAction:
    request_host = URL(request_options['url']).host

Maybe we should mention here that a sitemap should only contain links to the same host.

Good point. Added.

@Mantisus changed the title from "feat: add transform_request_function parameter for SitemapRequestLoader" to "feat: Add transform_request_function parameter for SitemapRequestLoader"

Nov 10, 2025