fix: Improve `Request.user_data` serialization by janbuchar ยท Pull Request #540 ยท apify/crawlee-python

@janbuchar

@janbuchar

@janbuchar janbuchar added the t-tooling

Issues with this label are in the ownership of the tooling team.

label

Sep 23, 2024

@janbuchar

janbuchar

Comment on lines +168 to +176

async def test_invalid_user_data_serialization() -> None:
with pytest.raises(ValidationError):
Request.from_url(
'https://crawlee.dev',
user_data={
'foo': datetime(year=2020, month=7, day=4, tzinfo=timezone.utc),
'bar': {datetime(year=2020, month=4, day=7, tzinfo=timezone.utc)},
},
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where to put this one, but it is similar to the ones that follow...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can move it to a separate file test_request.py.

vdusek

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good job, thanks.

I just wonder whether we should not separate "user data" and "our Crawlee metadata". Currently, everything is stored in the user_data field. Let's say the Request would have 2 fields - user_data and crawlee_data / crawlee_metadata. Or is there any problem/drawback with this separation?

Comment on lines +168 to +176

async def test_invalid_user_data_serialization() -> None:
with pytest.raises(ValidationError):
Request.from_url(
'https://crawlee.dev',
user_data={
'foo': datetime(year=2020, month=7, day=4, tzinfo=timezone.utc),
'bar': {datetime(year=2020, month=4, day=7, tzinfo=timezone.utc)},
},
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can move it to a separate file test_request.py.

@janbuchar

I just wonder whether we should not separate "user data" and "our Crawlee metadata". Currently, everything is stored in the user_data field. Let's say the Request would have 2 fields - user_data and crawlee_data / crawlee_metadata. Or is there any problem/drawback with this separation?

Something something platform ๐Ÿ™‚ We need to persist this stuff when running on Apify, so it needs to be a part of user_data in the end. I guess we could separate the representation of "Crawlee Request" and "Payload for enqueue request API call" better, probably with/after #94.

@vdusek

Something something platform ๐Ÿ™‚ We need to persist this stuff when running on Apify, so it needs to be a part of user_data in the end. I guess we could separate the representation of "Crawlee Request" and "Payload for enqueue request API call" better, probably with/after #94.

Thanks. Let's merge it then.

@B4nan

Something something platform ๐Ÿ™‚

Correct, the request fields are validated, we cant just add something the platform doesn't expect.

@janbuchar janbuchar deleted the fix-user-data-serialization branch

September 23, 2024 12:37