bpo-31993: Do not use memoryview when pickle large strings. by serhiy-storchaka · Pull Request #5154 · python/cpython

@serhiy-storchaka

PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.

https://bugs.python.org/issue31993

@serhiy-storchaka

PyMemoryView_FromMemory() created a memoryview referring to
the internal data of the string.  When the string is destroyed
the memoryview become referring to a freed memory.

ogrisel

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. I am concerned about the high memory usage incurred by the use of PyBytes_FromStringAndSize but off-course I agree that the priority is to fix the safety issue.

@@ -2179,57 +2179,60 @@ def write(self, chunk):
def concatenate_chunks(self):
# Some chunks can be memoryview instances, we need to convert
# them to bytes to be able to call join

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should now be removed.


# Actually read the binary content of the chunks after the end
# of the call to dump: ant memoryview passed to write should not
# of the call to dump: and memoryview passed to write should not

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should read "any memoryview" instead.

self.assertGreaterEqual(9, chunk_size)
self.assertLess(chunk_size, 2 * self.FRAME_SIZE_TARGET,
chunk_sizes)
# There shouldn't bee too much small chunks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment: the protocol header, the frame headers and the large string headers are written in small chunks.

if (payload == NULL) {
payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
PyBUF_READ);
payload = mem = PyBytes_FromStringAndSize(data, data_size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces a memory copy. Wouldn't it be possible to create a read-only memoryview that keeps a reference to the original ascii str object to avoid the large memory overhead?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.

Here I just try to fix a regression introduced by #4353.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@serhiy-storchaka

serhiy-storchaka

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review @ogrisel.

if (payload == NULL) {
payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
PyBUF_READ);
payload = mem = PyBytes_FromStringAndSize(data, data_size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.

Here I just try to fix a regression introduced by #4353.

@pitrou

This looks fine to me. Does the previous NEWS file need updating?

@serhiy-storchaka

ogrisel

The picklers no longer allocate temporary memory when dumping large
``bytes`` and ``str`` objects into a file object. Instead the data is
directly streamed into the underlying file object.
The pickler now uses less memory when serialize large bytes and str

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when serializing

@serhiy-storchaka