bpo-31993: Do not use memoryview when pickle large strings. by serhiy-storchaka · Pull Request #5154 · python/cpython
PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.
PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments. I am concerned about the high memory usage incurred by the use of PyBytes_FromStringAndSize but off-course I agree that the priority is to fix the safety issue.
| @@ -2179,57 +2179,60 @@ def write(self, chunk): | |||
| def concatenate_chunks(self): | |||
| # Some chunks can be memoryview instances, we need to convert | |||
| # them to bytes to be able to call join | |||
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should now be removed.
|
|
||
| # Actually read the binary content of the chunks after the end | ||
| # of the call to dump: ant memoryview passed to write should not | ||
| # of the call to dump: and memoryview passed to write should not |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should read "any memoryview" instead.
| self.assertGreaterEqual(9, chunk_size) | ||
| self.assertLess(chunk_size, 2 * self.FRAME_SIZE_TARGET, | ||
| chunk_sizes) | ||
| # There shouldn't bee too much small chunks. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too many
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment: the protocol header, the frame headers and the large string headers are written in small chunks.
| if (payload == NULL) { | ||
| payload = mem = PyMemoryView_FromMemory((char *) data, data_size, | ||
| PyBUF_READ); | ||
| payload = mem = PyBytes_FromStringAndSize(data, data_size); |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This forces a memory copy. Wouldn't it be possible to create a read-only memoryview that keeps a reference to the original ascii str object to avoid the large memory overhead?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.
Here I just try to fix a regression introduced by #4353.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review @ogrisel.
| if (payload == NULL) { | ||
| payload = mem = PyMemoryView_FromMemory((char *) data, data_size, | ||
| PyBUF_READ); | ||
| payload = mem = PyBytes_FromStringAndSize(data, data_size); |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.
Here I just try to fix a regression introduced by #4353.
| The picklers no longer allocate temporary memory when dumping large | ||
| ``bytes`` and ``str`` objects into a file object. Instead the data is | ||
| directly streamed into the underlying file object. | ||
| The pickler now uses less memory when serialize large bytes and str |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when serializing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters