bpo-31993: Do not use memoryview when pickle large strings. by serhiy-storchaka · Pull Request #5154 · python/cpython

bpo-31993: Do not use memoryview when pickle large strings. by serhiy-storchaka · Pull Request #5154 · python/cpython

PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.

https://bugs.python.org/issue31993

PyMemoryView_FromMemory() created a memoryview referring to
the internal data of the string.  When the string is destroyed
the memoryview become referring to a freed memory.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. I am concerned about the high memory usage incurred by the use of PyBytes_FromStringAndSize but off-course I agree that the priority is to fix the safety issue.

		@@ -2179,57 +2179,60 @@ def write(self, chunk):
		def concatenate_chunks(self):
		# Some chunks can be memoryview instances, we need to convert
		# them to bytes to be able to call join

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should now be removed.


		# Actually read the binary content of the chunks after the end
		# of the call to dump: ant memoryview passed to write should not
		# of the call to dump: and memoryview passed to write should not

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should read "any memoryview" instead.

		self.assertGreaterEqual(9, chunk_size)
		self.assertLess(chunk_size, 2 * self.FRAME_SIZE_TARGET,
		chunk_sizes)
		# There shouldn't bee too much small chunks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment: the protocol header, the frame headers and the large string headers are written in small chunks.

		if (payload == NULL) {
		payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
		PyBUF_READ);
		payload = mem = PyBytes_FromStringAndSize(data, data_size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces a memory copy. Wouldn't it be possible to create a read-only memoryview that keeps a reference to the original ascii str object to avoid the large memory overhead?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.

Here I just try to fix a regression introduced by #4353.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review @ogrisel.

		if (payload == NULL) {
		payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
		PyBUF_READ);
		payload = mem = PyBytes_FromStringAndSize(data, data_size);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.

Here I just try to fix a regression introduced by #4353.

This looks fine to me. Does the previous NEWS file need updating?

		The picklers no longer allocate temporary memory when dumping large
		``bytes`` and ``str`` objects into a file object. Instead the data is
		directly streamed into the underlying file object.
		The pickler now uses less memory when serialize large bytes and str

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when serializing