Issue 8383: pickle is unable to encode unicode surrogates

Created on 2010-04-13 00:39 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg102996 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-13 00:39
Python 3 uses Unicode surrogates to store undecodable filenames. E.g. the filename b"abc\xff.py" is decoded as "abc\udcff.py" if the file system encoding is ASCII. Pickle is unable to store them:

./python -c 'import pickle; pickle.dumps("abc\udcff")'
(...)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 20: surrogates not allowed

This is a limitation of pickle (in binary mode): a Python str can hold any Unicode code point, including lone surrogates, but pickle cannot serialize them.
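
To illustrate where such strings come from, here is a minimal sketch that uses the "surrogateescape" error handler directly on a byte string (an assumption made for illustration; the report itself only describes how filenames are decoded). It also shows that the strict UTF-8 codec rejects the result, which is the failure pickle ran into:

```python
# Undecodable bytes are mapped to lone surrogates by "surrogateescape".
raw_name = b"abc\xff.py"
decoded = raw_name.decode("ascii", "surrogateescape")
print(ascii(decoded))  # 'abc\udcff.py'

# The same handler maps the surrogate back to the original byte.
restored = decoded.encode("ascii", "surrogateescape")

# The strict UTF-8 codec refuses lone surrogates.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```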

Using the "surrogatepass" error handler should be enough to fix this issue.
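
As a rough sketch of what the handler does at the codec level (not the pickle patch itself), "surrogatepass" lets the UTF-8 codec encode and decode lone surrogates as ordinary three-byte sequences:

```python
text = "abc\udcff"

# With "surrogatepass", U+DCFF is written as the bytes ED B3 BF.
encoded = text.encode("utf-8", "surrogatepass")
print(encoded)  # b'abc\xed\xb3\xbf'

# Decoding needs the same handler to get the surrogate back.
decoded = encoded.decode("utf-8", "surrogatepass")
```

Without the handler on the decoding side, the strict UTF-8 codec rejects these bytes, so both directions need to agree.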

Related issue: #3672 (Reject surrogates in utf-8 codec) -> r72208 creates the "surrogatepass" error handler.
msg102997 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-13 00:51
I found this bug indirectly: test_logging failed with a SocketHandler when LogRecord.pathname contained a surrogate character. SocketHandler uses pickle to serialize the record.
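
The following is a minimal sketch of that code path, not the actual test_logging case. The logger name, message, host, and port are placeholders chosen for illustration, and no network connection is made. It is expected to succeed only on an interpreter that includes the fix:

```python
import logging
import logging.handlers
import pickle

# A record whose pathname contains a lone surrogate (hypothetical values).
record = logging.LogRecord(
    name="demo", level=logging.INFO, pathname="abc\udcff.py",
    lineno=1, msg="hello", args=(), exc_info=None,
)

# The handler only connects when a record is emitted, so building it is cheap.
handler = logging.handlers.SocketHandler("localhost", 9020)

# makePickle() returns a 4-byte length prefix followed by the pickled record dict.
payload = handler.makePickle(record)
restored = pickle.loads(payload[4:])
handler.close()
```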
msg103022 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-13 09:01
Both pickle and marshal will need to use the new error handler in order to stay compatible with Python 3.0 (and 2.x) and also to enable creating Unicode literals that include lone surrogates.
msg103029 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-13 09:28
> Both pickle and marshal will need to use the new error handler 
> in order to stay compatible with Python 3.0 (and 2.x) 
> and also to enable creating Unicode literals that include 
> lone surrogates.

Attached patch fixes pickle. Marshal does already use surrogatepass since Martin's commit r72208 (Issue #3672).
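
A quick way to check the marshal side, as a sketch assuming an interpreter where marshal already uses "surrogatepass":

```python
import marshal

text = "abc\udcff"

# marshal serializes the string, lone surrogate included, and reads it back.
data = marshal.dumps(text)
roundtrip = marshal.loads(data)
```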
msg103030 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-13 09:44
Looks good!

Thanks.
msg103034 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-04-13 11:10
Committed: r80031 (py3k) and r80032 (3.1); the commit also fixes pickletools.
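
A minimal sketch for verifying the behaviour on a fixed interpreter. It is not taken from the committed tests: it round-trips a string with a lone surrogate through every available pickle protocol and checks that pickletools can read the string argument back:

```python
import pickle
import pickletools

text = "abc\udcff"
roundtrips = {}
tool_args = {}

for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(text, protocol=protocol)
    roundtrips[protocol] = pickle.loads(data)
    # genops() yields (opcode, arg, pos) without printing anything.
    tool_args[protocol] = [
        arg for _, arg, _ in pickletools.genops(data) if isinstance(arg, str)
    ]
```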