Message 115284 - Python tracker

Message115284

Author	kristjan.jonsson
Recipients	BreamoreBoy, anthonybaxter, brett.cannon, ezio.melotti, kristjan.jonsson, loewis, nnorwitz, theller, vstinner
Date	2010-09-01.01:23:11
SpamBayes Score	2.0217343e-09
Marked as misclassified	No
Message-id	<1283304195.04.0.0745873924129.issue1552880@psf.upfronthosting.co.za>
In-reply-to

Content
I conffess that I didn't follow the utf-8/surrogate discussion. But the utf-8 encoding can encode all valid unicode characters: UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above. (from wikipedia) If we encounter surrogate halves when encoding (unicode) to utf-8, it means that we are really trying to decode utf-16 and reencode it as utf-8. (and that python is using 16 bits for its unicode chars). the utf--8 codec should be smart enough to merge the surrogates into a utf-32 char, and encode that. Anyway, as you remark, my approach is a _patch_, designed to make python (2.x) work in an unicode environment, with the least amount of code change, for those willing to commit such a patch. In 3.x you may want to do things differently.

Content

I conffess that I didn't follow the utf-8/surrogate discussion.
But the utf-8 encoding can encode all valid unicode characters:

UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above. (from wikipedia)

If we encounter surrogate halves when encoding (unicode) to utf-8, it means that we are really trying to decode utf-16 and reencode it as utf-8.  (and that python is using 16 bits for its unicode chars).  the utf--8 codec should be smart enough to merge the surrogates into a utf-32 char, and encode that.

Anyway, as you remark, my approach is a _patch_, designed to make python (2.x) work in an unicode environment, with the least amount of code change, for those willing to commit such a patch.  In 3.x you may want to do things differently.

History
Date	User	Action	Args
2010-09-01 01:23:15	kristjan.jonsson	set	recipients: + kristjan.jonsson, loewis, nnorwitz, brett.cannon, anthonybaxter, theller, vstinner, ezio.melotti, BreamoreBoy
2010-09-01 01:23:15	kristjan.jonsson	set	messageid: <1283304195.04.0.0745873924129.issue1552880@psf.upfronthosting.co.za>
2010-09-01 01:23:13	kristjan.jonsson	link	issue1552880 messages
2010-09-01 01:23:12	kristjan.jonsson	create