[Python-Dev] please consider changing --enable-unicode default to ucs4
Adam Olsen
rhamph at gmail.com
Thu Oct 8 02:10:25 CEST 2009
On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx <zookog at gmail.com> wrote:
> On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
>
> That would be an improvement! Unfortunately we instead get mysterious
> misbehavior of the module, e.g.:
>
> http://bugs.python.org/setuptools/msg309
> http://allmydata.org/trac/tahoe/ticket/704#comment:5

The real issue here is the confusion caused by Python's option being misnamed. We support UTF-16 and UTF-32, not UCS-2 and UCS-4. This means that when decoding UTF-8, any scalar value outside the BMP will be split into a pair of surrogates on UTF-16 builds; if we were using UCS-2 that'd be an error instead (and *nothing* would understand surrogates.)

Yet we are getting an error here. However, if you look at the details you'll notice it's on a 6-byte UTF-8 code unit sequence, corresponding in the second link to U+6E657770. Although UTF-8 originally left open the possibility of encoding up to 31 bits (or U+7FFFFFFF), this was removed in RFC 3629 and is now strictly prohibited. The modern Unicode character set itself also imposes that restriction. There is nothing beyond U+10FFFF. Nothing should create such a high code point, and even if it happened internally, an RFC 3629-conformant UTF-8 encoder must refuse to pass it through.

Something more subtle must be going on. Possibly several bugs (such as a non-conformant encoder or garbage being misinterpreted as UTF-8).

-- 
Adam Olsen, aka Rhamphoryncus
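To illustrate the 6-byte sequence discussed above, here is a minimal sketch in modern Python 3 (not the 2.x interpreters the thread is about). The helper names `encode_legacy_6byte`, `decode_legacy_6byte` and `strict_decoder_accepts` are hypothetical and written only for this illustration; they model the obsolete pre-RFC 3629 6-byte form and compare it with the behavior of Python's built-in strict UTF-8 codec. It is not a reconstruction of what the failing code in the linked tickets actually did.

```python
MAX_SCALAR = 0x10FFFF  # highest code point allowed by Unicode and RFC 3629


def encode_legacy_6byte(value):
    """Encode a value in the obsolete 6-byte form: 1111110x + 5 continuation bytes."""
    if not 0x4000000 <= value <= 0x7FFFFFFF:
        raise ValueError("the 6-byte form only covers 0x4000000..0x7FFFFFFF")
    lead = 0xFC | (value >> 30)  # the lead byte carries the top 1 bit
    tail = [0x80 | ((value >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead] + tail)


def decode_legacy_6byte(data):
    """Decode the obsolete 6-byte form back into a 31-bit integer."""
    if len(data) != 6 or data[0] & 0xFE != 0xFC:
        raise ValueError("not a 6-byte lead sequence")
    value = data[0] & 0x01
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def strict_decoder_accepts(data):
    """Report whether Python's built-in (RFC 3629-style) UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw = encode_legacy_6byte(0x6E657770)
value = decode_legacy_6byte(raw)

print(raw.hex())                    # the six code units of the legacy sequence
print(value > MAX_SCALAR)           # True: far beyond the last valid code point
print(strict_decoder_accepts(raw))  # False: a strict decoder rejects the sequence
```

Under these assumptions, the legacy decoder happily yields 0x6E657770, while the strict codec refuses the same bytes, which matches the point that a conformant encoder or decoder should never let such a value through. As an aside, the four bytes of 0x6E657770 are the ASCII text "newp", which is at least consistent with the suggestion that garbage was being misinterpreted as UTF-8.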