[Python-Dev] Unicode exception indexing
Victor Stinner
victor.stinner at haypocalc.com
Thu Nov 3 20:16:21 CET 2011
On Thursday, 3 November 2011 at 18:14:42, martin at v.loewis.de wrote:
> There is a backwards compatibility issue with PEP 393 and Unicode
> exceptions: the start and end indices: are they Py_UNICODE indices, or
> code point indices?

Oh oh. That's exactly why I didn't want to start working on this issue.
http://bugs.python.org/issue13064

In a Python error handler, exc.object[exc.start:exc.end] should be used to get the unencodable/undecodable substring. In a C error handler, it depends on whether you use a Py_UNICODE* pointer or PyUnicode_Substring() / PyUnicode_READ.

Using google.fr/codesearch, I found some user error handlers implemented in Python:

* straw: "html_replace"
* Nuxeo: "latin9_fallback"
* peerscape: "htmlentityescape"
* pymt: "cssescape"
* ....

I found no error handler implemented in C (no call to PyCodec_RegisterError).

> So what should it be?

I suggest using code point indices. Code point indices are also more "natural" now with PEP 393. Because it is an incompatible change, it should be documented in the PEP and in the "What's new in Python 3.3" document.

> As a compromise, it would be possible to convert between these indices,
> by counting the non-BMP characters that precede the index if the indices
> might differ.

I started such a hack for the UTF-8 codec... It is really tricky, we should not do that!

> That would be expensive to compute

Yeah, O(n) should be avoided when it is possible.

--
FYI I implemented a proof-of-concept in Python of the surrogateescape error handler for Python 2 (for Mercurial):
https://bitbucket.org/haypo/misc/src/tip/python/surrogateescape.py

Victor
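
To illustrate the two points discussed in the message, here is a minimal sketch, assuming Python 3.3 or later, where exc.start and exc.end are code point indices. The handler name "charref_sketch" and the helper codepoint_to_utf16_index() are hypothetical names chosen for this example; they are not taken from the projects listed above or from CPython. The first function shows the exc.object[exc.start:exc.end] slicing in a Python error handler; the second shows the kind of conversion the quoted compromise would require, and why it costs O(n).

```python
import codecs


def charref_sketch(exc):
    """Hypothetical error handler: replace unencodable text with &#N; references."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors.
        raise exc
    # The slice works the same way whatever unit the indices are expressed in,
    # as long as exc.object is indexed in that same unit.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in bad)
    # Return the replacement and the position at which encoding resumes.
    return replacement, exc.end


codecs.register_error("charref_sketch", charref_sketch)


def codepoint_to_utf16_index(text, index):
    """Hypothetical helper: convert a code point index to a UTF-16 code unit index.

    Every non-BMP character before the index takes two UTF-16 code units,
    so all preceding characters must be scanned: this is the O(n) cost.
    """
    return index + sum(1 for ch in text[:index] if ord(ch) > 0xFFFF)


encoded = "caf\xe9 \U0001F600".encode("ascii", "charref_sketch")
utf16_index = codepoint_to_utf16_index("a\U0001F600b", 2)
```

With code point indices, a handler written like charref_sketch() needs no conversion at all; only code that works on UTF-16 code units would need something like the second helper.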