UTF-8 question from Dive into Python 3
Terry Reedy
tjreedy at udel.edu
Wed Jan 19 17:33:43 EST 2011
On 1/19/2011 1:02 PM, Tim Harig wrote:

> Right, but I only have to do that once.  After that, I can directly address
> any piece of the stream that I choose.  If I leave the information as a
> simple UTF-8 stream, I would have to walk the stream again, I would have to
> walk through the first byte of all the characters from the beginning to
> make sure that I was only counting multibyte characters once until I found
> the character that I actually wanted.  Converting to a fixed byte
> representation (UTF-32/UCS-4) or separating all of the bytes for each
> UTF-8 into 6 byte containers both make it possible to simply index the
> letters by a constant size.  You will note that Python does the former.

The idea of using a custom fixed-width padded version of a UTF-8 stream
was initially shocking to me, but I can imagine that there are specialized
applications, which slice-and-dice uninterpreted segments, for which that
is appropriate. However, it is not germane to the folly of prefixing
standard UTF-8 streams with a 3-byte magic number, mislabelled a
'byte-order mark', thus making them non-standard.

-- 
Terry Jan Reedy
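For readers following the thread, here is a minimal Python 3 sketch of the two points under discussion: finding the n-th character in a UTF-8 stream means walking lead bytes from the start, while a fixed-width encoding such as UTF-32 allows direct indexing, and a UTF-8 "BOM" is just three extra bytes at the front of the stream. The helper names `nth_char_utf8` and `nth_char_utf32` and the sample string are made up for illustration; they are not part of any library or of the original posts.

```python
import codecs


def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of a UTF-8 byte string by scanning it."""
    seen = -1
    for start, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; skip them so each
        # multibyte character is counted once, at its lead byte.
        if byte & 0xC0 == 0x80:
            continue
        seen += 1
        if seen == n:
            end = start + 1
            while end < len(data) and data[end] & 0xC0 == 0x80:
                end += 1
            return data[start:end].decode("utf-8")
    raise IndexError(n)


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of a BOM-less UTF-32-LE byte string."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError(n)
    # Every code point occupies exactly 4 bytes, so no scan is needed.
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve café ☃"                 # hypothetical sample text
utf8 = text.encode("utf-8")           # variable width: 1 to 4 bytes each
utf32 = text.encode("utf-32-le")      # fixed width: 4 bytes each

print(nth_char_utf8(utf8, 11), nth_char_utf32(utf32, 11))

# The 3-byte prefix discussed above: a plain "utf-8" decoder keeps it as
# U+FEFF in the text, while the "utf-8-sig" codec strips it.
with_bom = codecs.BOM_UTF8 + utf8
print(repr(with_bom.decode("utf-8")[0]))
print(with_bom.decode("utf-8-sig") == text)
```

The scan in `nth_char_utf8` is linear in the position requested, which is the cost Tim describes; the UTF-32 version trades memory for constant-time indexing.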