What encoding does u'...' syntax use?
Adam Olsen
rhamph at gmail.com
Sat Feb 21 15:39:22 EST 2009
More information about the Python-list mailing list
On Feb 21, 10:48 am, a... at pythoncraft.com (Aahz) wrote:
> In article <499F397C.7030... at v.loewis.de>,
> "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> >> Yes, I know that. But every concrete representation of a unicode string
> >> has to have an encoding associated with it, including unicode strings
> >> produced by the Python parser when it parses the ascii string "u'\xb5'"
> >>
> >> My question is: what is that encoding?
> >
> >The internal representation is either UTF-16, or UTF-32; which one is
> >a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
> countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer to
Unicode 1.1 and earlier, with no surrogates. We target Unicode 5.1.

If you naively encode UCS-2 as UTF-8 you really end up with CESU-8. You
miss the step where you combine surrogate pairs (which only exist in
UTF-16) into a single supplementary character. Lo and behold, that's
actually what current python does in some places. It's not pretty.

See bugs #3297 and #3672.
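
To make the CESU-8 point concrete, here is a rough sketch of the difference. It is written for Python 3 and is only an illustration: the helper names (utf16_code_units, naive_cesu8, combine_surrogates, proper_utf8) are made up for this example and are not anything from the interpreter or from the bugs mentioned above. The "naive" path treats every 16-bit code unit as a character of its own; the "proper" path joins surrogate pairs first.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def naive_cesu8(text):
    """Encode each UTF-16 code unit separately, as if it were a UCS-2 character.

    Surrogates are not combined, so a supplementary character comes out
    as two 3-byte sequences (CESU-8) instead of one 4-byte sequence.
    """
    out = bytearray()
    for unit in utf16_code_units(text):
        # "surrogatepass" lets a lone surrogate through the UTF-8 encoder.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def combine_surrogates(units):
    """The step the naive version skips: join high/low surrogate pairs."""
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(code_point))
            i += 2
        else:
            chars.append(chr(unit))
            i += 1
    return "".join(chars)


def proper_utf8(text):
    """Combine surrogate pairs first, then encode as UTF-8."""
    return combine_surrogates(utf16_code_units(text)).encode("utf-8")


supplementary = "\U00010400"  # a character outside the BMP
print(naive_cesu8(supplementary).hex())  # ed a0 81 ed b0 80 (six bytes)
print(proper_utf8(supplementary).hex())  # f0 90 90 80 (four bytes)
```

For anything inside the BMP, such as the "\xb5" from the original question, the two functions give identical bytes, which is why the problem only shows up with supplementary characters.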