Encoding of surrogate code points to UTF-8
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Oct 8 18:20:58 EDT 2013
More information about the Python-list mailing list
Tue Oct 8 18:20:58 EDT 2013
- Previous message (by thread): Encoding of surrogate code points to UTF-8
- Next message (by thread): Encoding of surrogate code points to UTF-8
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote: > The only time you should get a surrogate pair in a Unicode string is in > a narrow build, which doesn't exist in Python 3.3 and later. Incorrect. py> sys.version '3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)]' py> s = '\ud800\udc01' py> print(len(s)) 2 py> import unicodedata as ud py> for c in s: ... print(ud.category(c)) ... Cs Cs s is a string containing two code points making up a surrogate pair. It is very frustrating that the Unicode FAQs don't always clearly distinguish between when they are talking about bytes and when they are talking about code points. This area about surrogates is one of places where they conflate the two. -- Steven
- Previous message (by thread): Encoding of surrogate code points to UTF-8
- Next message (by thread): Encoding of surrogate code points to UTF-8
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Python-list mailing list