Python's handling of unicode surrogates
Pete Forman
pete.forman at westerngeco.com
Tue Apr 24 07:43:16 EDT 2007
Ross Ridge <rridge at caffeine.csclub.uwaterloo.ca> writes:

> The Unicode standard doesn't require that you support surrogates,
> or any other kind of character, so no you wouldn't be lying.

+1 on Ross Ridge's contributions to this thread.

If Unicode is processed using the UTF-8 or UTF-32 encoding forms then
there are no surrogates.  They would only be present in UTF-16.
CESU-8 is strongly discouraged.

A Unicode 16-bit string is allowed to be ill-formed as UTF-16.  The
example the standard gives is one string that ends with a high
surrogate code unit and another that starts with a low surrogate code
unit.  Each is ill-formed on its own, but the result of concatenating
them is a valid UTF-16 string.

The above refers to the Unicode standard.  In Python with narrow
Py_UNICODE a unicode string is a sequence of 16-bit code units.  It
is up to the programmer whether they want to handle surrogates
specially.  Operations based on concatenation will conform to
Unicode, whether or not there are surrogates in the strings.

-- 
Pete Forman                     -./\.-  Disclaimer: This post is originated
WesternGeco                     -./\.-  by myself and does not represent
pete.forman at westerngeco.com  -./\.-  the opinion of Schlumberger or
http://petef.port5.com          -./\.-  WesternGeco.
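To make the points above concrete, here is a rough sketch.  It assumes a
current Python 3, where str is a sequence of code points rather than a
narrow Py_UNICODE build, so the 16-bit code units are recovered by
encoding to UTF-16.  The helper names (utf16_units, is_well_formed,
units_to_str) are made up for this illustration, and the "surrogatepass"
error handler is used only to let lone surrogates through the codec.

```python
# Sketch only: modern Python 3, not a narrow Py_UNICODE build.
# Helper names here are hypothetical, chosen for this illustration.

def utf16_units(s):
    """16-bit code units of s as UTF-16; lone surrogates pass through."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF

def is_well_formed(s):
    """True if s can be encoded as strict (well-formed) UTF-16."""
    try:
        s.encode("utf-16-le")  # strict mode rejects lone surrogates
        return True
    except UnicodeEncodeError:
        return False

def units_to_str(units):
    """Decode a list of 16-bit code units as strict UTF-16."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le")

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Surrogates show up only in the UTF-16 encoding form.
pair = utf16_units(clef)                                     # two surrogate units
utf32_unit = int.from_bytes(clef.encode("utf-32-le"), "little")  # one 32-bit unit
utf8_bytes = clef.encode("utf-8")                            # four bytes, no surrogates

# Two strings that are each ill-formed as UTF-16 ...
left = "a" + chr(pair[0])    # ends with a high surrogate
right = chr(pair[1]) + "b"   # starts with a low surrogate

# ... whose concatenated code units form well-formed UTF-16.
joined_units = utf16_units(left) + utf16_units(right)
joined = units_to_str(joined_units)

print([hex(u) for u in pair])
print(is_well_formed(left), is_well_formed(right), is_well_formed(joined))
```

Note that left + right on a Python 3 str keeps two separate surrogate
code points; the concatenation is done on the code units here because
that is the level at which the 16-bit string argument applies.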