Becoming Unicode Aware
Carl Banks
imbosol at aerojockey.com
Thu Oct 28 05:13:46 EDT 2004
More information about the Python-list mailing list
fuzzyman at gmail.com (Michael Foord) wrote in message news:<6f402501.0410270256.13cf5727 at posting.google.com>...
> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them ?

Generally speaking, you have to ask (either the user or the software).
There's no reliable way to tell what encoding you're looking at
without someone or something telling you; you might be able to make a
heuristic guess, but that's it.

> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file ? I know
> python scripts have a method - should I just use that ?

It's a good method if you expect people to be editing the config file
with Emacs. It's a good enough method if you haven't any good reason
to use another method.

> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in :
>
> def afunction(setoflines, encoding='ascii'):
>     for line in setoflines:
>         if encoding:
>             line = line.decode(encoding)

For most encodings, this'll work fine. But there are some encodings,
for example UTF-16, that won't work with it. UTF-16 fails for two
reasons: splitting the raw bytes into lines can cut a two-byte code
unit in half (the newline itself is two bytes), and UTF-16 text
normally begins with a two-byte byte order mark indicating endianness,
which would be at the beginning of the file but not of each line.

Fortunately, most text files aren't in UTF-16. I mention this so that
you are aware that, although afunction works in most cases, it is not
universal. I believe it's the purpose of the StreamReader and
StreamWriter classes in the codecs module to deal with such
situations.

-- 
CARL BANKS
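To illustrate the UTF-16 point, here is a rough sketch of the StreamReader approach, assuming Python 3 (where raw file data is bytes) rather than the Python 2 of the quoted code. The name decode_lines and the sample data are made up for the example; only codecs.getreader and the io module come from the standard library.

```python
import codecs
import io


def decode_lines(raw_stream, encoding="ascii"):
    """Yield decoded lines from a binary stream (hypothetical helper).

    The stream is wrapped in the codec's StreamReader, so decoding happens
    on the stream as a whole instead of on each raw line separately.
    """
    reader = codecs.getreader(encoding)(raw_stream)
    for line in reader:
        yield line


# Made-up sample data: a byte order mark followed by little-endian UTF-16.
text = "name = Zoë\nvalue = 1\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding line by line: iterating the byte stream splits after the 0x0A
# byte, which is only the first half of the two-byte newline.
naive_failed = False
try:
    for chunk in io.BytesIO(raw):
        chunk.decode("utf-16")
except UnicodeDecodeError:
    naive_failed = True

# Decoding through the StreamReader handles the BOM and the two-byte units.
lines = list(decode_lines(io.BytesIO(raw), "utf-16"))
```

With a single-byte encoding such as ASCII or Latin-1, both approaches give the same lines; the difference only shows up for encodings like UTF-16.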