Umlauts, encodings, sitecustomize.py
Jeff Epler
jepler at unpythonic.net
Tue Nov 9 15:07:58 EST 2004
More information about the Python-list mailing list
On Tue, Nov 09, 2004 at 07:52:58PM +0100, F. GEIGER wrote:
> "Jeff Epler" <jepler at unpythonic.net> wrote in message
> news:mailman.6167.1100019195.5135.python-list at python.org...
>
> > You should note that chr(0x84) is *not* a-umlaut in iso-8859-1. That's
> > chr(0xe4). You may be using one of these Windows-specific encodings:
> >     cp437.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
> >     cp775.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
> >     cp850.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
> >     cp852.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
> >     cp857.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
> >     cp861.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
> >     cp865.py: 0x0084: 0x00e4, # LATIN SMALL LETTER A WITH DIAERESIS
>
> I'm not sure what you mean by this. Do you mean I use one of these
> accidentally? Or should I switch to one of these in my sitecustomize.py?
>
> I'm a bit confused. When I let Python print an ä (umlaut a) by simply
> entering the 1-char string "ä", it prints '\x84'.

In the encoding iso-8859-1, the character chr(0xe4) is LATIN SMALL LETTER A
WITH DIAERESIS, and chr(0x84) is not a printable character. In the encodings
I named above, chr(0x84) is LATIN SMALL LETTER A WITH DIAERESIS.

Now, consider this program that creates a program:

    def maker(filename, encoding, ch):
        f = open(filename, "w")
        # First line declares the source encoding of the generated script.
        f.write("# -*- coding: %s -*-\n" % encoding)
        f.write("print '%s'\n" % ch)
        f.close()

If you call
    maker("coded.py", "iso-8859-1", "\xe4")
the created script will contain a byte string literal with the byte '\xe4'
in it. When you run the script, it will print that byte followed by the
byte '\n'.
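
The chr(0x84) versus chr(0xe4) distinction above can be checked against
Python's own codec tables. The following is only a minimal sketch, and it
assumes Python 3 rather than the Python 2 used in this thread, so it decodes
a bytes object instead of calling chr(). The helper name describe and its
fallback label are made up for illustration.

```python
import unicodedata

# The code pages listed in the quoted message.
CODE_PAGES = ["cp437", "cp775", "cp850", "cp852", "cp857", "cp861", "cp865"]


def describe(byte_value, encoding):
    """Decode one byte with `encoding` and return the Unicode name of the result.

    Hypothetical helper, not part of any library.
    """
    ch = bytes([byte_value]).decode(encoding)
    # Control characters have no Unicode name, so pass a fallback label
    # instead of letting unicodedata.name() raise ValueError.
    return unicodedata.name(ch, "<unnamed control character U+%04X>" % ord(ch))


for enc in CODE_PAGES:
    print("%-10s 0x84 -> %s" % (enc, describe(0x84, enc)))

print("iso-8859-1 0x84 -> %s" % describe(0x84, "iso-8859-1"))
print("iso-8859-1 0xe4 -> %s" % describe(0xE4, "iso-8859-1"))
```

If a terminal shows '\x84' for a typed a-umlaut, that suggests the input is
arriving in one of these code pages rather than in iso-8859-1.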

*In fact, this behavior (the sequence of bytes written to sys.stdout)
doesn't depend on the encoding, as long as
    '\xe4'.decode(encoding).encode(encoding) == '\xe4'
which should hold true in almost all single-byte encodings.*

What you *see* when you run the script depends on the meaning your terminal
window ("DOS box") assigns to the byte sequence '\xe4\n'. On mine, which
expects output in UTF-8, I get a mark which indicates an incomplete
multi-byte character and then a newline. On yours, you apparently get some
other character, possibly LATIN SMALL LETTER O WITH TILDE if your terminal
uses cp775, cp850, or cp857.

Now, consider this program with a u''-string literal:

    def umaker(filename, encoding, ch):
        f = open(filename, "w")
        # Same as maker(), but the generated literal is a unicode string.
        f.write("# -*- coding: %s -*-\n" % encoding)
        f.write("print u'%s'\n" % ch)
        f.close()

If you call
    umaker("ucoded.py", "iso-8859-1", "\xe4")
the created script will again contain the literal byte "\xe4". When you run
the script, you may get an error that says
    UnicodeError: ASCII encoding error: ordinal not in range(128)
This is because the string to be printed is a unicode string containing the
letter LATIN SMALL LETTER A WITH DIAERESIS, but Python believes the terminal
can only accept ASCII-encoded strings for display.

In my Python 2.3 on Unix, sys.stdout.encoding is "UTF-8", and running
ucoded.py outputs the 3-byte sequence "\303\244\n", which in UTF-8 is a
LATIN SMALL LETTER A WITH DIAERESIS followed by a newline.

I suspect that wxpython is like tkinter: it is designed so that u''-strings
(unicode strings) can be given as arguments anywhere strings can, and
internally the necessary steps are taken to find the proper glyphs in the
font to display that string. Otherwise, there may be a particular encoding
assumed for all byte strings, which will have no relationship to the
-*- coding -*- of your scripts.

Jeff
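
The terminal-dependence described above can be imitated without a terminal.
What follows is a rough sketch under stated assumptions: it uses Python 3
(not the Python 2.3 of the message), and it stands in for "what the terminal
shows" by decoding the byte with Python's codec tables, which a real terminal
does on its own. The names terminal_view and round_trips are hypothetical.

```python
import unicodedata

RAW = b"\xe4"  # the single byte the generated coded.py writes before '\n'


def terminal_view(raw, encoding):
    """Name the character a terminal using `encoding` would show for `raw`.

    Hypothetical helper: this only mimics a terminal via Python's codecs.
    """
    try:
        return unicodedata.name(raw.decode(encoding))
    except UnicodeDecodeError:
        # For example, a lone 0xe4 is an incomplete sequence in UTF-8.
        return "<undecodable in %s>" % encoding


def round_trips(raw, encoding):
    """The footnote's condition: decode followed by encode gives the same bytes."""
    return raw.decode(encoding).encode(encoding) == raw


for enc in ["iso-8859-1", "cp850", "cp857", "utf-8"]:
    print("%-10s %s" % (enc, terminal_view(RAW, enc)))

# The u''-literal case: the unicode character itself, encoded for output.
A_UMLAUT = "\u00e4"
print(A_UMLAUT.encode("utf-8"))  # two bytes, octal \303 \244

try:
    A_UMLAUT.encode("ascii")
except UnicodeEncodeError as exc:
    # ASCII has no representation for this character.
    print("ascii:", exc.reason)
```

The same byte gets a different name under each single-byte encoding, while
the unicode character only becomes bytes once an output encoding is chosen.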