Issue4213
Created on 2008-10-27 13:29 by christian.heimes, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| get_codeset.patch | vstinner, 2008-10-27 14:35 | Use lookup(codeset).name as charset |
| Messages (10) | |||
|---|---|---|---|
| msg75252 | Author: Christian Heimes (christian.heimes) |
Date: 2008-10-27 13:29 | |
Python should lower-case the file system encoding in Python/pythonrun.c. On several occasions Python optimizes code paths for lower-case encodings like "utf-8" or "latin-1". On my Ubuntu system the file system encoding is upper case ("UTF-8") and the optimizations aren't used.

This also causes problems with sub interpreters (#3723): initstdio() in the sub interpreter fails because "UTF-8" must be looked up in the codecs and encoding registry, while "utf-8" works like a charm.

$ python2.6 -c "import sys; print sys.getfilesystemencoding()"
UTF-8
$ python3.0 -c "import sys; print(sys.getfilesystemencoding())"
UTF-8
$ locale
LANG=de_DE.UTF-8
LANGUAGE=en_US:en:de_DE:de
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

The patch is trivial:

    if (codeset) {
        if (!Py_FileSystemDefaultEncoding) {
            char *p;
            for (p=codeset; *p; p++)
                *p = tolower(*p);
            Py_FileSystemDefaultEncoding = codeset;
        } else
            free(codeset);
    }

Python/codecs.c:normalizestring() does a similar job. Maybe a new method "char* PyCodec_NormalizeEncodingName(const char*)" could be introduced for the problem. |
|||
| msg75253 | Author: STINNER Victor (vstinner) |
Date: 2008-10-27 14:14 | |
Converting to lower case doesn't solve the problem: if the locale is "utf8" and Python checks for "utf-8", the optimization will fail. Another example: iso-8859-1, latin-1 or latin1? A correct patch would be to get the most common name of the charset and make sure that Python C code always uses this name. |
|||
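Victor's objection in msg75253 can be illustrated with a minimal sketch. It assumes a fast path that compares the encoding name against exact string literals; the `FAST_PATH_NAMES` set and the `hits_fast_path` helper are hypothetical and are not taken from the CPython sources.

```python
# Hypothetical literals a fast path might compare against (an assumption,
# not copied from CPython).
FAST_PATH_NAMES = {"utf-8", "latin-1"}


def hits_fast_path(codeset):
    """Return True if the lower-cased name equals a fast-path literal."""
    return codeset.lower() in FAST_PATH_NAMES


# Lower-casing only helps when the spelling already matches apart from case.
for name in ("UTF-8", "utf8", "UTF8", "ISO-8859-1", "latin1"):
    print(name, hits_fast_path(name))
```

Only "UTF-8" matches after lower-casing; the other spellings are valid aliases but differ by more than case, which is why lower-casing alone does not enable the optimization for them.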
| msg75254 | Author: Marc-Andre Lemburg (lemburg) |
Date: 2008-10-27 14:30 | |
The lower-casing doesn't hurt, since that's done anyway during codec lookup, but I'd be -1 on making this try to duplicate the aliasing already done by the encodings package. |
|||
| msg75255 | Author: STINNER Victor (vstinner) |
Date: 2008-10-27 14:35 | |
Here is a patch to get the "most common charset name": use codecs.lookup(codeset).name. |
|||
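The approach described in msg75255 can be tried from Python itself. This is a sketch of the idea only, not the contents of get_codeset.patch (which works at the C level); the `canonical_codec_name` helper is a hypothetical name introduced for illustration.

```python
import codecs


def canonical_codec_name(codeset):
    """Return the name Python's codec registry reports for an encoding alias."""
    # codecs.lookup() resolves aliases and raises LookupError for unknown names.
    return codecs.lookup(codeset).name


# Different spellings of the same charset resolve to one registry name.
for alias in ("UTF-8", "utf8", "UTF8", "latin-1", "latin1", "ISO-8859-1"):
    print(alias, "->", canonical_codec_name(alias))
```

Because the alias resolution is delegated to the codec registry, this avoids duplicating the aliasing already done by the encodings package, which was the concern raised in msg75254.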
| msg75256 | Author: Christian Heimes (christian.heimes) |
Date: 2008-10-27 15:03 | |
Victor's patch fixes the issue with #3723. |
|||
| msg75257 | Author: Marc-Andre Lemburg (lemburg) |
Date: 2008-10-27 15:05 | |
+1 on adding Victor's patch. |
|||
| msg75276 | Author: Christian Heimes (christian.heimes) |
Date: 2008-10-28 11:15 | |
Me, too! The solution is elegant and works well. Barry still has to accept the patch, though. |
|||
| msg75384 | Author: Christian Heimes (christian.heimes) |
Date: 2008-10-30 21:40 | |
Fixed in r67055 |
|||
| msg75387 | Author: Martin v. Löwis (loewis) |
Date: 2008-10-30 22:14 | |
> The solution is elegant and works well.

I can't agree with that evaluation. In cases where Python would fail without this patch (i.e. because the file system encoding cannot be found during startup), this solution doesn't work well in general: it only works if the file system encoding happens to be UTF-8. If the file system encoding is not in the list of "builtin" codec names, startup would still fail.

r67057 addresses this case in a somewhat more general manner, by falling back to ASCII during startup for encoding file names. This should work in the usual case where Python is in /usr/bin (say), but it's still possible to make it fail. For example, if the codecs are in /home/Питон (say) on a system that uses koi8-r as the file system encoding, this bug would persist despite the two patches that have been applied. |
|||
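The failure case Martin describes in msg75387 comes down to whether a path can be encoded with the fallback encoding. The sketch below does not reproduce Python's startup code; it only checks encodability, and the `can_encode` helper is a hypothetical name introduced for illustration.

```python
def can_encode(path, encoding):
    """Return True if path can be converted to bytes with the given encoding."""
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# An ASCII-only path survives the ASCII fallback; a Cyrillic one does not,
# although the real file system encoding (koi8-r here) could represent it.
for path in ("/usr/bin", "/home/Питон"):
    for encoding in ("ascii", "koi8-r"):
        print(path, encoding, can_encode(path, encoding))
```

This is why the ASCII fallback is enough for an installation under /usr/bin but not for one whose codec files live under a non-ASCII directory.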
| msg75392 | Author: STINNER Victor (vstinner) |
Date: 2008-10-30 22:38 | |
On Thursday 30 October 2008 23:14:21, Martin v. Löwis wrote:
> I can't agree with that evaluation. In cases where Python would fail
> without this patch (i.e. because the file system encoding cannot be
> found during startup),

My patch doesn't change how Python gets the file system encoding: it just gets the "Python charset name" (e.g. "utf-8" instead of "UTF8", or "iso8859-1" instead of "latin-1"). The goal was to enable the optimizations, especially with utf-8. It's not related to #3723. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:40 | admin | set | nosy: + benjamin.peterson; github: 48463 |
| 2008-10-30 22:38:38 | vstinner | set | messages: + msg75392 |
| 2008-10-30 22:14:20 | loewis | set | nosy: + loewis; messages: + msg75387 |
| 2008-10-30 21:40:25 | christian.heimes | set | status: open -> closed; resolution: accepted -> fixed; messages: + msg75384 |
| 2008-10-28 11:15:49 | christian.heimes | set | resolution: accepted; messages: + msg75276 |
| 2008-10-27 15:05:33 | lemburg | set | messages: + msg75257 |
| 2008-10-27 15:03:07 | christian.heimes | set | messages: + msg75256 |
| 2008-10-27 14:35:16 | vstinner | set | files: + get_codeset.patch; messages: + msg75255 |
| 2008-10-27 14:30:36 | lemburg | set | nosy: + lemburg; messages: + msg75254 |
| 2008-10-27 14:14:39 | vstinner | set | nosy: + vstinner; messages: + msg75253 |
| 2008-10-27 13:29:28 | christian.heimes | link | issue3723 dependencies |
| 2008-10-27 13:29:14 | christian.heimes | create | |
