# Issue 11322: encoding package's normalize_encoding() function is too slow
Created on 2011-02-25 15:55 by lemburg, last changed 2022-04-11 14:57 by admin.
## Files

| File name | Uploaded | Description | Edit |
|---|---|---|---|
| encoding_normalize_optimize.patch | methane, 2016-12-15 09:24 | | review |

## Messages (10)
### msg129386 - Marc-Andre Lemburg (lemburg), 2011-02-25 15:55
I don't know who changed the encodings package's normalize_encoding() function (it wasn't me), but it's a really slow implementation.

The original version used the .translate() method, which is a lot faster, and can be adapted to work with the Unicode variant of .translate() just as well:

```python
_norm_encoding_map = ('                                              . '
                      '0123456789       ABCDEFGHIJKLMNOPQRSTUVWXYZ     '
                      ' abcdefghijklmnopqrstuvwxyz                     '
                      '                                                '
                      '                                                '
                      '                ')

def normalize_encoding(encoding):
    """ Normalize an encoding name.

        Normalization works as follows: all non-alphanumeric
        characters except the dot used for Python package names are
        collapsed and replaced with a single underscore, e.g. '  -;#'
        becomes '_'. Leading and trailing underscores are removed.

        Note that encoding names should be ASCII only; if they do use
        non-ASCII characters, these must be Latin-1 compatible.
    """
    # Make sure we have an 8-bit string, because .translate() works
    # differently for Unicode strings.
    if hasattr(__builtin__, "unicode") and isinstance(encoding, unicode):
        # Note that .encode('latin-1') does *not* use the codec
        # registry, so this call doesn't recurse. (See unicodeobject.c
        # PyUnicode_AsEncodedString() for details)
        encoding = encoding.encode('latin-1')
    return '_'.join(encoding.translate(_norm_encoding_map).split())
```
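Marc-Andre's Python 2 code can be transposed to Python 3, where `str.translate()` takes a mapping of code points rather than a 256-character table. A minimal sketch of that adaptation (the name `normalize_encoding` mirrors the original; this is an illustration, not the stdlib implementation):

```python
# Sketch of a Python 3 analogue of the translate()-based approach:
# ASCII alphanumerics and '.' pass through, everything else in the
# ASCII range becomes a space; split()/join() then collapses the runs
# of spaces into single underscores.
import string

_keep = string.ascii_letters + string.digits + '.'
_norm_map = {i: (chr(i) if chr(i) in _keep else ' ') for i in range(128)}

def normalize_encoding(encoding):
    """Collapse runs of non-alphanumeric characters to single underscores."""
    if isinstance(encoding, bytes):
        encoding = encoding.decode('latin-1')
    # Code points >= 128 are absent from the map, so translate() leaves
    # them unchanged, matching the "keeps non-ASCII letters" behaviour
    # discussed below.
    return '_'.join(encoding.translate(_norm_map).split())

print(normalize_encoding(' -;#utf 8 '))  # utf_8
print(normalize_encoding(b'latin-1'))    # latin_1
```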
### msg129389 - Alexander Belopolsky (belopolsky), 2011-02-25 16:34
I don't think the normalize_encoding() function was the culprit for issue11303: I measured timings with timeit, which averages multiple runs, while normalize_encoding() is called only once per encoding spelling, thanks to caching.
### msg129460 - STINNER Victor (vstinner), 2011-02-25 23:03
We should first make the 3 normalization functions implement the same algorithm and add tests for them (at least for the function in encodings):

- normalize_encoding() in encodings: it doesn't convert to lowercase and keeps non-ASCII letters
- normalize_encoding() in unicodeobject.c
- normalizestring() in codecs.c

normalize_encoding() in encodings is more lax than the two other functions: it normalizes " utf 8 " to 'utf_8'. But it doesn't convert to lowercase and keeps non-ASCII letters: "UTF-8é" is normalized to "UTF_8é".

I don't know whether the normalization functions should be more or less strict, but I think that they should all give the same result.
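The behaviour Victor describes for the encodings-package function can be checked directly; the other two normalizations live in C (codecs.c, unicodeobject.c) and are not exposed at the Python level. Only ASCII examples are shown here, since the handling of non-ASCII letters has varied across Python versions:

```python
# encodings.normalize_encoding(): runs of whitespace and punctuation
# collapse to a single underscore, but letter case is preserved.
import encodings

print(encodings.normalize_encoding(' utf 8 '))  # utf_8
print(encodings.normalize_encoding('UtF 8'))    # UtF_8
print(encodings.normalize_encoding('utf-8'))    # utf_8
```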
### msg129463 - Marc-Andre Lemburg (lemburg), 2011-02-25 23:06
STINNER Victor wrote:

> We should first make the 3 normalization functions implement the same algorithm and add tests for them [...]
> I don't know whether the normalization functions should be more or less strict, but I think that they should all give the same result.

Please see this message for an explanation of why we have those three functions, why they are different, and what their application space is:

http://bugs.python.org/issue5902#msg129257

This ticket is just about the encodings package's codec search function, not the other two, and I don't want to change semantics, just its performance.
### msg165517 - Serhiy Storchaka (serhiy.storchaka), 2012-07-15 11:16
> I don't know who changed the encodings package's normalize_encoding() function (it wasn't me), but it's a really slow implementation.

See changeset 54ef645d08e4.
### msg220630 - Mark Lawrence (BreamoreBoy), 2014-06-15 13:02
What's the status of this issue, as we've lived with this really slow implementation for well over three years?
### msg220633 - Marc-Andre Lemburg (lemburg), 2014-06-15 13:19
On 15.06.2014 15:02, Mark Lawrence wrote:

> What's the status of this issue, as we've lived with this really slow implementation for well over three years?

I guess it just needs someone to write a patch. Note that encoding lookups are cached, so the slowness only becomes an issue if you look up lots of different encodings.
### msg283266 - Marc-Andre Lemburg (lemburg), 2016-12-15 09:34
Thanks for the patch. Victor has implemented the function in C, AFAIK, so an even better approach would be to expose that function at the Python level and use it in the encodings package.
### msg283271 - STINNER Victor (vstinner), 2016-12-15 09:53
It seems like encodings.normalize_encoding() currently has no unit test! Before modifying it, I would prefer to see a few unit tests:

* " utf 8 "
* "UtF 8"
* "utf8\xE9"
* etc.

Since we are talking about an optimization, I would like to see a benchmark result before/after.

I also would like to test Marc-Andre's idea of exposing the C function _Py_normalize_encoding(). _Py_normalize_encoding() works on a byte string encoded to Latin-1. To implement encodings.normalize_encoding(), we might rewrite the function to work on Py_UCS4 characters, or have a fast version for char*, and a more generic version for UCS2 and UCS4?
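The before/after benchmark Victor asks for might look roughly like the following sketch, timing the current stdlib function against a translate()-based variant on the same inputs (`translate_normalize` and its map are invented for the comparison; absolute numbers are machine-dependent, only the relative shape matters):

```python
# Benchmark sketch: stdlib encodings.normalize_encoding() vs. a
# str.translate()-based variant, over a handful of typical spellings.
import timeit
import encodings

SPELLINGS = [' utf 8 ', 'UtF-8', 'ISO-8859-1', 'latin_1', 'us-ascii']

loop_time = timeit.timeit(
    lambda: [encodings.normalize_encoding(s) for s in SPELLINGS],
    number=10_000)

# ASCII alphanumerics and '.' pass through; other ASCII maps to a space.
_map = {i: chr(i) if chr(i).isalnum() or chr(i) == '.' else ' '
        for i in range(128)}

def translate_normalize(s):
    return '_'.join(s.translate(_map).split())

translate_time = timeit.timeit(
    lambda: [translate_normalize(s) for s in SPELLINGS],
    number=10_000)

print(f'loop:      {loop_time:.3f}s')
print(f'translate: {translate_time:.3f}s')
```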
### msg283333 - STINNER Victor (vstinner), 2016-12-15 16:11
Oh, while reading the Mercurial history, I found a note that I wrote: "It's not exactly the same as encodings.normalize_encoding(): the C function also converts to lowercase." IMHO it's fine to modify encodings.normalize_encoding() to also convert to lower-case.
## History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:57:13 | admin | set | github: 55531 |
| 2022-01-24 23:38:06 | gregory.p.smith | set | nosy: + gregory.p.smith |
| 2016-12-15 16:11:31 | vstinner | set | messages: + msg283333 |
| 2016-12-15 09:53:01 | vstinner | set | messages: + msg283271 |
| 2016-12-15 09:34:52 | lemburg | set | messages: + msg283266; versions: + Python 3.7, - Python 3.4, Python 3.5 |
| 2016-12-15 09:27:10 | BreamoreBoy | set | nosy: - BreamoreBoy |
| 2016-12-15 09:24:30 | methane | set | files: + encoding_normalize_optimize.patch; keywords: + patch |
| 2014-06-15 13:19:56 | lemburg | set | messages: + msg220633 |
| 2014-06-15 13:02:58 | BreamoreBoy | set | nosy: + BreamoreBoy; messages: + msg220630 |
| 2012-07-15 11:16:20 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg165517 |
| 2011-03-01 16:55:06 | jcea | set | nosy: + jcea |
| 2011-02-26 10:09:45 | sdaoden | set | nosy: + sdaoden |
| 2011-02-25 23:06:49 | lemburg | set | nosy: lemburg, belopolsky, vstinner, ezio.melotti; messages: + msg129463 |
| 2011-02-25 23:03:06 | vstinner | set | nosy: + vstinner; messages: + msg129460 |
| 2011-02-25 19:33:57 | belopolsky | link | issue11303 superseder |
| 2011-02-25 16:34:58 | belopolsky | set | nosy: lemburg, belopolsky, ezio.melotti; messages: + msg129389 |
| 2011-02-25 16:12:54 | ezio.melotti | set | nosy: + ezio.melotti, belopolsky |
| 2011-02-25 15:55:31 | lemburg | create | |
