Issue 29995: re.escape() escapes too much
Created on 2017-04-05 14:17 by serhiy.storchaka, last changed 2019-01-31 14:58 by LtWorf. This issue is now closed.
| Pull Requests | ||
|---|---|---|
| URL | Status | Linked |
| PR 1007 | merged | serhiy.storchaka, 2017-04-05 14:23 |
| PR 2114 | merged | terry.reedy, 2017-06-11 17:35 |
| Messages (5) | |||
|---|---|---|---|
| msg291177 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2017-04-05 14:17 | |
re.escape() escapes all characters except ASCII letters, digits and '_'. This is excessive: it makes escaping and compiling slower and makes the pattern less human-readable. Characters "!\"%&\',/:;<=>@_`~" as well as non-ASCII characters are always literal in a regular expression and don't need escaping.
The proposed patch makes re.escape() escape only the minimal set of characters that can have special meaning in regular expressions: the special characters ".\\[]{}()*+?^$|", "-" (a range in a character set), "#" (starts a comment in verbose mode) and ASCII whitespace (ignored in verbose mode).
The null character no longer needs special escaping.
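As an illustration of the rules described above, here is a minimal sketch of what the new behavior looks like from the caller's side. It assumes CPython 3.7 or later, where this change landed; the `CASES` list is only an illustration and is not part of the patch. The sketch deliberately avoids "&" and "~", because released 3.7 also escapes those two characters.

```python
import re

# (input, expected result) pairs, assuming CPython 3.7+.
# Earlier versions escape every character except ASCII letters, digits and "_".
CASES = [
    ("a.b*c", r"a\.b\*c"),    # regex metacharacters are still escaped
    ("1-2", r"1\-2"),         # "-" can denote a range inside a character set
    ("a b#c", r"a\ b\#c"),    # whitespace and "#" matter in verbose mode
    ("x=1;y<2", "x=1;y<2"),   # always-literal punctuation is left alone
    ("їжак", "їжак"),         # non-ASCII characters are left alone
    ("\0", "\0"),             # the null character gets no special treatment
]

for text, expected in CASES:
    print(repr(text), "->", repr(re.escape(text)), "| expected:", repr(expected))
```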
The patch also increases the speed of re.escape() (even if it produces the same result).
$ ./python -m perf timeit -s 'from re import escape; s = "()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 42.2 us +- 0.8 us
Patched: Median +- std dev: 11.4 us +- 0.1 us
$ ./python -m perf timeit -s 'from re import escape; s = b"()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 38.7 us +- 0.7 us
Patched: Median +- std dev: 18.4 us +- 0.2 us
$ ./python -m perf timeit -s 'from re import escape; s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 40.3 us +- 0.5 us
Patched: Median +- std dev: 33.1 us +- 0.6 us
$ ./python -m perf timeit -s 'from re import escape; s = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 54.4 us +- 0.7 us
Patched: Median +- std dev: 40.6 us +- 0.5 us
$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 156 us +- 3 us
Patched: Median +- std dev: 43.5 us +- 0.5 us
$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode()' -- --duplicate 100 'escape(s)'
Unpatched: Median +- std dev: 200 us +- 4 us
Patched: Median +- std dev: 77.0 us +- 0.6 us
It also speeds up compilation of the escaped string:
$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"; p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched: Median +- std dev: 1.96 ms +- 0.02 ms
Patched: Median +- std dev: 1.16 ms +- 0.02 ms
$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode(); p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched: Median +- std dev: 3.69 ms +- 0.04 ms
Patched: Median +- std dev: 2.13 ms +- 0.03 ms
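The measurements above use the third-party perf module. A rough way to repeat them with only the standard library is sketched below. This is an approximation under stated assumptions, not the method used in this message: it uses timeit instead of perf, and it calls re.purge() before each re.compile() to avoid the pattern cache rather than importing sre_compile directly, so the compile timings include the cost of the purge. The helper names (`best_usec`, `compile_uncached`, `run_benchmarks`) are hypothetical.

```python
import re
import timeit

# Sample inputs taken from the benchmark commands above.
SAMPLES = {
    "special": "()[]{}?*+-|^$\\.# \t\n\r\v\f",
    "alnum": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "cyrillic": "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ",
}


def best_usec(func, number, repeat):
    """Return the best observed time per call of func(), in microseconds."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number * 1e6


def compile_uncached(pattern):
    """Compile pattern after clearing the re module's pattern cache."""
    re.purge()
    return re.compile(pattern)


def run_benchmarks(number=1000, repeat=3):
    """Time re.escape() and uncached compilation for str and bytes samples."""
    results = {}
    for name, text in SAMPLES.items():
        for label, sample in ((name, text), (name + " (bytes)", text.encode())):
            escaped = re.escape(sample)
            results[label] = (
                best_usec(lambda: re.escape(sample), number, repeat),
                best_usec(lambda: compile_uncached(escaped), number, repeat),
            )
    return results


if __name__ == "__main__":
    for label, (escape_us, compile_us) in run_benchmarks().items():
        print(f"{label:18} escape: {escape_us:8.2f} us   compile: {compile_us:8.2f} us")
```

Absolute numbers depend on the interpreter build and the machine, so only the ratio between two interpreters run on the same machine is meaningful.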
|
|||
| msg291624 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2017-04-13 18:06 | |
New changeset 5908300e4b0891fc5ab8bd24fba8fac72012eaa7 by Serhiy Storchaka in branch 'master': bpo-29995: re.escape() now escapes only special characters. (#1007) https://github.com/python/cpython/commit/5908300e4b0891fc5ab8bd24fba8fac72012eaa7 |
|||
| msg295723 | Author: Terry J. Reedy (terry.reedy) | Date: 2017-06-11 17:50 | |
New changeset a895f91a46c65a6076e8c6a28af0df1a07ed60a2 by terryjreedy in branch '3.6': [3.6]bpo-29995: Adjust IDLE test for 3.7 re.escape change [GH-1007] (#2114) https://github.com/python/cpython/commit/a895f91a46c65a6076e8c6a28af0df1a07ed60a2 |
|||
| msg295727 | Author: Terry J. Reedy (terry.reedy) | Date: 2017-06-11 18:32 | |
Serhiy, please nosy me when you change idlelib files. |
|||
| msg334629 | Author: Salvo Tomaselli (LtWorf) | Date: 2019-01-31 14:58 | |
Aaaand this broke my unit tests when moving from 3.6 to 3.7! |
|||
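Tests that compare re.escape() output against a hard-coded string depend on the exact escaping rules and can fail when those rules change between versions, as in the last message. One possible way to make such tests portable across 3.6 and 3.7 is to check what the escaped pattern matches instead of how it is spelled. A minimal sketch follows; the helper name `escaped_pattern_matches_only` is hypothetical.

```python
import re


def escaped_pattern_matches_only(text, other):
    """Return True if re.escape(text) matches text itself but not other.

    The check runs both with and without re.VERBOSE, because whitespace
    and "#" only need escaping in verbose mode.
    """
    pattern = re.escape(text)
    for flags in (0, re.VERBOSE):
        if re.fullmatch(pattern, text, flags) is None:
            return False
        if re.fullmatch(pattern, other, flags) is not None:
            return False
    return True


# Behavioral checks that do not depend on the exact escaped spelling.
print(escaped_pattern_matches_only("a.b c#d", "axb c#d"))
print(escaped_pattern_matches_only("1+1=2", "11=2"))
```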
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2019-01-31 14:58:48 | LtWorf | set | nosy: + LtWorf; messages: + msg334629 |
| 2017-06-11 18:32:41 | terry.reedy | set | messages: + msg295727; versions: + Python 3.6 |
| 2017-06-11 17:50:53 | terry.reedy | set | nosy: + terry.reedy; messages: + msg295723 |
| 2017-06-11 17:35:55 | terry.reedy | set | pull_requests: + pull_request2167 |
| 2017-04-13 18:14:26 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
| 2017-04-13 18:06:45 | serhiy.storchaka | set | messages: + msg291624 |
| 2017-04-12 09:48:16 | serhiy.storchaka | set | assignee: serhiy.storchaka; dependencies: + Add examples for re.escape() |
| 2017-04-05 14:23:39 | serhiy.storchaka | set | pull_requests: + pull_request1175 |
| 2017-04-05 14:17:51 | serhiy.storchaka | create | |
