Issue3665
Created on 2008-08-24 20:33 by georg.brandl, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | ||
|---|---|---|
| File name | Uploaded | Description |
| re_unicode_escapes.diff | georg.brandl, 2008-08-24 20:33 | |
| 3665.patch | ishimoto, 2010-07-11 05:09 | |
| re_unicode_escapes.diff | serhiy.storchaka, 2012-06-01 06:43 | Regenerate georg.brandl's patch for review |
| 3665.patch | serhiy.storchaka, 2012-06-01 06:44 | Regenerate ishimoto's patch for review |
| re_unicode_escapes-2.patch | serhiy.storchaka, 2012-06-17 12:48 | + PEP 393, + cleanup, + tests |
| re_unicode_escapes-3.patch | serhiy.storchaka, 2012-06-18 08:02 | + byte patterns, + tests, + docs |
| Messages (14) | |||
|---|---|---|---|
| msg71861 | Author: Georg Brandl (georg.brandl) * | Date: 2008-08-24 20:33 |
Since \u and \U aren't interpolated in raw strings anymore, the re module should support those escapes in addition to the \x and octal ones it already does. Attached patch. |
|||
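As a rough illustration of the request (a sketch, assuming Python 3.3 or later, where this feature eventually landed): a raw string leaves `\u20ac` as six literal characters, so it is the re module itself that has to interpret the escape. The pattern and sample text below are made up for the example.

```python
import re

# In a raw string the compiler does NOT interpolate \u20ac:
# the text is six characters (backslash, u, 2, 0, a, c).
raw_escape_length = len(r"\u20ac")

# The re module (3.3+) interprets the escape as U+20AC EURO SIGN.
match = re.match(r"\u20ac\d+", "\u20ac42 total")
matched_text = match.group() if match else None

# \U takes exactly eight hex digits and can name non-BMP characters.
astral = re.match(r"\U0001F600", "\U0001F600")
```

Before this change, the same raw pattern would not have been read as a Unicode escape by re, which is the gap the issue describes.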
| msg71864 | Author: Antoine Pitrou (pitrou) * | Date: 2008-08-24 20:49 |
- Check that it also works for chars > 0xFFFF (even in UCS2 builds, at least when the chars are not part of a [character range]).
- What happens with e.g. [\U00010000-\U00010001] on a UCS build? |
|||
| msg71865 | Author: Antoine Pitrou (pitrou) * | Date: 2008-08-24 20:49 |
(in the last sentence, I meant UCS2. Sorry) |
|||
| msg71868 | Author: Georg Brandl (georg.brandl) * | Date: 2008-08-24 20:58 |
These concerns indeed must be handled: On narrow unicode builds, chars >
0xffff must be converted to surrogates. In ranges, they should raise an
error.
Additionally, this should at least raise an error too:
>>> re.compile("[\U00100000]").match("\U00100000").group()
'\udbc0'
|
|||
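To see where the '\udbc0' above comes from: on a narrow (UCS2) build a code point above 0xFFFF is stored as two UTF-16 surrogates, so a one-character class can end up matching only the first half. A minimal sketch of the standard UTF-16 arithmetic follows; the helper name `to_surrogate_pair` is hypothetical and is not taken from any of the attached patches.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# U+100000 from the example above: the high surrogate is 0xDBC0.
high, low = to_surrogate_pair(0x100000)
```

The high surrogate computed for U+100000 is 0xDBC0, the same value the interpreter session returned as a lone character.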
| msg109961 | Author: Atsuo Ishimoto (ishimoto) * | Date: 2010-07-11 05:09 |
Here's an updated patch for the py3k branch. As per Georg's comment, I added a check for code points in character ranges and conversion to surrogate pairs. I also added a check that raises an exception if the code point is > 0x10ffff. I would like English speakers to fix the error messages in the patch. |
|||
| msg138219 | Author: Éric Araujo (eric.araujo) * | Date: 2011-06-12 20:30 |
FYI,
+ raise error("bogus escape: %s" % repr(escape))
can be written simply as
+ raise error("bogus escape: %r" % escape)
|
|||
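A quick check of that equivalence, with one caveat: `%r` applied directly to a tuple unpacks it as format arguments, so the two spellings only agree when `escape` is a plain string (as assumed here). The sample value is hypothetical.

```python
escape = "\\q"  # a backslash followed by "q"

via_repr = "bogus escape: %s" % repr(escape)
via_r = "bogus escape: %r" % escape
same = via_repr == via_r

# Wrapping the value in a one-element tuple is safe for any type.
safe = "bogus escape: %r" % (escape,)
```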
| msg162052 | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-01 06:25 |
I don't think it is worth targeting this for 2.7 and 3.2 (it's a new feature, not a bugfix), but for 3.3 it will be very useful. Since PEP 393, conversion to surrogate pairs is no longer relevant. |
|||
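A short sketch of why surrogate handling drops out after PEP 393 (assuming Python 3.4 or later for `re.fullmatch`): every str element is a full code point, so a character above 0xFFFF has length 1 and can serve as an endpoint of a character range, which covers the earlier [\U00010000-\U00010001] question for current builds.

```python
import re

ch = "\U00100000"
length = len(ch)  # one code point, no surrogate pair

# The one-character class now matches the whole character.
whole = re.compile(r"[\U00100000]").match(ch).group()

# Non-BMP characters work as range endpoints.
astral_range = r"[\U00010000-\U00010001]"
in_range = re.fullmatch(astral_range, "\U00010001") is not None
out_of_range = re.fullmatch(astral_range, "\U00010002") is not None
```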
| msg162830 | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-14 21:23 |
Georg, Atsuo, how are you? |
|||
| msg163065 | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-17 12:48 |
Here is an updated patch (conforming with PEP 393). In addition, the octal and hexadecimal escape handling is cleaned up, and the incorrect error message for hexadecimal escapes is fixed. New tests for octal and hexadecimal escapes are added. |
|||
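For readers who want to probe the escape validation themselves, here is a hedged sketch (assuming Python 3.3 or later; the helper `compiles` is hypothetical and not part of the patch). It only checks whether a pattern is accepted, since the exact error message text varies between versions.

```python
import re


def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


results = {
    "four hex digits after \\u": compiles(r"\u0041"),
    "too few digits after \\u": compiles(r"\u12"),
    "\\U above 0x10FFFF": compiles(r"\U00110000"),
    "two hex digits after \\x": compiles(r"\x41"),
    "one hex digit after \\x": compiles(r"\x4"),
}
```

Incomplete escapes and code points above 0x10FFFF are rejected, while well-formed ones compile.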
| msg163094 | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-18 08:02 |
I forgot about byte patterns. Here is an updated patch. |
|||
| msg163580 | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-23 11:23 |
Any chance to commit the patch today and to get this feature in Python 3.3? |
|||
| msg163584 | Author: Roundup Robot (python-dev) | Date: 2012-06-23 11:32 |
New changeset b1dbd8827e79 by Antoine Pitrou in branch 'default':
Issue #3665: \u and \U escapes are now supported in unicode regular expressions.
http://hg.python.org/cpython/rev/b1dbd8827e79 |
|||
| msg163585 | Author: Antoine Pitrou (pitrou) * | Date: 2012-06-23 11:33 |
> Any chance to commit the patch today and to get this feature in Python 3.3?
Thanks for reminding us! It's now in 3.3. |
|||
| msg163590 | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-23 11:48 |
Thank you for the quick response. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:38 | admin | set | github: 47915 |
| 2012-06-23 11:48:13 | serhiy.storchaka | set | messages: + msg163590 |
| 2012-06-23 11:33:41 | pitrou | set | status: open -> closed; resolution: fixed; messages: + msg163585; stage: commit review -> resolved |
| 2012-06-23 11:32:54 | python-dev | set | nosy: + python-dev; messages: + msg163584 |
| 2012-06-23 11:28:36 | pitrou | set | assignee: pitrou; stage: patch review -> commit review |
| 2012-06-23 11:23:04 | serhiy.storchaka | set | messages: + msg163580 |
| 2012-06-18 08:02:24 | serhiy.storchaka | set | files: + re_unicode_escapes-3.patch; messages: + msg163094 |
| 2012-06-17 12:48:06 | serhiy.storchaka | set | files: + re_unicode_escapes-2.patch; messages: + msg163065 |
| 2012-06-14 21:23:59 | serhiy.storchaka | set | messages: + msg162830 |
| 2012-06-01 06:44:38 | serhiy.storchaka | set | files: + 3665.patch |
| 2012-06-01 06:43:52 | serhiy.storchaka | set | files: + re_unicode_escapes.diff |
| 2012-06-01 06:37:02 | serhiy.storchaka | set | files: - 3665.patch |
| 2012-06-01 06:36:47 | serhiy.storchaka | set | files: - re_unicode_escapes.diff |
| 2012-06-01 06:36:02 | serhiy.storchaka | set | files: + 3665.patch |
| 2012-06-01 06:35:08 | serhiy.storchaka | set | files: + re_unicode_escapes.diff |
| 2012-06-01 06:25:29 | serhiy.storchaka | set | versions: - Python 2.7, Python 3.2; nosy: + serhiy.storchaka; messages: + msg162052; components: + Regular Expressions, Unicode |
| 2011-11-29 06:16:10 | ezio.melotti | set | nosy: + mrabarnett |
| 2011-07-21 05:14:12 | ezio.melotti | set | keywords: + needs review; stage: patch review |
| 2011-06-12 20:30:55 | eric.araujo | set | nosy: + eric.araujo; messages: + msg138219 |
| 2011-06-12 18:32:20 | terry.reedy | set | versions: + Python 3.2, Python 3.3, - Python 3.1 |
| 2010-08-04 14:38:30 | ezio.melotti | set | nosy: + ezio.melotti |
| 2010-07-11 05:09:51 | ishimoto | set | files: + 3665.patch; nosy: + ishimoto; messages: + msg109961 |
| 2008-09-27 14:27:18 | timehorse | set | versions: + Python 3.1, Python 2.7, - Python 3.0 |
| 2008-09-27 14:20:42 | timehorse | set | nosy: + timehorse |
| 2008-08-24 20:58:27 | georg.brandl | set | messages: + msg71868 |
| 2008-08-24 20:49:33 | pitrou | set | messages: + msg71865 |
| 2008-08-24 20:49:11 | pitrou | set | nosy: + pitrou; messages: + msg71864 |
| 2008-08-24 20:33:51 | georg.brandl | create | |

