Issue6561
Created on 2009-07-24 10:48 by mark.dickinson, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files
| File name | Uploaded | Description |
|---|---|---|
| issue6561.patch | mark.dickinson, 2009-07-24 16:36 | |
| Messages (8) | |||
|---|---|---|---|
| msg90878 | Author: Mark Dickinson (mark.dickinson) |
Date: 2009-07-24 10:47 | |
In Python 3, or in Python 2 with the re.UNICODE flag, it appears that
the regex r'\d' matches all unicode characters with category either 'Nd'
(Number, Decimal Digit) or 'No' (Number, Other), but not characters in
category 'Nl' (Number, Letter):
Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>
I believe (but am not 100% sure) that r'\d' should only match characters
in category 'Nd'. To back up this belief:
(1) int and float currently accept characters in category 'Nd' but not
'No'; it would seem useful for '\d' to match those characters that are
accepted by int, so that e.g., something matched with '\d+' could be
directly passed to int. (This came up in a #python-dev discussion
about whether the Decimal type should accept other unicode digits;
that's a separate issue, though.)
(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches
only characters in category 'Nd'
(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at
http://unicode.org/unicode/reports/tr18/ recommends that '\d' should
correspond to \p{gc=Decimal_Number}
Marc-André, do you have any opinion on this?
It's probably slightly dangerous to change this in 2.6 or 3.1; I'm
proposing that '\d' should be modified to accept only characters of
category 'Nd' in 2.7 and 3.2.
(Thanks Ezio Melotti for finding all the references above and doing Perl
testing!)
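
The consistency argument in point (1) can be checked with a short script. This is only a sketch, assuming a Python 3.4 or later interpreter (for `re.fullmatch`) that already includes the change proposed here. The helper name `digit_report` and the extra sample characters U+0663 and U+2163 are illustrative and do not come from the report or the patch.

```python
import re
import unicodedata


def digit_report(ch):
    # Hypothetical helper: return the Unicode category of ch, whether
    # r'\d' matches it, and whether int() accepts it.
    matched = re.fullmatch(r'\d', ch) is not None
    try:
        int(ch)
        accepted = True
    except ValueError:
        accepted = False
    return unicodedata.category(ch), matched, accepted


# One ASCII digit, one non-ASCII 'Nd' digit, the 'No' character from the
# report, and one 'Nl' character (a Roman numeral).
for ch in ['7', '\u0663', '\u2781', '\u2163']:
    print(ascii(ch), unicodedata.name(ch), digit_report(ch))
```

With the proposed behaviour, the "matched" and "accepted" columns should agree for every character, which is what makes it safe to pass a `\d+` match straight to `int`.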
|
|||
| msg90885 | Author: Mark Dickinson (mark.dickinson) |
Date: 2009-07-24 14:51 | |
Patch against py3k. |
|||
| msg90888 | Author: Mark Dickinson (mark.dickinson) |
Date: 2009-07-24 16:36 | |
New patch; same as before, but includes clarification to the documentation. |
|||
| msg90927 | Author: Antoine Pitrou (pitrou) |
Date: 2009-07-25 17:23 | |
This sounds reasonable to me. |
|||
| msg90929 | Author: Ezio Melotti (ezio.melotti) |
Date: 2009-07-25 18:01 | |
This seems to me quite redundant:
+ Matches any Unicode decimal digit; more specifically, matches
+ any character in Unicode category [Nd] (Number, Decimal Digit).
+ This includes ``[0-9]``, and also many other digit characters.
I suggest something like:
Matches the decimal digits ``[0-9]`` and all the characters that belong
to the Unicode category Nd (Number, Decimal Digit).
Two more minor details: instead of '\d', I'd use '^\d$' and instead of
self.assertEqual(re.match('\d', x), None)
self.assertIsNone(re.match('\d', x)).
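
For illustration, the two suggestions above could be combined as follows. This is a hypothetical test case, not the test from the attached patch; the class and method names are invented, and it assumes an interpreter where `\d` already matches only category 'Nd'.

```python
import re
import unittest


class DigitClassTest(unittest.TestCase):
    # Hypothetical test case; not taken from issue6561.patch.

    def test_no_category_not_matched(self):
        # U+2781 (DINGBAT CIRCLED SANS-SERIF DIGIT TWO) is in category
        # 'No', so the anchored pattern should not match it.
        self.assertIsNone(re.match(r'^\d$', '\u2781'))

    def test_nd_category_matched(self):
        # U+0663 (ARABIC-INDIC DIGIT THREE) is in category 'Nd'.
        self.assertIsNotNone(re.match(r'^\d$', '\u0663'))
```

Saved in a module, this can be run with `python -m unittest`. `assertIsNone` gives a clearer failure message than comparing against `None` with `assertEqual`.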
|
|||
| msg90971 | Author: R. David Murray (r.david.murray) |
Date: 2009-07-27 02:23 | |
It may be redundant, but it is also more technically accurate. I'm -0 on your proposed rephrasing, and trust Mark to make the right decision :) |
|||
| msg91012 | Author: Mark Dickinson (mark.dickinson) |
Date: 2009-07-28 17:23 | |
[ezio.melotti]
> I suggest something like:
> Matches the decimal digits ``[0-9]`` and all the characters that belong
> to the Unicode category Nd (Number, Decimal Digit).
Hmm. I don't like this because it suggests (to me) that the characters
[0-9] don't belong to category [Nd]. I agree the previous version was
clunky, though. I've shortened it some; if anyone else wants to work on
the wording please feel free. It might be nice to annotate each of these
character classes (\w, \s) with the Unicode character categories that they
correspond to.
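
One way to gather the categories for such annotations would be to scan every code point and tally the general category of each character a class matches. The following is a rough sketch under that assumption, for Python 3.4 or later; `categories_matched` is a made-up helper name, and the result depends on the Unicode database version bundled with the interpreter.

```python
import re
import sys
import unicodedata
from collections import Counter


def categories_matched(pattern):
    # Hypothetical helper: count, per Unicode general category, the
    # single characters that the given pattern matches in full.
    regex = re.compile(pattern)
    counts = Counter()
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        if regex.fullmatch(ch):
            counts[unicodedata.category(ch)] += 1
    return counts


# Each scan covers the whole code point range, so this takes a moment.
for name in (r'\d', r'\s', r'\w'):
    print(name, dict(categories_matched(name)))
```

The output is a survey of current behaviour rather than a specification, so any annotation derived from it would still need checking against the `re` implementation.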
> Two more minor details: instead of '\d', I'd use '^\d$' and instead of
> self.assertEqual(re.match('\d', x), None)
> self.assertIsNone(re.match('\d', x)).
Thanks. Changes applied.
Committed to py3k, r74237. Leaving open for backport to trunk.
|
|||
| msg91018 | Author: Mark Dickinson (mark.dickinson) |
Date: 2009-07-28 21:24 | |
Backported to trunk in r74240. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:51 | admin | set | github: 50810 |
| 2009-07-28 21:24:48 | mark.dickinson | set | status: open -> closed messages: + msg91018 |
| 2009-07-28 17:23:36 | mark.dickinson | set | stage: patch review -> resolved messages: + msg91012 versions: - Python 3.2 |
| 2009-07-27 02:23:07 | r.david.murray | set | nosy: + r.david.murray messages: + msg90971 |
| 2009-07-25 18:01:50 | ezio.melotti | set | priority: normal keywords: + needs review messages: + msg90929 stage: test needed -> patch review |
| 2009-07-25 17:23:37 | pitrou | set | nosy: + pitrou messages: + msg90927 |
| 2009-07-24 16:36:43 | mark.dickinson | set | files: - issue6561.patch |
| 2009-07-24 16:36:30 | mark.dickinson | set | files: + issue6561.patch messages: + msg90888 |
| 2009-07-24 14:51:50 | mark.dickinson | set | files: + issue6561.patch keywords: + patch messages: + msg90885 |
| 2009-07-24 11:58:04 | eric.smith | set | nosy: + eric.smith |
| 2009-07-24 10:48:00 | mark.dickinson | create | |
