bpo-31714: Improved regular expression documentation. by serhiy-storchaka · Pull Request #3907 · python/cpython

ezio-melotti

I added a few comments.
There are also a couple of paragraphs with long lines that you could rewrap.

lowercasing doesn't take the current locale into account; it will if you also
set the :const:`LOCALE` flag.
letters, too. Full Unicode matching also works unless the :const:`re.ASCII`
flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match

The phrase "disable non-ASCII matches" makes it sound like it won't match any non-ASCII characters, but that is false:

>>> re.match('负鼠', '负鼠', re.ASCII)
<_sre.SRE_Match object; span=(0, 2), match='负鼠'>

I think it would be more correct to say that regex sets will only match characters in the ASCII range.

This was a copy from existing re module documentation. :(

regex sets will only match characters in the ASCII range.

This doesn't sound right either. Regex sets can match characters outside the ASCII range even with the re.ASCII flag:

>>> re.match('[耀-鿐]+', '负鼠', re.ASCII)
<re.Match object; span=(0, 2), match='负鼠'>

But case-insensitive matching works only in the ASCII range. 'é' doesn't match 'É' with the re.ASCII flag.
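
A minimal sketch of the distinction drawn above, assuming Python 3.7 or later. The sample characters are illustrative and not taken from the patch. It suggests that re.ASCII leaves literal non-ASCII characters matchable but limits case folding to the ASCII range:

```python
import re

# A literal non-ASCII character still matches itself under re.ASCII.
literal = re.fullmatch('\xe9', '\xe9', re.ASCII)               # 'é' vs 'é'

# Case-insensitive matching of non-ASCII letters works in Unicode mode...
unicode_fold = re.fullmatch('\xe9', '\xc9', re.IGNORECASE)     # 'é' vs 'É'

# ...but not once re.ASCII is also set.
ascii_fold = re.fullmatch('\xe9', '\xc9', re.IGNORECASE | re.ASCII)

# ASCII letters are still folded under re.ASCII.
ascii_letters = re.fullmatch('e', 'E', re.IGNORECASE | re.ASCII)

print(bool(literal), bool(unicode_fold), bool(ascii_fold), bool(ascii_letters))
```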

flag is also used to disable non-ASCII matches. ``[A-Z]`` will also match
letters 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131,
Latin small letter dotless i), 'ſ' (U+017f, Latin small letter long s) and
'K' (U+212a, Kelvin sign) in Unicode mode. ``Spam`` will match ``Spam``,

I think this is wrong: unless I'm mistaken [A-Z] should be limited to upper case ASCII letters, even with re.UNICODE.
Perhaps you meant to use \w?
It would also be better to specify what Unicode categories are matched, instead of just providing a few examples and letting the user figure it out from there.
I think this is already explained below, so a link or a mention to that section is fine.

Note that this is about case-insensitive matching. 'S' matches both 's' and 'ſ'.
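
A short sketch of this point, assuming Python 3.7 or later. The constant names LONG_S and KELVIN are made up for the example:

```python
import re

LONG_S = '\u017f'   # Latin small letter long s
KELVIN = '\u212a'   # Kelvin sign

# Without IGNORECASE, [A-Z] is limited to the 26 ASCII capitals.
plain = re.fullmatch('[A-Z]', LONG_S)

# With IGNORECASE in Unicode mode, 'S' also matches the long s,
# and the set [A-Z] also matches the Kelvin sign.
folded_literal = re.fullmatch('S', LONG_S, re.IGNORECASE)
folded_set = re.fullmatch('[A-Z]', KELVIN, re.IGNORECASE)

# Adding re.ASCII restricts case folding to the ASCII letters again.
ascii_only = re.fullmatch('[A-Z]', KELVIN, re.IGNORECASE | re.ASCII)

print(bool(plain), bool(folded_literal), bool(folded_set), bool(ascii_only))
```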

I did some tests:

>>> import re
>>> unichars = ''.join(chr(cp) for cp in range(0x110000))
>>> ''.join(re.findall('[a-z]', unichars))
'abcdefghijklmnopqrstuvwxyz'
>>> ''.join(re.findall('[A-Z]', unichars))
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> ''.join(re.findall('[a-z]', unichars, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'
>>> ''.join(re.findall('[A-Z]', unichars, re.I))
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK'

Now I understand what the paragraph means: [a-z] and [A-Z] will only match ASCII lower and upper case letters respectively (so 26 chars each), however if re.I is used with either one, since those 4 letters (and only those 4) are valid capitalization of the 26 ASCII letters, they will be matched as well (bringing the total up to 26 + 26 + 4 == 56).
Do you think we should rephrase it to make it clearer?

It would be very good to make it clearer. Now, since you understand what the paragraph means, could you please suggest a clear wording?

This example is not artificial. See bpo-31672. I think this caveat should be documented specially.

When the patterns ``[a-z]`` or ``[A-Z]`` are used in combination with the re.I and re.U flags,
they will match the 52 ASCII letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin
capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F,
Latin small letter long s) and 'K' (U+212A, Kelvin sign).

that take account of language differences. For example, if you're
processing encoded French text, you'd want to be able to write ``\w+`` to
match words, but ``\w`` only matches the character class ``[A-Za-z]`` in
bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.

I would remove the '...' here and perhaps add b'\xe9' and b'\xe7' within parenthesis.

It is not clear why b'\xe9' and b'\xe7' should be matched. But 'é' and 'ç' are French letters, and I have added "bytes corresponding to" to make this phrase Python 3 compatible.

The reason why I suggested that, is because 'é' and 'ç' are Unicode str in Python 3, whereas without quotes they are just letters. The addition of b'\xe9' and b'\xe7' might help clarify what is being matched, but it's not essential.
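
To make the bytes-versus-str point concrete, here is a small sketch. The Latin-1 encoding and the sample word are assumptions chosen for illustration:

```python
import re

word = 'caf\xe9'                   # 'café' as a str
encoded = word.encode('latin-1')   # b'caf\xe9'

# In a bytes pattern, \w only covers ASCII word characters,
# so the byte 0xE9 stops the match.
bytes_full = re.fullmatch(rb'\w+', encoded)
bytes_prefix = re.match(rb'\w+', encoded).group()

# In a str pattern, \w is Unicode-aware by default.
str_full = re.fullmatch(r'\w+', word)

print(bytes_full, bytes_prefix, str_full.group())
```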

bytes patterns; it won't match bytes corresponding to ``'é'`` or ``'ç'``.
If your system is configured properly and a French locale is selected,
certain C functions will tell the program that the byte corresponding
``'é'`` should also be considered a letter.

corresponding to é
I also don't like "certain C functions will tell the program" too much.

Do you suggest removing the quotes?

"certain C functions will tell the program" was already here, it looks correct to me, and I don't know how to improve it.

I suggest removing the quotes, since we are talking about the character, not about a Unicode string.
Fair enough about the wording being there already.

is very unreliable, and it only handles one "culture" at a time anyway;
and it only works with 8-bit locales;
you should use Unicode matching instead, which is the default in Python 3
for Unicode (str) patterns.

... is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales. You should use ...

Since Unicode matching is the default, I wouldn't say "you should use", but just something like "Unicode matching is already enabled by default in Python 3, and it is able to handle different locales/languages."
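
As a rough illustration of that default, assuming Python 3.6 or later. The sample text is invented for the example:

```python
import re

# French, German and Russian words in one str; no flag or locale is set.
text = 'd\xe9j\xe0 vu, Gr\xfc\xdfe, \u043f\u0440\u0438\u0432\u0435\u0442'
words = re.findall(r'\w+', text)

# re.LOCALE is rejected for str patterns; it is only accepted with bytes.
try:
    re.compile(r'\w+', re.LOCALE)
    locale_allowed_on_str = True
except ValueError:
    locale_allowed_on_str = False

print(words, locale_allowed_on_str)
```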

Alternation, or the "or" operator. If A and B are regular expressions,
``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very
Alternation, or the "or" operator. If *A* and *B* are regular expressions,
``A|B`` will match any string that matches either *A* or *B*. ``|`` has very

Why * instead of ``?

Because A and B are not literal regexps, but variables.
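
For readers of the thread, a tiny sketch of what the alternation paragraph describes, with 'Crow' and 'Servo' standing in for the variables A and B (the sample strings are illustrative):

```python
import re

pattern = re.compile('Crow|Servo')

# | has very low precedence: the alternatives are the whole words
# 'Crow' and 'Servo', not 'Cro' + ('w' or 'S') + 'ervo'.
candidates = ['Crow', 'Servo', 'Cro', 'ervo', 'Crowervo']
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

print(results)
```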

matches.

match lowercase letters. Full Unicode matching (such as ``Ü`` matching
``ü``) also works unless the :const:`re.ASCII` flag is also used to disable

is also used to
(you already have one before)

only letters 'A' to 'Z' and 'a' to 'z', but will also match letters 'İ'
(U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small
letter dotless i), 'ſ' (U+017f, Latin small letter long s) and 'K' (U+212a,
Kelvin sign). If the :const:`ASCII` flag is used, only letters 'a' to 'z'

This is duplicated from above, so the same comments apply (maybe it shouldn't be duplicated?).

Could you suggest how to avoid duplication?

I didn't notice the duplicated part was in two separate files, so it's probably OK to leave it.

you should use Unicode matching instead, which is the default in Python 3
for Unicode (str) patterns. This flag can be used only with bytes patterns.
for Unicode (str) patterns.
Correcsponds the inline flag ``(?L)``.

Correcsponds to
(extra c and missing to)

Also applies below.


>>> int_re = r'\d+'
>>> print(re.sub('INT', int_re.replace('\\', r'\\'), r'INT(\.INT)?(e[+-]?INT)?'))
\d+(\.\d+)?(e[+-]?\d+)?

I don't find this example particularly clear. Why would someone want to use re.escape() on the replacement string? Wouldn't using int_re = r'\\d+' (and a normal str.replace on INT) be easier?

I tried to provide a simplified real-world example. In Mailman, re.sub() is used to build a regular expression. They passed a pattern containing \d as the replacement string and got an error when this became invalid. Someone could use re.escape() on the replacement string because the replacement string looks similar to a simple pattern (it expands \n and \1), and this will work as long as the replacement string contains no metacharacters other than a backslash.

I'll replace this example with the better one.
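
For comparison, a sketch of the alternatives discussed in this thread, assuming Python 3.7 or later (where an unknown escape such as \d in a replacement string raises an error). The variable names are illustrative only:

```python
import re

int_re = r'\d+'
template = r'INT(\.INT)?(e[+-]?INT)?'

# Passing the pattern directly as the replacement fails, because
# re.sub() tries to interpret \d as a replacement-string escape.
try:
    re.sub('INT', int_re, template)
    raw_fails = False
except re.error:
    raw_fails = True

# Option 1: double the backslashes, as in the example under review.
via_escape = re.sub('INT', int_re.replace('\\', r'\\'), template)

# Option 2: a callable replacement is inserted verbatim, with no escape processing.
via_callable = re.sub('INT', lambda m: int_re, template)

# Option 3: plain str.replace(), since 'INT' is a fixed string here.
via_str = template.replace('INT', int_re)

print(raw_fails, via_escape, via_callable, via_str)
```

All three options should build the same pattern; which one reads best in the documentation is a judgment call.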