issue2636-24 : Code : Python

lp:~pythonregexp2.7/python/issue2636-24

Created by TimeHorse and last modified

Currently, the Python regular expression engine drops characters when findall / finditer is used with an expression that contains a zero-width capture group. For example:

>>> [m.groups() for m in re.finditer(r'(^z*)|(\w+)', 'abc')]
[('', None), (None, 'bc')]

The 'a' has been lost because the engine first matches (^z*) with zero width and then consumes the current character (the 'a'). It then matches the rest of the string with (\w+), resulting in 'bc'. There are two parts to the problem. First, the 'a' should not be consumed by the zero-width match of (^z*). But simply not advancing would lead to an infinite sequence of zero-width matches at the same position. So, second, each iteration would need an internal state that indicates whether a zero-width match is allowed at the current position. Initially, any position may match a zero-width expression once; when that same position is entered again, the 'zero-width match' flag would be true and a subsequent zero-width match there would be disallowed. This item is based on the work from Issue 1647489.
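The flag-based iteration described above can be sketched in pure Python. This is only a simplified model, not the engine change made in this branch: the function name `finditer_zero_width_aware` is hypothetical, and the top-level alternation is passed as a list of separate patterns so that the sketch can skip an empty alternative and try the next one. It does not backtrack inside a single alternative.

```python
import re

def finditer_zero_width_aware(alternatives, text):
    """Yield (alternative_index, matched_text) pairs.

    Hypothetical helper: models a top-level alternation as an ordered
    list of patterns and allows one zero-width match per position.
    """
    compiled = [re.compile(a) for a in alternatives]
    pos = 0
    allow_empty = True  # may a zero-width match succeed at `pos`?
    while pos <= len(text):
        found = None
        for index, pattern in enumerate(compiled):
            m = pattern.match(text, pos)
            if m is None:
                continue
            if m.start() == m.end() and not allow_empty:
                # A zero-width match was already reported here; try the
                # next alternative instead of repeating it.
                continue
            found = (index, m)
            break
        if found is None:
            # Nothing (new) matches here: move on to the next position.
            pos += 1
            allow_empty = True
            continue
        index, m = found
        yield index, m.group()
        if m.end() == pos:
            # Zero-width match: stay at the same position, consume
            # nothing, but forbid a second zero-width match here.
            allow_empty = False
        else:
            pos = m.end()
            allow_empty = True

result = list(finditer_zero_width_aware([r'^z*', r'\w+'], 'abc'))
print(result)  # [(0, ''), (1, 'abc')]
```

In this model the zero-width match of `^z*` is reported once at position 0, and `\w+` then matches the whole of 'abc' from that same position, so the 'a' is not lost and the loop still terminates.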

Get this branch:
bzr branch lp:~pythonregexp2.7/python/issue2636-24

Recent revisions

39039. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39038. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39037. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39036. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39035. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39034. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39033. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39032. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39031. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>
39030. By Jeffrey C. "The TimeHorse" Jacobs <email address hidden>

Branch metadata

Branch format: Branch format 6
Repository format: Bazaar pack repository format 1 with rich root (needs bzr 1.0)