Issue 1566086: RE (regular expression) matching stuck in loop
Created on 2006-09-27 01:23 by fabinator, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| re-loop.py | akuchling, 2006-10-26 19:38 | Test script | ||
| re-loop.py | akuchling, 2006-10-26 19:58 | Correct version -- remove some debugging hacks | ||
| Messages (6) | |||
|---|---|---|---|
| msg30013 - (view) | Author: Fabien Devaux (fabinator) | Date: 2006-09-27 01:23 | |
See the code: http://pastebin.ca/183613 the "finditer()" call don't seems to return. Playing with the re can bypass the problem but it looks like a bug. (I'm really sorry if I did something wrong and didn't notice) note: I can reproduce it with python2.5 |
|||
| msg30014 - (view) | Author: A.M. Kuchling (akuchling) * ![]() |
Date: 2006-10-26 19:38 | |
Logged In: YES user_id=11375 Attaching the test script. I've modified it to save a copy of the HTML page's data so that running the example doesn't require accessing a slow web site repeatedly. |
|||
| msg30015 - (view) | Author: A.M. Kuchling (akuchling) * ![]() |
Date: 2006-10-26 19:53 | |
Logged In: YES user_id=11375 I haven't dug very far into the code, but suspect this isn't a bug in the regex code. The pattern uses lots of .*? subpatterns, and this often means the pattern takes a long time to fail if it isn't going to match. The regex engine matches the <link> group, and then there's a .*?, followed by <b>. The engine looks at every character and if it sees a <b>, tries another .*?. This is O(n**2) where n is the number of character in the string being searched, and that string is 93,000 characters long. If you limit the string to 5K or so, the match fails pretty quickly. I strongly suggest working with the HTML. You could run the HTML through tidy to convert to XHTML and use ElementTree on the resulting XML. |
|||
| msg59628 - (view) | Author: Ralf Schmitt (schmir) | Date: 2008-01-09 21:37 | |
With python 2.6 the program can be interrupted with ctrl-c (see http://bugs.python.org/issue846388). I think this one should be closed as a duplicate of: http://bugs.python.org/issue1662581 |
|||
| msg59632 - (view) | Author: Georg Brandl (georg.brandl) * ![]() |
Date: 2008-01-09 22:37 | |
Done. |
|||
| msg81478 - (view) | Author: Matthew Barnett (mrabarnett) * ![]() |
Date: 2009-02-09 19:40 | |
This problem has been addressed in issue #2636. Extra checks have been added to reduce the amount of backtracking. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:20 | admin | set | github: 44036 |
| 2009-02-09 19:40:28 | mrabarnett | set | nosy:
+ mrabarnett messages: + msg81478 |
| 2008-01-09 22:37:50 | georg.brandl | set | status: open -> closed resolution: duplicate superseder: the re module can perform poorly: O(2**n) versus O(n**2) messages: + msg59632 nosy: + georg.brandl |
| 2008-01-09 21:37:12 | schmir | set | nosy:
+ schmir messages: + msg59628 |
| 2006-09-27 01:23:22 | fabinator | create | |

