Logged In: YES
user_id=11375
I haven't dug very far into the code, but suspect this isn't
a bug in the regex code.
The pattern uses lots of .*? subpatterns, and this often
means the pattern takes a long time to fail if it isn't
going to match. The regex engine matches the <link> group,
and then there's a .*?, followed by <b>. The engine advances
one character at a time, and each time it sees a <b> it tries
the next .*?, which can scan the rest of the string before
failing. This is O(n**2), where n is the number of characters
in the string being searched, and that string is 93,000
characters long. If you limit the string to 5K or so, the
match fails pretty quickly.
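To see the effect in isolation, here is a minimal sketch. The pattern and the input are made up for illustration (they are not the pattern from this report): a literal anchor followed by two lazy wildcards, searched against a string that contains many <b> candidates but never the final <c>. The helper names build_input and time_failed_search are hypothetical too.

```python
import re
import time

# Hypothetical pattern, NOT the one from the report: a literal anchor
# followed by two lazy wildcards.
PATTERN = re.compile(r"<a>.*?<b>.*?<c>", re.DOTALL)


def build_input(n, closing=""):
    """Return '<a>' followed by n copies of '<b>', plus an optional tail."""
    return "<a>" + "<b>" * n + closing


def time_failed_search(n):
    """Time one search that cannot succeed because '<c>' never appears.

    For each of the n '<b>' candidates, the second lazy wildcard scans the
    rest of the string before giving up, so the work grows roughly as n**2.
    """
    text = build_input(n)
    start = time.perf_counter()
    result = PATTERN.search(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Doubling n should roughly quadruple the time if the cost is quadratic.
    for n in (500, 1000, 2000):
        result, elapsed = time_failed_search(n)
        print(f"n={n:5d}  matched={result is not None}  seconds={elapsed:.4f}")
```

Exact timings depend on the machine and the Python version, so none are quoted here. The thing to look at is how the time grows as n doubles.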
I strongly suggest working with the HTML structure instead of
matching it with a regex. You could run the HTML through tidy
to convert it to XHTML and use ElementTree on the resulting XML.
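A rough sketch of the ElementTree half of that suggestion follows. It assumes the tidy step has already produced well-formed XHTML; the SAMPLE document, the element layout (a link and a <b> inside each <p>), and the extract_pairs helper are all invented for illustration and will differ from the real page.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so lookups need a prefix mapping.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for tidy's output; the real document will differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <p><a href="http://example.com/one">first</a> <b>alpha</b></p>
    <p><a href="http://example.com/two">second</a> <b>beta</b></p>
  </body>
</html>"""


def extract_pairs(xhtml_text):
    """Return (href, bold text) for every <p> that has both an <a> and a <b>."""
    root = ET.fromstring(xhtml_text)
    pairs = []
    for para in root.iterfind(".//x:p", XHTML_NS):
        link = para.find("x:a", XHTML_NS)
        bold = para.find("x:b", XHTML_NS)
        if link is not None and bold is not None:
            pairs.append((link.get("href"), bold.text))
    return pairs


if __name__ == "__main__":
    for href, label in extract_pairs(SAMPLE):
        print(href, label)
```

Walking the tree costs time proportional to the size of the document, and a missing element just yields no result instead of triggering a long failed match. Real tidy output may also contain named entities such as &nbsp;, which the XML parser may reject unless they are handled separately.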