difficult regular expression
Chermside, Michael
mchermside at ingdirect.com
Wed Oct 30 10:50:53 EST 2002
More information about the Python-list mailing list
> It's easy to grab the section containing the text "Cats ... foods." (not
> including the Chickens section.) However, I need to get just the items:
> mice, rats, rabbits, marmots.

Not all regular expression engines are equivalent. Well, mathematically many of them ARE equivalent in important ways (and some aren't), but for practical purposes RE engines fall into two categories: those that support basic grep features (plus-or-minus a few features) and those that support perl 5 features (plus-or-minus a few features). Python supports the perl 5 features (give-or-take a few features).

Two of those features are "look-ahead" and "look-behind" assertions. These are zero-width assertions that make the RE match (or not match) based on whether some (subsidiary) RE matches the string starting/ending at that point... but the text matched by the subsidiary RE does NOT become part of the match of the main RE. If you find this description confusing, check out the docs:

    http://python.org/doc/current/lib/re-syntax.html

So I'm going to try to use these to create a RE that solves your problem. (The following is a transcript of my python session, but with the typos and mistakes taken out.)

>>> import re
>>> text = """
... Here is a list of foods and consumers: Dogs eat <chicken>, <rice>,
... <steak>, and other foods. Cats eat <mice>, <rats>, <rabbits>, <marmots>,
... and other foods. Chickens eat <grain>, <corn>, <wheat>, and other
... foods. Wow, that's a lot of eating!
... """
>>> # The text is really all one long line
>>> text = text.replace('\n',' ')
>>> # The final period is escaped so it matches only a literal '.'
>>> re_1 = re.compile(r'Cats eat ((<.*?>, )+)and other foods\.')
>>> matchObj = re_1.search(text)
>>> matchObj.groups()[0]
'<mice>, <rats>, <rabbits>, <marmots>, '
>>> # Okay, that worked. In fact, it might be enough to
>>> # solve the whole problem. But we'll try it with the
>>> # look-ahead/behind assertions anyway.
>>> re_2 = re.compile(r'(?<=Cats eat )((<.*?>, )+)(?=and other foods\.)')
>>> matchObj = re_2.search(text)
>>> # now let's see what the entire pattern matched
>>> matchObj.group(0)
'<mice>, <rats>, <rabbits>, <marmots>, '
>>> # Yep... it works.

Does that help?

-- Michael Chermside
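The question asked for just the item names (mice, rats, rabbits, marmots), while the patterns in this message return the list with its angle brackets and commas still attached. One possible second pass is sketched below. It assumes every item is wrapped in angle brackets with no nesting, and the names cats_list_re, item_re and cat_foods are made up for illustration; they are not part of the session above.

```python
import re

text = (
    "Here is a list of foods and consumers: Dogs eat <chicken>, <rice>, "
    "<steak>, and other foods. Cats eat <mice>, <rats>, <rabbits>, <marmots>, "
    "and other foods. Chickens eat <grain>, <corn>, <wheat>, and other "
    "foods. Wow, that's a lot of eating!"
)

# Step 1: isolate the Cats list. The look-behind and look-ahead anchor the
# match without becoming part of it, so group(0) is only the item list.
cats_list_re = re.compile(r'(?<=Cats eat )(?:<[^<>]*>, )+(?=and other foods\.)')

# Step 2: capture the text between each pair of angle brackets.
item_re = re.compile(r'<([^<>]*)>')


def cat_foods(s):
    """Return the bare item names from the Cats list, or [] if it is absent."""
    match = cats_list_re.search(s)
    if match is None:
        return []
    return item_re.findall(match.group(0))


print(cat_foods(text))  # ['mice', 'rats', 'rabbits', 'marmots']
```

Splitting the work into two passes keeps each pattern simple: a repeated capturing group only remembers its last repetition, so a single pattern cannot hand back all four names as separate groups.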