Matching XML Tag Contents with Regex
Diez B. Roggisch
deets at nospam.web.de
Tue Dec 11 13:08:05 EST 2007
Chris wrote:
> On Dec 11, 11:41 am, garage <xmikeda... at gmail.com> wrote:
>> > Is what I'm trying to do possible with Python's Regex library? Is
>> > there an error in my Regex?
>>
>> Search for '*?' on http://docs.python.org/lib/re-syntax.html.
>>
>> To get around the greedy single match, you can add a question mark
>> after the asterisk in the 'content' portion of the markup. This
>> causes it to take the shortest match, instead of the longest, e.g.
>>
>> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*
>>
>> There's still some funkiness in the regex and logic, but this gives
>> you the three matches
>
> Thanks, that's pretty close to what I was looking for. How would I
> filter out tags that don't have certain text in the contents? I'm
> running into the same issue again. For instance, if I use the regex:
>
> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*
>
> each match will include "targettext". However, some matches will still
> include </%(tagName)s)>, presumably from the tags which didn't contain
> targettext.

Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse
& access HTML.

Diez
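
For illustration, here is a minimal sketch of the parser-based approach. It is not code from this thread: it uses the standard library's xml.etree.ElementTree as a stand-in (lxml.etree offers a largely compatible interface), and the function name `elements_containing`, the `SAMPLE` document, and the tag name `item` are all made up for the example. It assumes the input is well-formed XML; for messy real-world HTML, lxml.html or BeautifulSoup would be the better fit.

```python
import xml.etree.ElementTree as ET


def elements_containing(xml_text, tag_name, target_text):
    """Return every <tag_name> element whose text content contains target_text.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    matches = []
    for elem in root.iter(tag_name):
        # itertext() yields the element's own text plus that of all descendants,
        # so nested markup inside the tag is handled as well.
        content = "".join(elem.itertext())
        if target_text in content:
            matches.append(elem)
    return matches


# Made-up sample document, assumed to be well-formed XML.
SAMPLE = """<root>
  <item id="1">first targettext here</item>
  <item id="2">nothing relevant</item>
  <item id="3">nested <b>targettext</b> too</item>
  <other id="4">targettext in a different tag</other>
</root>"""

if __name__ == "__main__":
    for elem in elements_containing(SAMPLE, "item", "targettext"):
        print(elem.get("id"), ET.tostring(elem, encoding="unicode").strip())
```

Because the parser tracks where each element starts and ends, the filter only ever sees the contents of one tag at a time, so a match cannot spill across a closing tag into a neighbouring element the way the regex matches above do.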