Beautiful Soup Looping Extraction Question
Paul McGuire
ptmcg at austin.rr.com
Mon Mar 24 19:55:41 EDT 2008
On Mar 24, 6:32 pm, Tess <test... at gmail.com> wrote:
> Hello All,
>
> I have a Beautiful Soup question and I'd appreciate any guidance the
> forum can provide.
>

I *know* you're using Beautiful Soup, and I *know* that BS is the de
facto HTML parser/processor library. Buuuuuut, I just couldn't help
myself in trying a pyparsing scanning approach to your problem. See
the program below for a pyparsing treatment of your question.

-- Paul


"""
My goal is to extract all elements where the following is true:
<p align="left"> and <div align="center">.
"""
from pyparsing import makeHTMLTags, withAttribute, keepOriginalText, SkipTo

# NOTE: html must hold the HTML source from the original question
# (it is not defined in this listing)

p,pEnd = makeHTMLTags("P")
p.setParseAction( withAttribute(align="left") )

div,divEnd = makeHTMLTags("DIV")
div.setParseAction( withAttribute(align="center") )

# basic scanner for matching either <p> or <div> with desired attrib value
patt = ( p + SkipTo(pEnd) + pEnd ) | ( div + SkipTo(divEnd) + divEnd )
patt.setParseAction( keepOriginalText )

print "\nBasic scanning"
for match in patt.searchString(html):
    print match[0]

# simplified data access, by adding some results names
patt = ( p + SkipTo(pEnd)("body") + pEnd )("P") | \
    ( div + SkipTo(divEnd)("body") + divEnd )("DIV")
patt.setParseAction( keepOriginalText )

print "\nSimplified field access using results names"
for match in patt.searchString(html):
    if match.P:
        print "P -", match.body
    if match.DIV:
        print "DIV -", match.body

Prints:

Basic scanning
<p align="left">P1</p>
<div align="center">div2a</div>
<div align="center">div2b</div>
<p align="left">P3</p>
<div align="center">div3b</div>
<p align="left">P4</p>
<div align="center">div4b</div>

Simplified field access using results names
P - P1
DIV - div2a
DIV - div2b
P - P3
DIV - div3b
P - P4
DIV - div4b
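
For anyone reading this on Python 3 without pyparsing installed, here is a rough standard-library sketch of the same extraction. To be clear, this is not the pyparsing program above and not Beautiful Soup; it swaps in html.parser.HTMLParser. The names SAMPLE_HTML, WANTED, AlignedTextCollector and extract_aligned are made up for illustration, and the sample markup is invented, since the HTML from the original question is not shown in this message.

```python
from html.parser import HTMLParser

# Invented sample input (assumption): not the HTML from the original question.
SAMPLE_HTML = """
<p align="left">P1</p>
<p align="right">skipped</p>
<div align="center">div2a</div>
<div align="left">skipped too</div>
"""

# Tag name -> required value of its align attribute.
WANTED = {"p": "left", "div": "center"}


class AlignedTextCollector(HTMLParser):
    """Collect (tag, text) pairs for <p align="left"> and <div align="center">."""

    def __init__(self):
        super().__init__()
        self.matches = []    # finished (tag, text) pairs, in document order
        self._tag = None     # tag currently being captured, if any
        self._chunks = []    # text pieces seen inside the captured tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if (self._tag is None and tag in WANTED
                and dict(attrs).get("align") == WANTED[tag]):
            self._tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Capture ends at the first closing tag with the same name.
        if tag == self._tag:
            self.matches.append((tag, "".join(self._chunks)))
            self._tag = None


def extract_aligned(html):
    """Return the (tag, text) pairs found in html."""
    parser = AlignedTextCollector()
    parser.feed(html)
    parser.close()
    return parser.matches


for tag, body in extract_aligned(SAMPLE_HTML):
    print(tag.upper(), "-", body)
```

On the invented sample this prints "P - P1" and then "DIV - div2a". Because the capture stops at the first closing tag of the same name, a matching element that contains a nested element with the same tag name will be cut short, so treat this as a starting point only.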