Delete h2 until you reach the next h2 in beautifulsoup
rosefox911 at gmail.com
Sun Nov 6 18:24:02 EST 2016
More information about the Python-list mailing list
On Sunday, November 6, 2016 at 1:27:48 AM UTC-4, rosef... at gmail.com wrote:
> Considering the following html:
>
> <h2 id="example">cool stuff</h2>
> <ul>
>   <li>hi</li>
> </ul>
> <div>
>   <h2 id="cool"></h2>
>   <ul><li>zz</li></ul>
> </div>
>
> and the following list:
>
> ignore_list = ['example','lalala']
>
> My goal is: while going through the HTML using Beautifulsoup, if I find an h2 that has an ID that is in my list (ignore_list), I should delete all the ul and lis under it until I find another h2. I would then check if the next h2 was in my ignore list; if it is, delete all the ul and lis until I reach the next h2 (or if there are no h2s left, delete the ul and lis under the current one and stop).
>
> How I see the process going: you read all the h2s from top to bottom in the DOM. If the id for any of those is in the ignore_list, then delete all the ul and li under the h2 until you reach the NEXT h2. If there is no next h2, then delete the ul and li and stop.
>
> Here is the full HTML I am trying to work with: http://pastebin.com/Z3ev9c8N
>
> I am trying to delete all the ul and lis after "See_also". How would I accomplish this in Python?

I got it working with the following solution:

    # Remove content I don't want.
    # `body` is the parsed BeautifulSoup element; `ignore_list` holds the
    # heading texts to match (both are defined earlier in my script).
    for element in body.find_all('h2'):
        try:
            # Compare on the heading text, minus the "[edit]" link text
            current_h2 = element.get_text().replace('[edit]', '')
            #print(current_h2)
            if current_h2 in ignore_list:
                next_div = element.find_next_sibling('div')
                if next_div is not None:
                    next_div.decompose()
                next_ul = element.find_next_sibling('ul')
                if next_ul is not None:
                    next_ul.decompose()
        except (AttributeError, TypeError):
            # Skip headings that cannot be processed
            continue

Note that this removes only the first div sibling and the first ul sibling after each matching h2, and it matches on the heading text rather than on the id.
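For the goal as originally stated (match on the h2 id, and remove every ul up to the next h2), something like the following might work. It is only a sketch under a few assumptions: the helper name strip_lists_under is made up for illustration, matching is done on the id attribute, and the sample HTML is the small snippet from the question rather than the full pastebin page.

```python
from bs4 import BeautifulSoup


def strip_lists_under(soup, ignore_ids):
    """Remove each <ul> that follows an ignored <h2>, up to the next <h2>.

    Hypothetical helper: matches on the h2 id attribute.
    """
    for h2 in soup.find_all("h2"):
        if h2.get("id") not in ignore_ids:
            continue
        doomed = []
        # find_all_next() walks forward in document order, so it also sees
        # an h2 or ul that is nested inside a later <div>.
        for tag in h2.find_all_next(["h2", "ul"]):
            if tag.name == "h2":
                break  # reached the next heading: stop collecting
            doomed.append(tag)
        for tag in doomed:
            # extract() detaches the tag (and its <li> children) from the tree
            tag.extract()
    return soup


html = (
    '<h2 id="example">cool stuff</h2><ul><li>hi</li></ul>'
    '<div><h2 id="cool"></h2><ul><li>zz</li></ul></div>'
)
soup = BeautifulSoup(html, "html.parser")
strip_lists_under(soup, {"example", "lalala"})
print(soup)
```

The difference from find_next_sibling() is that the walk does not stop at sibling boundaries, so the h2 inside the div still counts as "the next h2" and the list under it is left alone.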