Strip HTML tags from downloaded files
Walter Dörwald
walter at livinglogic.de
Thu Dec 6 08:02:52 EST 2001
More information about the Python-list mailing list
Thomas Pham wrote:

> When I use urlretrieve to download a file from the web, the raw text file
> has HTML tags embedded at the beginning and the end of the file.
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <HTML>
> <HEAD>
>
>
>
> </PRE>
> </BODY></HTML>
>
> Is there any way to strip all the HTML tags from the file?

You could do this with XIST (http://www.livinglogic.de/Python/xist/).
Code might look like this:

from xist import parsers
from xist.ns import html

doc = parsers.parseTidyURL("http://www.freshmeat.net/", defaultEncoding="latin-1")
string = doc.find(type=html.pre, searchchildren=1)[0].asPlainString()
print string

HTH,
   Walter Dörwald
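For readers without XIST installed, the same idea can be roughly sketched with only the standard library. This is not the XIST approach shown above: it assumes Python 3 and its html.parser module, and the names TextExtractor, strip_tags and sample are made up for illustration. The sample string only mimics the shape of the quoted file, since its real contents are not shown in the message.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, ignoring tags and <script>/<style> bodies."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <SCRIPT> arrives as "script".
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(markup):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# Hypothetical stand-in for the downloaded file described in the question.
sample = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
    "<HTML>\n<HEAD>\n</HEAD>\n<BODY><PRE>\n"
    "line one\nline &amp; two\n"
    "</PRE>\n</BODY></HTML>\n"
)

if __name__ == "__main__":
    print(strip_tags(sample).strip())
```

Unlike the XIST snippet, this keeps the text of every element rather than only the first PRE block, so a page with navigation or a title would need extra filtering. To apply it to a file saved by urlretrieve, read the file into a string first and pass that string to strip_tags.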