Parsing HTML - modify URLs
Fuzzyman
michael at foord.net
Thu Jul 8 04:52:18 EDT 2004
More information about the Python-list mailing list
richard <richardjones at optushome.com.au> wrote in message news:<40ec817a$0$25460$afc38c87 at news.optusnet.com.au>...
>
> michael at foord.net (Fuzzyman) writes:
> >> "Robert Brewer" <fumanchu at amor.org> wrote in message
> >> news:<mailman.69.1089211879.5135.python-list at python.org>...
> >> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
> >> >
> >> > http://www.crummy.com/software/BeautifulSoup/
> >>
> >> It talks about 'walkin the parse tree'... which is a bit more magic
> >> than I want... I just want to modify URLs in tags... which means I
> >> mainly want to extract the HTML unchanged and also modify a few tags -
> >> HTMLParser is quite good at this- but dies *horribly* at bad HTML... I
> >> may have to try beautiful soup though :-)
>
> From the BeautifulSoup page:
>
> "You can modify a Tag or NavigableText in place. Printing it out as a
> string will print the new markup text."
>
> And really, it handles *any* HTML, no matter how crappy - I'm using it to
> deal with pages that have random <span> and </span> in them with no
> matching end / start tags. Eugh.
>
> Once you've written rewrite_url(), this will do the job on the BeautifulSoup
> side:
>
> soup = BeautifulSoup()
> soup.feed(source_html)
> for tag, attr in (('img', 'src'), ('a', 'href')):
>     for tag in soup(tag):
>         if tag.get(attr):
>             tag[attr] = rewrite_url(tag[attr])
> print soup
>
>
> Richard

Haha - just switched to BS and so far it works like a dream... building
a CGI proxy for escaping restricted/censored internet environments...

Thanks for the help.

Regards,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html
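
The quoted snippet leaves rewrite_url() unwritten. For a CGI proxy it could look roughly like the sketch below. This is only an illustration in modern Python 3 using the standard library, not the code either poster used: PROXY_ENDPOINT, the "url" query parameter, PASSTHROUGH_SCHEMES and the make_rewriter() helper are all assumptions made up for the example.

```python
from urllib.parse import quote, urljoin, urlsplit

# Hypothetical address of the proxy CGI script (an assumption, not from the thread).
PROXY_ENDPOINT = "http://proxy.example/cgi-bin/proxy.py"

# Schemes the proxy cannot fetch; such links are left untouched.
PASSTHROUGH_SCHEMES = {"mailto", "javascript", "data"}


def make_rewriter(base_url, proxy=PROXY_ENDPOINT):
    """Build a one-argument rewrite_url() for a page fetched from base_url."""

    def rewrite_url(url):
        url = url.strip()
        # Leave empty values and in-page anchors alone.
        if not url or url.startswith("#"):
            return url
        if urlsplit(url).scheme.lower() in PASSTHROUGH_SCHEMES:
            return url
        # Resolve relative links against the page they came from.
        absolute = urljoin(base_url, url)
        # Percent-encode the whole target so it travels as one query parameter.
        return proxy + "?url=" + quote(absolute, safe="")

    return rewrite_url
```

With this in place, rewrite_url = make_rewriter(page_url) would give the single-argument function that the loop above calls for each 'src' and 'href' value, where page_url stands for the address the HTML was fetched from.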