Extracting xml from html
kyosohma at gmail.com
Tue Sep 18 15:33:40 EDT 2007
More information about the Python-list mailing list
On Sep 18, 1:56 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
> kyoso... at gmail.com wrote:
> > I am attempting to extract some XML from an HTML document that I get
> > returned from a form based web page. For some reason, I cannot figure
> > out how to do this.
> > Here's a sample of the html:
> >
> > <html>
> > <body>
> > lots of screwy text including divs and spans
> > <Row status="o">
> > <RecordNum>1126264</RecordNum>
> > <Make>Mitsubishi</Make>
> > <Model>Mirage DE</Model>
> > </Row>
> > </body>
> > </html>
> >
> > What's the best way to get at the XML? Do I need to somehow parse it
> > using the HTMLParser and then parse that with minidom or what?
>
> lxml makes this pretty easy:
>
>    >>> parser = etree.HTMLParser()
>    >>> tree = etree.parse(the_file_or_url, parser)
>
> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
> tree iteration, ... You will also get plain XML when you serialise it to XML:
>
>    >>> xml_string = etree.tostring(tree)
>
> Note that this doesn't add any namespaces, so you will not magically get valid
> XHTML or something. You could rewrite the tags by hand, though.
>
> Stefan

I got it to work with lxml. See below:

def Parser(filename):
    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)
    events = ("recordnum", "primaryowner", "customeraddress")
    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass
    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

Mike
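
A minimal sketch of the same extraction, written for Python 3 rather than the Python 2 of the post, and assuming lxml is installed. It walks the tree produced by etree.HTMLParser directly instead of serialising it and re-parsing with iterparse; that is a swapped-in simplification, not what the post does. The function name extract_fields, the sample markup, and the PrimaryOwner and CustomerAddress values are hypothetical and only there for illustration; the lowercase tag names follow the comparisons used in the post.

```python
from io import StringIO

from lxml import etree

# Sample page modeled on the one quoted in the post. The PrimaryOwner and
# CustomerAddress elements and their values are made up for illustration.
SAMPLE_HTML = """<html>
<body>
<div>lots of screwy text including divs and spans</div>
<Row status="o">
<RecordNum>1126264</RecordNum>
<PrimaryOwner>Jane Doe</PrimaryOwner>
<CustomerAddress>123 Main St</CustomerAddress>
</Row>
</body>
</html>"""


def extract_fields(source, wanted=("primaryowner", "customeraddress")):
    """Return {tag: text} for the first occurrence of each wanted tag.

    `source` is a filename, URL, or file-like object, as accepted by
    etree.parse. This helper is hypothetical, not part of lxml.
    """
    parser = etree.HTMLParser()
    tree = etree.parse(source, parser)
    found = {}
    # The HTML parser lowercases tag names, so compare against lowercase.
    for elem in tree.iter():
        if elem.tag in wanted:
            # Keep the first match only; later duplicates are ignored.
            found.setdefault(elem.tag, elem.text)
    return found


fields = extract_fields(StringIO(SAMPLE_HTML))
print("Primary Owner: %s" % fields.get("primaryowner"))
print("Address: %s" % fields.get("customeraddress"))
```

Returning a dict and reading it with .get() means a page that lacks one of the elements yields None for that field rather than raising NameError, which the version in the post would do if either tag never appears.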