HTML parsing/scraping & python
Mike Meyer
mwm at mired.org
Wed Nov 30 23:15:26 EST 2005
Sanjay Arora <sanjay.k.arora at gmail.com> writes:
> We are looking to select the language & toolset more suitable for a
> project that requires getting data from several web-sites in real-
> time....html parsing/scraping. It would require full emulation of the
> browser, including handling cookies, automated logins & following
> multiple web-link paths. Multiple threading would be a plus but not
> requirement.

Believe it or not, everything you ask for can be done by Python out of
the box. But there are limitations.

For one, the HTML parsing module that comes with Python doesn't handle
invalid HTML very well. Thanks to Netscape, invalid HTML is the rule
rather than the exception on the web. So you probably want to use a
third party module for that. I use BeautifulSoup, which handles XML and
HTML, has a *lovely* API (going from BeautifulSoup to DOM is always a
major disappointment), and works well with broken X/HTML.

That's sufficient for my needs, but I haven't been asked to do a lot of
automated form filling, so the facilities in the standard library work
for me. There are third party tools to help with that. I'm sure someone
will suggest them.

> Can you suggest solutions for python? Pros & Cons using Perl vs. Python?
> Why Python?

Because it's beautiful. Seriously, Python code is very readable, by
design. Of course, some of the features that make that happen drive
some people crazy. If you're one of them, then Python isn't the
language for you.

        <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
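
As a rough illustration of the "out of the box" cookie and login handling mentioned above, here is a minimal sketch using only the standard library. It assumes the Python 3 module names (http.cookiejar, urllib.request), which postdate this message, and the form field names "username" and "password" are hypothetical; a real site will use its own field names and may need extra hidden fields.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(url, username, password):
    """Build a POST request for a login form.

    The field names below are assumptions for illustration only.
    """
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    )
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(url, data=form.encode("ascii"))
```

Calling opener.open() on the login request and then on later URLs with the same opener should carry the session cookies along, which covers the "automated logins & following multiple web-link paths" part of the question. Parsing the returned pages is a separate step.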