HTMLparsing abnormal html pages
asle at spam.com
Thu Mar 15 03:50:41 EST 2001
Consider the small program below. Running it shows that the HTMLParser truncates URLs in the HTML page. Most of you will probably say that the page, and in particular its URLs, are not valid according to RFC 1738, so bad luck. But surely there is a work-around?

    import htmllib
    import urllib
    import formatter

    url = 'http://di.se/Scripts/Sections/allarticles.asp'
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    urls = parser.anchorlist
    print urls

One solution is of course to preprocess the whole HTML page and replace the invalid URLs with valid ones (using regular expressions?). I have also looked into HTMLParser and the formatter to see whether the problem can be corrected there, but without success.

Any comments on what to do?

/Asle
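The preprocessing idea mentioned above could look roughly like the sketch below. It rests on assumptions the post does not confirm: that the troublesome URLs are unquoted href values, and that wrapping them in quotes and percent-encoding the unsafe characters is enough. It also uses html.parser from Python 3 rather than htmllib, which no longer exists there. The names quote_hrefs, AnchorCollector and extract_urls are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import quote

# Assumption: an "invalid" URL is an unquoted href value, which we take to
# run up to the next whitespace character or '>'.
UNQUOTED_HREF = re.compile(r'(href\s*=\s*)([^\s"\'>][^\s>]*)', re.IGNORECASE)


def quote_hrefs(html):
    """Wrap unquoted href values in quotes and percent-encode unsafe characters."""
    def fix(match):
        prefix, value = match.group(1), match.group(2)
        # '%' is kept safe so that existing escapes are not encoded twice.
        safe_value = quote(value, safe="/:?&=#%+;,@~")
        return '%s"%s"' % (prefix, safe_value)
    return UNQUOTED_HREF.sub(fix, html)


class AnchorCollector(HTMLParser):
    """Collects href values of <a> tags, similar in spirit to anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def extract_urls(html):
    """Preprocess the page, then parse it and return the collected URLs."""
    parser = AnchorCollector()
    parser.feed(quote_hrefs(html))
    parser.close()
    return parser.anchorlist
```

Calling extract_urls() on the page text (for example, the decoded result of urllib.request.urlopen(url).read()) would then return the repaired URLs. Whether this helps depends on what is actually wrong with the URLs on the page, so the regular expression will probably need adjusting.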