Found a parsing bug in HTMLParser
Bengt Richter
bokr at oz.net
Sun Feb 9 16:38:36 EST 2003
On Sun, 9 Feb 2003 18:06:56 +0100, Grzegorz Adam Hankiewicz <gradha at terra.es> wrote:

>Hi.
>
>I've found a bug in HTMLParser parsing some of my webpages. The
>problem is using an attribute with a value inside double quotes
>which is near another attribute. I've created a small testcase

Too "near" to be legal HTML 4.0, I believe. From the spec:
(http://www.w3.org/TR/1998/REC-html40-19980424)
"""
3.2.2 Attributes

Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs,
separated by spaces, may appear in an element's start tag.
^^^^^^^^^^^^^^^^^^^
They may appear in any order.
"""
Your DTD specification is HTML 4.0, but even if it's trying to do new
XHTML stuff, XML requires a space before each attribute definition, i.e.,
from my XML spec copy of http://www.w3.org/TR/1998/REC-xml-19980210

    STag ::= '<' Name (S Attribute)* S? '>'
where
    S ::= (#x20 | #x9 | #xD | #xA)+

so it surprises me that you get an ok validation, though I'm not
surprised that browsers ignore anomalies.

>which you can see below. The w3c validator says the page is ok
>(http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
>and browsers render it without problems. Does it happen with newer
>Python versions? What's the procedure for bug reports?
>
>PD: Don't CC me your replies.
>
>$ cat test.html
><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
><html><head><title>t</title>
><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
></head><body>
><a href="http://ss"title="pe">P</a>
                    ^^^^^^^^^^ -- need white space in front of this, e.g.,
 <a href="http://ss" title="pe">P</a>
></body></html>
>
>$ python
>Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
>[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> from HTMLParser import HTMLParser
>>>> p = HTMLParser()
>>>> file = open("test.html", "rt")
>>>> p.feed("".join(file.readlines()))
>>>> file.close()
>>>> p.close()
>Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "/usr/lib/python2.2/HTMLParser.py", line 112, in close
>    self.goahead(1)
>  File "/usr/lib/python2.2/HTMLParser.py", line 166, in goahead
>    self.error("EOF in middle of construct")
>  File "/usr/lib/python2.2/HTMLParser.py", line 115, in error
>    raise HTMLParseError(message, self.getpos())
>HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
>

Seems like a better message could have been generated, though.

Regards,
Bengt Richter
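
For anyone who wants to check their own pages for this, here is a rough
sketch of the "whitespace before every attribute" rule from the STag
production quoted above. It is not how HTMLParser itself works; the
function name is_strict_start_tag and the simplified attribute pattern
(which also tolerates unquoted HTML-style values) are my own assumptions,
and it needs a Python recent enough to have re.fullmatch.

```python
import re

# Simplified attribute pattern (an assumption, not taken from either spec):
# a name, optionally followed by "=" and a double-quoted, single-quoted,
# or unquoted value.
_ATTR = r'''[A-Za-z_:][-A-Za-z0-9_:.]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?'''

# Shape of  STag ::= '<' Name (S Attribute)* S? '>'
# where S is one or more of space, tab, CR, LF.
_STRICT_STAG = re.compile(
    r'<[A-Za-z_:][-A-Za-z0-9_:.]*'      # '<' Name
    r'(?:[ \t\r\n]+' + _ATTR + r')*'    # (S Attribute)*
    r'[ \t\r\n]*>'                      # S? '>'
)

def is_strict_start_tag(tag):
    """Return True if every attribute in the start tag is preceded by whitespace."""
    return _STRICT_STAG.fullmatch(tag) is not None

# The tag from the test case, and the corrected form.
print(is_strict_start_tag('<a href="http://ss"title="pe">'))
print(is_strict_start_tag('<a href="http://ss" title="pe">'))
```

The first call should report the tag from test.html as not conforming, and
the second should accept the version with the space added.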