Issue755670
Created on 2003-06-17 03:09 by smroid, last changed 2022-04-10 16:09 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| diff.txt | smroid, 2003-06-17 03:09 | |||
| issue755670.diff | ezio.melotti, 2011-11-01 16:11 | Failing test | review | |
| Messages (11) | |||
|---|---|---|---|
| msg44032 - (view) | Author: Steven Rosenthal (smroid) | Date: 2003-06-17 03:09 | |
HTML examples seen in the wild that cause parse errors
in HTMLParser include:
<a width="100%"cellspacing=0>
-- note lack of space between val and next attr name
<a foo=>
-- trailing attribute has no value after =
<a href=javascript:popup('/popup/html.html')>
-- javascript fragment with embedded quotes
My patch contains improvements to the 'attrfind' and
'locatestarttagend' regexps that allow these examples
to parse.
The existing test_htmlparser.py unit test continues to
pass, except for the one test case where it considers
<a foo=> to be an error.
I commented out that case and added new test cases to
cover the examples above.
|
|||
| msg44033 - (view) | Author: Steven Rosenthal (smroid) | Date: 2003-06-17 03:10 | |
Logged In: YES user_id=159908 Base version for HTMLParser.py is 1.11.2.1; base version for test_htmlparser.py is 1.8.8.1 |
|||
| msg44034 - (view) | Author: Steven Rosenthal (smroid) | Date: 2003-06-18 03:22 | |
Logged In: YES user_id=159908 This also fixes bugs 683938 and 699079. |
|||
| msg44035 - (view) | Author: Titus Brown (titus) | Date: 2004-12-19 00:42 | |
Logged In: YES user_id=23486 This patch allows developers to override the behavior of HTMLParser when parsing malformed HTML. Normally HTMLParser calls the function self.error(), which raises an exception. This patch adds appropriate return values for situations where self.error has been redefined in subclasses to *not* raise an exception. It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues. The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error(). The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well. Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people. An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;). If nothing done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;). |
|||
| msg44036 - (view) | Author: Titus Brown (titus) | Date: 2004-12-19 00:45 | |
Logged In: YES user_id=23486 whoops, attached to wrong patch! dangitall. sorry... |
|||
| msg44037 - (view) | Author: Titus Brown (titus) | Date: 2004-12-19 06:58 | |
Logged In: YES user_id=23486 I don't think HTMLParser should parse clearly invalid HTML without complaining. Perhaps such behavior should be overrideable (see patch 755660 & comments therein), but this patch changes behavior to parse invalid HTML w/o complaint. Patch 699079, to "fix" similar behavior, was closed by adsr and bcannon because such behavior is correct. Patch 683938 is also similar but is being kept open for some reason. Recommend closing patch w/o applying. |
|||
| msg55984 - (view) | Author: T. Middleton (timtoo) | Date: 2007-09-17 21:20 | |
I for one thank smroid for the patch. I also have hit *all* of these cases in the wild. This patch makes real-life a lot less frustrating. This patch is surely a lot more preferable than HTMLParser's tendency to just throw up its hands and quietly give up. And one might even make a case that in the horrific world of HTML "standards" case #2 and #3 can be considered actually valid (as much as it hurts to say so). |
|||
| msg114245 - (view) | Author: Mark Lawrence (BreamoreBoy) * | Date: 2010-08-18 16:32 | |
There are messages both for and against the patch which contains a unit test. Can we have a statement from a knowledgeable HTML person as to whether the patch should be accepted or rejected. |
|||
| msg146786 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-01 16:11 | |
Attached patch includes the tests in diff.txt. On Python 3, with strict=False, the first test (adjacent attributes) passes, but the other two still fail. See also #12629. |
|||
| msg147611 - (view) | Author: Roundup Robot (python-dev) ![]() |
Date: 2011-11-14 16:57 | |
New changeset 3c3009f63700 by Ezio Melotti in branch '2.7': #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. http://hg.python.org/cpython/rev/3c3009f63700 New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2': #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. http://hg.python.org/cpython/rev/16ed15ff0d7c New changeset 426f7a2b1826 by Ezio Melotti in branch 'default': #1745761, #755670, #13357, #12629, #1200313: merge with 3.2. http://hg.python.org/cpython/rev/426f7a2b1826 |
|||
| msg147619 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-11-14 17:07 | |
Fixed, thanks for the report! |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-10 16:09:16 | admin | set | github: 38664 |
| 2011-11-14 17:07:52 | ezio.melotti | set | status: open -> closed versions: + Python 2.7 messages: + msg147619 dependencies:
- allow HTMLParser to continue after a parse error |
| 2011-11-14 16:57:13 | python-dev | set | nosy:
+ python-dev messages: + msg147611 |
| 2011-11-14 12:45:57 | ezio.melotti | set | assignee: ezio.melotti |
| 2011-11-01 16:11:55 | ezio.melotti | set | files:
+ issue755670.diff versions: + Python 3.3 nosy: + ezio.melotti, - BreamoreBoy messages: + msg146786 type: enhancement -> behavior |
| 2010-08-18 16:32:10 | BreamoreBoy | set | versions:
+ Python 3.2, - Python 2.7 nosy: + BreamoreBoy messages: + msg114245 stage: test needed -> patch review |
| 2009-02-12 03:01:19 | ajaksu2 | set | dependencies:
+ HTMLParser attribute parsing bug, allow HTMLParser to continue after a parse error type: enhancement stage: test needed versions: + Python 2.7, - Python 2.3 |
| 2007-09-17 21:20:47 | timtoo | set | nosy:
+ timtoo messages: + msg55984 |
| 2003-06-17 03:09:17 | smroid | create | |

