Message 163706 - Python tracker

Author	ezio.melotti
Recipients	Brian.Jones, eric.araujo, eric.smith, ezio.melotti, hp.dekoning, loewis, python-dev
Date	2012-06-24.03:11:35
Message-id	<1340507496.32.0.519686741143.issue11113@psf.upfronthosting.co.za>

Content

The problem is that the standard allows some character references to end without a ';', but not all of them.

So both "&Eacuteric" and "&Eacute;ric" will be parsed as "Éric", but only "&alpha;centauri" will result in "αcentauri" -- "&alphacentauri" will be returned unchanged.
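
For reference, the four cases above can be tried with html.unescape(), which was added to the stdlib in Python 3.4, i.e. after this message was written. This is only a quick check of the behaviour described here, not the code under discussion in this issue:

```python
import html

# The four inputs discussed above: with and without the trailing ';'.
samples = ["&Eacuteric", "&Eacute;ric", "&alpha;centauri", "&alphacentauri"]

for s in samples:
    # html.unescape() follows the HTML5 rules, so "Eacute" is accepted
    # without ';' while "alpha" is not.
    print(repr(s), "->", repr(html.unescape(s)))
```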

I'm now working on #15156 to use this dict in HTMLParser, and detecting the ';'-less entities is not easy.  A possible solution is to keep the names that are accepted without ';' in a separate (private) dict and expose a function like HTMLParser.unescape that implements all the necessary logic.

Regarding ChainMap, the html5 dict should be a superset of the html4 one.
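
One way to check the superset claim, sketched under the assumption that the html4 names are those in html.entities.name2codepoint (stored without ';') and the html5 names are the keys of html.entities.html5 (stored with ';'). The helper name is hypothetical, and it compares names only, not the characters they map to:

```python
from html.entities import html5, name2codepoint

def html4_names_missing_from_html5():
    """Return the HTML 4 entity names that have no 'name;' key in html5."""
    return {name for name in name2codepoint if name + ';' not in html5}

# An empty set means every HTML 4 name is also an HTML5 name.
print(html4_names_missing_from_html5())
```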
