Issue 17089: Expat parser parses strings only when XML encoding is UTF-8
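xmlparser.Parse() handles string (str) data correctly only if the encoding declared in the XML is utf-8 (or ascii). With any other declared encoding the result is either wrong character data or an error. Examples: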
>>> import xml.parsers.expat
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='utf-8'?><tag>\xb5</tag>")
1
>>> content
['µ']
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='iso8859'?><tag>\xb5</tag>")
1
>>> content
['Âµ']
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='utf-16'?><tag>\xb5</tag>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: encoding specified in XML declaration is incorrect: line 1, column 30
This affects all other modules which work with XML: xml.sax, xml.dom.minidom, xml.dom.pulldom, xml.etree.ElementTree.
Here is a patch which fixes parsing of string data when the XML declares a non-UTF-8 encoding.
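For comparison, a possible workaround that does not depend on the patch is to pass bytes instead of str, encoded in the same encoding that the XML declaration names, so that the declaration and the actual data agree. The sketch below is only an illustration of that idea, not part of the patch; the helper name parse_chars is hypothetical.

```python
import xml.parsers.expat


def parse_chars(data):
    """Parse `data` (str or bytes) and return all character data as one str.

    Hypothetical helper for illustration; not part of pyexpat.
    """
    parser = xml.parsers.expat.ParserCreate()
    content = []
    parser.CharacterDataHandler = content.append
    # The second argument marks this as the final chunk of the document.
    parser.Parse(data, True)
    # Character data may arrive in several pieces, so join them.
    return "".join(content)


for encoding in ("utf-8", "iso-8859-1", "utf-16"):
    text = "<?xml version='1.0' encoding='%s'?><tag>\xb5</tag>" % encoding
    # Encode with the declared encoding so the bytes match the declaration.
    print(encoding, ascii(parse_chars(text.encode(encoding))))
```

With bytes input the parser decodes the data according to the declaration (or the byte order mark for utf-16), so the mismatch described above does not arise.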