Message 136194 - Python tracker

Message136194

Author	vstinner
Recipients	hyeshik.chang, lemburg, vstinner
Date	2011-05-17.23:10:03
Message-id	<1305673805.66.0.039322983909.issue12100@psf.upfronthosting.co.za>

Content

The incremental encoders of stateful CJK codecs reset the codec at each call to encode(), producing valid but overlong output:

>>> import codecs
>>> encoder = codecs.getincrementalencoder('hz')()
>>> encoder.encode('\u804a') + encoder.encode('\u804a')
b'~{AD~}~{AD~}'
>>> '\u804a\u804a'.encode('hz')
b'~{ADAD~}'

Affected multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp).
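
To check a given interpreter for this behaviour, a small sketch along these lines may help. It encodes the same text once in a single call and once character by character, then reports the size difference. The helper names (encode_in_chunks, compare) and the iso-2022-jp sample text are assumptions made for this illustration; they are not part of the attached patch.

```python
import codecs

def encode_in_chunks(text, encoding):
    """Encode text one character per encode() call, then finish the stream."""
    encoder = codecs.getincrementalencoder(encoding)()
    chunks = [encoder.encode(char) for char in text]
    # An empty final call lets the encoder emit any closing escape sequence.
    chunks.append(encoder.encode('', final=True))
    return b''.join(chunks)

def compare(text, encoding):
    """Return (one_shot, chunked) encodings of text."""
    return text.encode(encoding), encode_in_chunks(text, encoding)

# Sample texts chosen for this sketch: one stateful codec from each family.
samples = [('hz', '\u804a\u804a'), ('iso-2022-jp', '\u4e16\u4e16')]

for encoding, text in samples:
    one_shot, chunked = compare(text, encoding)
    # A positive difference means the chunked output is overlong.
    print(encoding, one_shot, chunked, len(chunked) - len(one_shot))
```

If the encoder keeps its state between calls, the two outputs should be identical. If it resets on every call, the chunked output is longer but still decodes to the same text.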

The attached patch fixes this issue. I don't like how I added the tests; they may be moved somewhere else, but the HZ codec doesn't have tests today (I opened issue #12057 for that), and the ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). Maybe we should add tests specific to ISO 2022 first?

I am unsure whether to reset the codec on .encode(text, final=True): UTF-8-SIG and UTF-16 don't reset the codec when final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0); on a seek to any other position, it calls encoder.setstate(0).
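
For reference, here is a minimal sketch of the UTF-8-SIG behaviour mentioned above, using only the standard incremental encoder API. It shows that final=True leaves the encoder state alone, that reset() restores the initial state, and that setstate(0) marks the BOM as already written. The variable names are chosen for this illustration only.

```python
import codecs

encoder = codecs.getincrementalencoder('utf-8-sig')()

# The first call writes the BOM, even with final=True.
first = encoder.encode('a', final=True)
# final=True did not reset the encoder, so no second BOM is written.
second = encoder.encode('b', final=True)

# An explicit reset() returns to the initial state, so the BOM comes back.
encoder.reset()
after_reset = encoder.encode('c')

# setstate(0) marks the BOM as already written without emitting it.
encoder.reset()
encoder.setstate(0)
after_setstate = encoder.encode('d')

print(first, second, after_reset, after_setstate)
# prints: b'\xef\xbb\xbfa' b'b' b'\xef\xbb\xbfc' b'd'
```

This matches the TextIOWrapper usage described above: reset() fits a seek to the start of the file, where the BOM belongs, while setstate(0) fits any other position.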

See also issues #12016 and #12057.

History
Date	User	Action	Args
2011-05-17 23:10:05	vstinner	set	recipients: + vstinner, lemburg, hyeshik.chang
2011-05-17 23:10:05	vstinner	set	messageid: <1305673805.66.0.039322983909.issue12100@psf.upfronthosting.co.za>
2011-05-17 23:10:05	vstinner	link	issue12100 messages
2011-05-17 23:10:04	vstinner	create