Issue 33255
Created on 2018-04-10 09:21 by nhatcher, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Pull Requests | | | |
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 6523 | closed | nhatcher, 2018-04-18 19:44 | |

| Messages (5) | | | |
|---|---|---|---|
| msg315164 - (view) | Author: Nicolás Hatcher (nhatcher) * | Date: 2018-04-10 09:21 | |
Hey, I'm new here, so please let me know if I'm doing anything incorrectly!
I _think_ `json.dumps(o, ensure_ascii=False)` is doing the wrong thing when `o` has both unicode and str keys/values. For instance:
```
import json
o = {u"greeting": "hi", "currency": "€"}
json.dumps(o, ensure_ascii=False, encoding="utf8")
json.dumps(o, ensure_ascii=False)
```
The first `dumps` call works while the second fails. The reason is:
https://github.com/python/cpython/blob/2.7/Lib/json/encoder.py#L198
That line decodes any `str` only when the encoding is not exactly 'utf-8'. In the mixed case (unicode and str) this blows up. A workaround is to use any of the aliases for 'utf-8', such as 'utf8' or 'u8'.
I would be crazy happy to provide a PR if this is really an issue.
Let me know if extra clarification is needed.
Nicolás
| msg315270 - (view) | Author: Ivan Pozdeev (Ivan.Pozdeev) * | Date: 2018-04-13 22:20 | |
Treating 'utf-8' and its aliases differently (when they specifically mean Python's own encoding, rather than somebody else's) is definitely an issue. You shouldn't hardcode a list of aliases, though; rather, use existing facilities to resolve them. From quick googling, e.g. `codecs.lookup(<encoding>).name` can get the canonical name. Make sure to follow https://devguide.python.org/pullrequest when doing the PR; a test case will likely be needed, too.
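The alias-resolution suggestion above can be sketched like this (a minimal sketch; `codecs.lookup` behaves the same way in Python 2 and 3, and the helper name `is_utf8` is purely illustrative):

```python
import codecs

def is_utf8(encoding):
    # Resolve the encoding through the codec registry instead of
    # string-comparing against a hardcoded list of alias spellings.
    return codecs.lookup(encoding).name == 'utf-8'

print(is_utf8('utf-8'))    # True
print(is_utf8('utf8'))     # True
print(is_utf8('u8'))       # True
print(is_utf8('latin-1'))  # False
```

With a check like this, the encoder at `encoder.py#L198` would take the same code path for 'utf-8' and for all of its registered aliases.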
| msg315478 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2018-04-19 05:47 | |
In simplejson:
>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False, encoding="utf8")
u'{"currency": "\u20ac", "greeting": "hi"}'
>>> simplejson.dumps({u"greeting": "hi", "currency": "€"}, ensure_ascii=False)
u'{"currency": "\u20ac", "greeting": "hi"}'
I think it makes sense to fix the case for "utf-8".
| msg315890 - (view) | Author: Nicolás Hatcher (nhatcher) * | Date: 2018-04-29 12:08 | |
Hi Serhiy,
I am OK with that change. I think it makes much more sense, but I also think it will break people's code. At least with the simplest fix, in which:
>>> json.dumps({"g"}, ensure_ascii=False)
u'"g"'
which is again compatible with simplejson.
Although the documentation is not clear on this point, there might be code out there relying on this behaviour.
Is that acceptable?
| msg315895 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2018-04-29 13:36 | |
You could decode only non-ascii strings. But I'm not sure it is worth changing anything in 2.7; this could be treated as a new feature. Leaving this to Benjamin, the release manager of 2.7.
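The "decode only non-ascii strings" idea could look roughly like this (a hedged sketch, not the actual patch; the real change would live in Python 2.7's `json/encoder.py`, but the hypothetical helper below is written to run under Python 3 as well, with `bytes` standing in for 2.x `str`):

```python
def maybe_decode(s, encoding='utf-8'):
    # Hypothetical helper: decode a byte string only when it actually
    # contains non-ASCII bytes, leaving pure-ASCII strings untouched
    # so existing code that expects str output keeps working.
    try:
        s.decode('ascii')
        return s  # pure ASCII: no behaviour change
    except UnicodeDecodeError:
        return s.decode(encoding)  # non-ASCII: promote to unicode

maybe_decode(b'hi')                     # b'hi' (unchanged)
maybe_decode('\u20ac'.encode('utf-8'))  # '\u20ac' (decoded)
```

Pure-ASCII output stays the same type as before, which limits the backward-compatibility concern raised in msg315890 to strings that would previously have raised anyway.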
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:59 | admin | set | github: 77436 |
| 2020-01-05 20:50:13 | cheryl.sabella | set | status: open -> closed; resolution: wont fix; stage: patch review -> resolved |
| 2018-08-15 13:57:33 | mcepl | set | nosy: + mcepl |
| 2018-04-29 13:36:39 | serhiy.storchaka | set | nosy: + benjamin.peterson; messages: + msg315895 |
| 2018-04-29 12:08:30 | nhatcher | set | messages: + msg315890 |
| 2018-04-19 05:47:35 | serhiy.storchaka | set | nosy: + bob.ippolito, serhiy.storchaka; messages: + msg315478 |
| 2018-04-18 19:44:34 | nhatcher | set | keywords: + patch; stage: patch review; pull_requests: + pull_request6217 |
| 2018-04-13 22:20:39 | Ivan.Pozdeev | set | nosy: + Ivan.Pozdeev; messages: + msg315270 |
| 2018-04-10 09:21:14 | nhatcher | create | |
