Issue 36789: Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes

Created on 2019-05-04 00:00 by mbiggs, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked
PR 13111	merged	redshiftzero, 2019-05-06 15:08
PR 13188	closed	mbiggs, 2019-05-08 11:13
PR 13383	merged	miss-islington, 2019-05-17 10:48

Messages (6)
msg341363 - (view)	Author: mbiggs (mbiggs) *	Date: 2019-05-04 00:00
In the Unicode HOWTO (http://docs.python.org/3.3/howto/unicode.html), it says the following: "UTF-8 has several convenient properties: (...) 2. A Unicode string is turned into a sequence of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes." This is not right. UTF-8 uses the zero byte to represent the Unicode code point U+0000 (the ASCII NULL character). This is a valid character in UTF-8 and is handled just fine by Python's UTF-8 string encoding/decoding.
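
The claim in this report can be checked directly. The following is a minimal sketch (the sample string is arbitrary) showing that standard UTF-8 in Python encodes U+0000 as a single zero byte and decodes it back unchanged:

```python
# U+0000 is a valid code point; standard UTF-8 encodes it as the single byte 0x00.
text = "spam\x00eggs"

encoded = text.encode("utf-8")
print(encoded)           # b'spam\x00eggs'
print(0 in encoded)      # True: the encoded bytes contain an embedded zero byte

decoded = encoded.decode("utf-8")
print(decoded == text)   # True: the round trip preserves the character
```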
msg341364 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2019-05-04 00:06
This is right for 99.99% of cases: UTF-8 doesn't use zero bytes to encode any character other than an explicit U+0000. UTF-16, for example, encodes 'a' as b'\xff\xfea\x00'.
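
To illustrate the contrast drawn here, the sketch below encodes a sample string containing no U+0000 in both encodings. The sample characters are arbitrary, and "utf-16-le" is used instead of "utf-16" so the output does not depend on the platform byte order or include a BOM (the b'\xff\xfe' prefix in the message above is the BOM that the plain "utf-16" codec writes on a little-endian machine).

```python
# Sample text with 1-, 2-, 3- and 4-byte UTF-8 characters, but no U+0000.
sample = "a\u00e9\u20ac\U0001d11e"

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")  # explicit byte order, no BOM

print(0 in utf8)    # False: no zero bytes unless the text contains U+0000
print(0 in utf16)   # True: 'a' alone already produces a zero byte
print("a".encode("utf-16-le"))  # b'a\x00'
```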
msg341414 - (view)	Author: mbiggs (mbiggs) *	Date: 2019-05-05 01:27
So a correct statement would be "A UTF-8 string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the NULL character (U+0000)." I think it's important to correct this because the part about processing UTF-8 with C functions like strcpy() was wrong and could cause bugs.
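
The kind of bug described here can be sketched from Python with ctypes. This is an illustration rather than a call to strcpy() itself: the `value` attribute of a ctypes string buffer reads the buffer as a NUL-terminated C string, which is the same stopping rule strcpy() applies. The payload is an arbitrary example.

```python
import ctypes

# UTF-8 bytes for text that contains U+0000 in the middle.
payload = "ab\x00cd".encode("utf-8")

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(payload)

# .value reads up to the first zero byte, as C string functions do.
c_view = buf.value

print(payload)        # b'ab\x00cd'
print(c_view)         # b'ab'  (everything after the embedded NUL is lost)
print(len(buf.raw))   # 6: all five payload bytes plus the terminator are stored
```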
msg341418 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-05-05 05:54
I agree that the documentation should be updated. Would you mind creating a pull request, mbiggs? There are UTF-8 variants which guarantee that the encoded text has no zero bytes (see Modified UTF-8), but Python only provides the standard UTF-8 and UTF-8 with BOM.
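
As a rough sketch of the variant mentioned here: Modified UTF-8 writes U+0000 as the two-byte sequence C0 80 instead of a zero byte. The helpers below, `encode_nul_free` and `decode_nul_free`, are hypothetical names and not part of Python. They handle only the NUL substitution, so they match Modified UTF-8 only for text in the Basic Multilingual Plane (the real format also encodes supplementary characters differently). The last line shows the built-in "utf-8-sig" codec, which is Python's UTF-8 with BOM.

```python
def encode_nul_free(text: str) -> bytes:
    """Hypothetical helper: standard UTF-8, but U+0000 becomes C0 80."""
    # In standard UTF-8 the byte 0x00 appears only as the encoding of U+0000,
    # so a plain byte replacement is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Hypothetical helper: inverse of encode_nul_free."""
    # 0xC0 never occurs in valid standard UTF-8, so C0 80 is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


text = "ab\x00cd"
packed = encode_nul_free(text)
print(packed)                           # b'ab\xc0\x80cd'
print(0 in packed)                      # False
print(decode_nul_free(packed) == text)  # True

# The standard codec with a BOM prefix:
print("a".encode("utf-8-sig"))          # b'\xef\xbb\xbfa'
```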
msg341477 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2019-05-05 22:25
Minor bikeshed: If updating the documentation, refer to U+0000 as "the null character" or "NUL", not "NULL". Using "NULL" allows for confusion with NULL pointers; "the null character" (the name used in the Unicode standard) or "NUL" (the official three-letter abbreviation in ASCII, and in Unicode too, I think) has no such opportunity for confusion.
msg341948 - (view)	Author: mbiggs (mbiggs) *	Date: 2019-05-08 21:46
Ah, I sent a pull request but didn't realize that redshiftzero already had. Their PR looks good to me. Thanks for fixing this!

History
Date	User	Action	Args
2022-04-11 14:59:14	admin	set	github: 80970
2019-05-17 11:05:14	cheryl.sabella	set	status: open -> closed; resolution: fixed; stage: patch review -> resolved
2019-05-17 10:48:34	miss-islington	set	pull_requests: + pull_request13294
2019-05-08 21:46:59	mbiggs	set	messages: + msg341948
2019-05-08 11:13:00	mbiggs	set	pull_requests: + pull_request13102
2019-05-06 15:08:46	redshiftzero	set	keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request13026
2019-05-06 01:44:48	ezio.melotti	set	nosy: + ezio.melotti; type: enhancement
2019-05-05 22:25:59	josh.r	set	nosy: + josh.r; messages: + msg341477
2019-05-05 05:54:52	serhiy.storchaka	set	versions: - Python 3.5, Python 3.6; nosy: + serhiy.storchaka; messages: + msg341418; keywords: + easy; stage: needs patch
2019-05-05 01:27:28	mbiggs	set	messages: + msg341414
2019-05-04 00:06:45	asvetlov	set	nosy: + asvetlov; messages: + msg341364
2019-05-04 00:00:17	mbiggs	create