Issue36789
Created on 2019-05-04 00:00 by mbiggs, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Pull Requests
| URL | Status | Linked |
|---|---|---|
| PR 13111 | merged | redshiftzero, 2019-05-06 15:08 |
| PR 13188 | closed | mbiggs, 2019-05-08 11:13 |
| PR 13383 | merged | miss-islington, 2019-05-17 10:48 |
Messages (6)
| msg341363 | Author: mbiggs (mbiggs) | Date: 2019-05-04 00:00 |
The Unicode HOWTO (http://docs.python.org/3.3/howto/unicode.html) says the following:
"UTF-8 has several convenient properties: (...) 2. A Unicode string is turned into a sequence of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes."
This is not right. UTF-8 uses the zero byte to represent the Unicode code point U+0000 (the ASCII NULL character). This is a valid character in UTF-8 and is handled just fine by Python's UTF-8 string encoding/decoding.
|||
| msg341364 | Author: Andrew Svetlov (asvetlov) | Date: 2019-05-04 00:06 |
This is right for 99.99% of cases: UTF-8 never produces a zero byte except when encoding an explicit U+0000. UTF-16, for example, encodes 'a' as b'\xff\xfea\x00'.
|||
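The two claims above can be checked directly in Python. The following is a minimal sketch and not part of the original thread; the helper name `chars_with_zero_byte` is hypothetical, and UTF-16 is encoded with an explicit little-endian byte order (plus a hand-added BOM) so the result does not depend on the platform.

```python
import codecs


def chars_with_zero_byte():
    """Return every code point whose UTF-8 encoding contains a zero byte.

    Hypothetical helper: it walks all Unicode code points, skipping the
    surrogate range, which strict UTF-8 refuses to encode.
    """
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if 0 in chr(cp).encode("utf-8"):
            found.append(cp)
    return found


# U+0000 is a valid character, and UTF-8 encodes it as a single zero byte.
nul_encoded = "a\x00b".encode("utf-8")

# Only U+0000 itself ever produces a zero byte in UTF-8.
offenders = chars_with_zero_byte()

# UTF-16 produces zero bytes even for plain ASCII text.
utf16_a = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")
```

The exhaustive loop takes a moment to run, but it shows both sides of the discussion at once: the HOWTO statement holds for every character except U+0000, and that single exception is a real one.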
| msg341414 | Author: mbiggs (mbiggs) | Date: 2019-05-05 01:27 |
So a correct statement would be "A UTF-8 string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the NULL character (U+0000)." I think it's important to correct this because the part about processing UTF-8 with C functions like strcpy() was wrong and could cause bugs.
|||
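The hazard with C string functions can be illustrated without leaving Python. This is a minimal sketch, not something proposed in the thread: it relies on the standard-library `ctypes` module, where the `value` attribute of a character buffer reads bytes as a NUL-terminated C string (the way `strcpy()` or `strlen()` would), while `raw` returns the whole buffer. The sample payload is invented for illustration.

```python
import ctypes

# Valid UTF-8 that contains an embedded U+0000.
payload = "user\x00admin".encode("utf-8")

# The buffer holds the payload plus the terminating NUL that ctypes appends.
buf = ctypes.create_string_buffer(payload)

# C-style view: stops at the first zero byte, silently dropping the rest.
seen_by_c = buf.value

# Full contents of the buffer, including everything after the embedded NUL.
full_contents = buf.raw
```

Code that passes UTF-8 data to C APIs expecting NUL-terminated strings would see only `seen_by_c`, which is the kind of bug the corrected wording is meant to warn about.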
| msg341418 | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2019-05-05 05:54 |
I agree that the documentation should be updated. Would you mind creating a pull request, mbiggs? There are UTF-8 variants which guarantee that the encoded text has no zero bytes (see Modified UTF-8), but Python only provides the standard UTF-8 and UTF-8 with BOM.
|||
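For readers unfamiliar with Modified UTF-8, here is a rough sketch of what such an encoder could look like. It is an illustration only: Python ships no such codec, the function name `modified_utf8_encode` is hypothetical, and the rules it follows (U+0000 written as the two bytes C0 80, characters above U+FFFF written as a surrogate pair with each half in three-byte form) are assumed from the variant used by Java.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Encode text so that the output never contains a zero byte (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Overlong two-byte form of U+0000; standard UTF-8 forbids it.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, each half encoded separately.
            cp -= 0x10000
            for half in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(half).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


encoded = modified_utf8_encode("a\x00b")

# Python's strict UTF-8 decoder rejects the overlong C0 80 sequence.
try:
    encoded.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False

# The two variants Python does provide: plain "utf-8" and "utf-8-sig" (with BOM).
with_bom = "a".encode("utf-8-sig")
```

The output of this sketch is deliberately not valid standard UTF-8, which is why the two formats should not be mixed up when exchanging data.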
| msg341477 | Author: Josh Rosenberg (josh.r) | Date: 2019-05-05 22:25 |
Minor bikeshed: If updating the documentation, refer to U+0000 as "the null character" or "NUL", not "NULL". Using "NULL" allows for confusion with NULL pointers; "the null character" (the name used in the Unicode standard) or "NUL" (the official three letter abbreviation in ASCII, Unicode too I think) has no such opportunity for confusion.
|||
| msg341948 | Author: mbiggs (mbiggs) | Date: 2019-05-08 21:46 |
Ah, I sent a pull request but didn't realize that redshiftzero already had. Their PR looks good to me. Thanks for fixing this!
|||
History
| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:59:14 | admin | set | github: 80970 |
| 2019-05-17 11:05:14 | cheryl.sabella | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
| 2019-05-17 10:48:34 | miss-islington | set | pull_requests: + pull_request13294 |
| 2019-05-08 21:46:59 | mbiggs | set | messages: + msg341948 |
| 2019-05-08 11:13:00 | mbiggs | set | pull_requests: + pull_request13102 |
| 2019-05-06 15:08:46 | redshiftzero | set | keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request13026 |
| 2019-05-06 01:44:48 | ezio.melotti | set | nosy: + ezio.melotti; type: enhancement |
| 2019-05-05 22:25:59 | josh.r | set | nosy: + josh.r; messages: + msg341477 |
| 2019-05-05 05:54:52 | serhiy.storchaka | set | versions: - Python 3.5, Python 3.6; nosy: + serhiy.storchaka; messages: + msg341418; keywords: + easy |
| 2019-05-05 01:27:28 | mbiggs | set | messages: + msg341414 |
| 2019-05-04 00:06:45 | asvetlov | set | nosy: + asvetlov; messages: + msg341364 |
| 2019-05-04 00:00:17 | mbiggs | create | |

