Message341377
| Author | eryksun |
|---|---|
| Recipients | Paul Monson, eryksun, methane, paul.moore, steve.dower, tim.golden, vstinner, zach.ware |
| Date | 2019-05-04.07:35:12 |
| Message-id | <1556955312.8.0.837007007732.issue36778@roundup.psfhosted.org> |
| In-reply-to |
| Content | |
|---|---|
> cp65001 is *not* utf-8: Microsoft decided to handle surrogates
> differently for some reasons.
Do you mean valid UTF-16 surrogate pairs? For example:
>>> import codecs
>>> codecs.code_page_encode(65001, '\ud800\udc00')
(b'\xf0\x90\x80\x80', 2)
PyUnicode_AsUnicodeAndSize is neutral about surrogate codes: it stores them unchanged in a 16-bit wchar_t string. In this case the Python string contains two separate surrogate codes, but WideCharToMultiByte receives them as a UTF-16 surrogate pair for the single character U+10000.
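The pairing can be reproduced without the Windows-only code page functions. The following is only a rough, portable illustration: it assumes that a round trip through `utf-16-le` with the `surrogatepass` error handler is a fair stand-in for the 16-bit wchar_t buffer, and it does not call WideCharToMultiByte. The variable names are made up for the example.

```python
# Two separate surrogate code points, as in the example above.
text = '\ud800\udc00'
print(len(text))  # 2

# Strict UTF-8 refuses to encode surrogate code points.
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    strict_utf8_rejects = True
else:
    strict_utf8_rejects = False
print(strict_utf8_rejects)  # True

# Store the two codes unchanged as 16-bit units (stand-in for the
# wchar_t buffer), then read those units back as UTF-16.
utf16_units = text.encode('utf-16-le', 'surrogatepass')
paired = utf16_units.decode('utf-16-le')

print(paired == '\U00010000')  # True: one character, U+10000
print(paired.encode('utf-8'))  # b'\xf0\x90\x80\x80'
```

Once the two 16-bit units are read as UTF-16 they form a valid pair, so the result is the ordinary 4-byte UTF-8 encoding of U+10000, matching the bytes returned by the code page 65001 call.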
Anyway, it seems to me this issue will be resolved if cp65001.py is rewritten without functools.partial. |
|
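
As a rough sketch of what "rewritten without functools.partial" could look like, the partial objects can be replaced by plain functions that pass the code page explicitly. This is an assumption about the shape of the change, not the actual contents of cp65001.py; the names `CODE_PAGE`, `encode`, and `decode` are illustrative, and `codecs.code_page_encode` / `codecs.code_page_decode` exist only on Windows.

```python
import codecs

# Hypothetical module-level constant for the UTF-8 code page.
CODE_PAGE = 65001

def encode(input, errors='strict'):
    # Plain function in place of
    # functools.partial(codecs.code_page_encode, 65001).
    return codecs.code_page_encode(CODE_PAGE, input, errors)

def decode(input, errors='strict'):
    # final=True: treat the input as complete (stateless decoding).
    return codecs.code_page_decode(CODE_PAGE, input, errors, True)
```

The codecs functions are looked up only when `encode` or `decode` is called, so the definitions themselves import cleanly on any platform and need no `functools` import.
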
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2019-05-04 07:35:12 | eryksun | set | recipients: + eryksun, paul.moore, vstinner, tim.golden, methane, zach.ware, steve.dower, Paul Monson |
| 2019-05-04 07:35:12 | eryksun | set | messageid: <1556955312.8.0.837007007732.issue36778@roundup.psfhosted.org> |
| 2019-05-04 07:35:12 | eryksun | link | issue36778 messages |
| 2019-05-04 07:35:12 | eryksun | create | |