Issue26917
Created on 2016-05-03 08:48 by arigo, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | | | |
|---|---|---|---|
| File name | Uploaded | Description | Edit |
| hangul_composition.patch | vstinner, 2016-05-03 10:02 | | review |
| Messages (10) | | |
|---|---|---|
| msg264697 - (view) | Author: Armin Rigo (arigo) | Date: 2016-05-03 08:48 |
There is an apparent inconsistency in unicodedata.normalize("NFC"), introduced with the switch from Unicode DB 5.1.0 to 5.2.0 (in Python 2.7). First, please note that my knowledge of Unicode is limited, so I may be wrong and the following behavior might be perfectly correct.
```python
>>> from unicodedata import normalize
>>> print(normalize("NFC", "---\uafb8\u11a7---").encode('utf-8'))
b'---\xea\xbe\xb8\xe1\x86\xa7---'    # i.e., the same as the input
>>> print(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1").encode('utf-8'))
b'---\xea\xbe\xb8---\xe3\xa4\xba'
```
Note how in the second example the initial two-character part is replaced with a single character (actually the first of them). This does not occur in the first example. In Python 2.6, both inputs would be normalized to the single-character output.
The new behavior introduced in Python 2.7 is to first run a quick check on the string: if this `is_normalized()` function returns 1, we know that the string should already be normalized and we return it unmodified. However, the example "\uafb8\u11a7" shows contradictory behavior: is_normalized() returns 1 for it, yet actual normalization changes it. We can see in the second example above that if, for an unrelated reason, we force is_normalized() to return 0 (by adding a non-normalized character elsewhere in the string), then the "\uafb8\u11a7" part is changed.
This is a bit unexpected, but I don't know if it is officially correct behavior or if the problem is a bug in `is_normalized()`.
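The fast path can be checked directly with a small sketch. This holds on any Python version, because "\uafb8\u11a7" genuinely is in NFC form (U+11A7 has a T index of 0 and must not be composed per UAX #15):

```python
from unicodedata import normalize

s = "---\uafb8\u11a7---"
# The NFC quick-check answers "yes" for this string, so normalize()
# returns it unchanged without running the full algorithm
assert normalize("NFC", s) == s
```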
| msg264698 - (view) | Author: Armin Rigo (arigo) | Date: 2016-05-03 09:01 |
Note: the examples can also be written in this clearer way on Python 3:
```python
>>> from unicodedata import normalize
>>> print(ascii(normalize("NFC", "---\uafb8\u11a7---")))
'---\uafb8\u11a7---'
>>> print(ascii(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1")))
'---\uafb8---\u393a'
```
| msg264704 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 09:25 |
Extract of unicodedata_UCD_normalize_impl():
```c
if (strcmp(form, "NFC") == 0) {
    if (is_normalized(self, input, 1, 0)) {
        Py_INCREF(input);
        return input;
    }
    return nfc_nfkc(self, input, 0);
}
```
is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also false for "\uafb8\u11a7\U0002f8a1").
unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged because is_normalized() is true.
unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": U+afb8 is decomposed to {U+1101, U+116e}.
unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) returns "\uafb8": it is the result of Hangul Composition. {U+1101, U+116e, U+11a7} is composed to {U+afb8}.
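These steps can be replayed directly. Note that the NFC assertion below reflects the behavior after the fix from issue29456 (Python 3.8+), where the trailing U+11a7 correctly survives recomposition; the code discussed in this message dropped it:

```python
from unicodedata import normalize

# NFD splits the LV syllable U+AFB8 into its L and V jamo;
# U+11A7 has no decomposition and stays as-is
assert normalize("NFD", "\uafb8\u11a7") == "\u1101\u116e\u11a7"

# Recomposition fuses L+V back into U+AFB8; U+11A7 (T index 0,
# "no trailing consonant") must be left alone per UAX #15
assert normalize("NFC", "\u1101\u116e\u11a7") == "\uafb8\u11a7"
```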
It may be an issue in the "quickcheck" property of the Python Unicode database. Format of this field:
```c
/* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
   as described in http://unicode.org/reports/tr15/#Annex8. */
quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
```
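Since Python 3.8 this quick-check is exposed as unicodedata.is_normalized(), which makes the values involved here easy to inspect:

```python
import unicodedata

# "\uafb8\u11a7" passes the NFC quick-check: U+AFB8 is a precomposed
# syllable and U+11A7's NFC quick-check property is "Yes"
assert unicodedata.is_normalized("NFC", "\uafb8\u11a7")

# U+2F8A1 is a CJK compatibility ideograph with a singleton canonical
# decomposition to U+393A, so its NFC quick-check property is "No"
assert not unicodedata.is_normalized("NFC", "\U0002f8a1")
```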
| msg264706 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 09:35 |
I tested http://minaret.info/test/normalize.msp

(1) 꾸ᆧ (afb8 11a7) --NFC or NFKC--> 꾸ᆧ (afb8, 11a7) === same as Python
    꾸ᆧ (afb8 11a7) --NFD or NFKD--> 꾸ᆧ (1101 116e, 11a7) === same as Python

(2) 꾸ᆧ (1101 116e 11a7) --NFC or NFKC--> 꾸 (afb8) === same as Python
    꾸ᆧ (1101 116e 11a7) --NFD or NFKD--> 꾸ᆧ (1101 116e, 11a7) === same as Python

(3) 꾸ᆧ㤺 (afb8 11a7 2f8a1) --NFC or NFKC--> 꾸ᆧ㤺 (afb8, 11a7, 393a) === DIFFERENT from Python, which eats the U+11a7 character
    꾸ᆧ㤺 (afb8 11a7 2f8a1) --NFD or NFKD--> 꾸ᆧ㤺 (1101 116e, 11a7, 393a) === same as Python
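The NFD/NFKD rows, which agree between the two implementations, can be verified on any Python version (the bug only affected the recomposition step):

```python
from unicodedata import normalize

# U+AFB8 decomposes to L+V jamo, U+11A7 stays, U+2F8A1 maps to U+393A
assert normalize("NFD", "\uafb8\u11a7\U0002f8a1") == "\u1101\u116e\u11a7\u393a"
```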
| msg264707 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 10:01 |
Extract of nfc_nfkc():
```c
/* Hangul Composition. We don't need to check for <LV,T>
   pairs, since we always have decomposed data. */
code = PyUnicode_READ(kind, data, i);
if (LBase <= code && code < (LBase+LCount) &&
    i + 1 < len &&
    VBase <= PyUnicode_READ(kind, data, i+1) &&
    PyUnicode_READ(kind, data, i+1) <= (VBase+VCount)) {
    int LIndex, VIndex;
    LIndex = code - LBase;
    VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
    code = SBase + (LIndex*VCount+VIndex)*TCount;
    i+=2;
    if (i < len &&
        TBase <= PyUnicode_READ(kind, data, i) &&
        PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
        code += PyUnicode_READ(kind, data, i)-TBase;
        i++;
    }
    output[o++] = code;
    continue;
}
```
With the input string {U+1101, U+116e, U+11a7}, we get:

* LIndex = 1
* VIndex = 13

```
code = SBase + (LIndex*VCount+VIndex)*TCount + (ch3 - TBase)
     = 0xAC00 + (1 * 21 + 13) * 28 + 0
     = 0xafb8
```
Constants:
* LBase = 0x1100, LCount = 19
* VBase = 0x1161, VCount = 21
* TBase = 0x11A7, TCount = 28
* SBase = 0xAC00
The problem may be that we consume the 3rd character even though (ch3 - TBase) is equal to 0, so it contributes nothing to the composed code point.
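To make the failure mode concrete, here is a rough Python transcription of the quoted C loop (an illustrative sketch for jamo-only input, not the real implementation). Because U+11A7 equals TBase, the inner branch consumes it while adding 0 to the code point:

```python
SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
LCount, VCount, TCount = 19, 21, 28

def compose_hangul(chars):
    out, i = [], 0
    while i < len(chars):
        code = ord(chars[i])
        if (LBase <= code < LBase + LCount and i + 1 < len(chars)
                and VBase <= ord(chars[i + 1]) <= VBase + VCount):
            LIndex = code - LBase
            VIndex = ord(chars[i + 1]) - VBase
            code = SBase + (LIndex * VCount + VIndex) * TCount
            i += 2
            if (i < len(chars)
                    and TBase <= ord(chars[i]) <= TBase + TCount):
                code += ord(chars[i]) - TBase  # adds 0 when chars[i] is U+11A7
                i += 1
            out.append(chr(code))
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

# U+11A7 is silently swallowed: three characters in, one character out
assert compose_hangul("\u1101\u116e\u11a7") == "\uafb8"
```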
| msg264708 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 10:02 |
Attached patch changes Hangul Composition. I'm not sure that it is correct.
| msg264711 - (view) | Author: Armin Rigo (arigo) | Date: 2016-05-03 10:29 |
See also https://bitbucket.org/pypy/pypy/issues/2289/incorrect-unicode-normalization . It seems that you reached the same conclusion as the OP in that issue: the problem is really that normalizing "\uafb8\u11a7" should not drop the second character. Both CPython and PyPy drop it, but CPython adds the is_normalized() check, so in some cases it returns the correct, unmodified result.
| msg314067 - (view) | Author: Ronan Lamy (Ronan.Lamy) | Date: 2018-03-18 23:24 |
Victor's patch is correct. I implemented the same fix in PyPy in https://bitbucket.org/pypy/pypy/commits/92b4fb5b9e58
| msg314076 - (view) | Author: Ma Lin (malin) | Date: 2018-03-19 02:45 |
> Victor's patch is correct.

I'm afraid you are wrong. Please see PR 1958 in issue29456; IMO this PR can be merged.
| msg319803 - (view) | Author: Ma Lin (malin) | Date: 2018-06-17 02:50 |
This issue can be closed: it was already fixed in issue29456. Also, PyPy's current code is correct.
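A sanity check on a fixed interpreter (Python 3.8+): both code paths now agree, and U+11a7 survives normalization:

```python
from unicodedata import normalize

# Quick-check fast path: the string is already NFC, returned unchanged
assert normalize("NFC", "---\uafb8\u11a7---") == "---\uafb8\u11a7---"
# Slow path (forced by U+2F8A1): U+11A7 is no longer dropped
assert normalize("NFC", "---\uafb8\u11a7---\U0002f8a1") == "---\uafb8\u11a7---\u393a"
```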
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:30 | admin | set | github: 71104 |
| 2018-06-17 17:10:10 | benjamin.peterson | set | status: open -> closed; resolution: fixed; stage: resolved |
| 2018-06-17 02:50:08 | malin | set | messages: + msg319803 |
| 2018-03-19 02:45:21 | malin | set | nosy: + malin; messages: + msg314076 |
| 2018-03-18 23:24:10 | Ronan.Lamy | set | nosy: + Ronan.Lamy; messages: + msg314067 |
| 2016-05-03 10:29:22 | arigo | set | messages: + msg264711 |
| 2016-05-03 10:02:11 | vstinner | set | files: + hangul_composition.patch; keywords: + patch; messages: + msg264708 |
| 2016-05-03 10:01:41 | vstinner | set | messages: + msg264707 |
| 2016-05-03 09:36:18 | vstinner | set | title: Inconsistency in unicodedata.normalize()? -> unicodedata.normalize(): bug in Hangul Composition |
| 2016-05-03 09:35:40 | vstinner | set | messages: + msg264706 |
| 2016-05-03 09:25:45 | vstinner | set | messages: + msg264704 |
| 2016-05-03 09:01:37 | arigo | set | messages: + msg264698 |
| 2016-05-03 08:48:27 | arigo | create | |
