Issue26917
Created on 2016-05-03 08:48 by arigo, last changed 2022-04-11 14:58 by admin. This issue is now closed.
| Files | | | |
|---|---|---|---|
| File name | Uploaded | Description | Edit |
| hangul_composition.patch | vstinner, 2016-05-03 10:02 | | review |
| Messages (10) | | |
|---|---|---|
| msg264697 - (view) | Author: Armin Rigo (arigo) | Date: 2016-05-03 08:48 |
There is an apparent inconsistency in unicodedata.normalize("NFC"), introduced with the switch from Unicode DB 5.1.0 to 5.2.0 (in Python 2.7). First, please note that my knowledge of Unicode is limited, so I may be wrong and the following behavior might be perfectly correct.
```python
>>> from unicodedata import normalize
>>> print(normalize("NFC", "---\uafb8\u11a7---").encode('utf-8'))
b'---\xea\xbe\xb8\xe1\x86\xa7---'    # i.e., the same as the input
>>> print(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1").encode('utf-8'))
b'---\xea\xbe\xb8---\xe3\xa4\xba'
```
Note how in the second example the initial two-character part is replaced with a single character (actually the first of them). This does not occur in the first example. In Python 2.6, both inputs would be normalized to the single-character output.
The new behavior introduced in Python 2.7 is to first run a quick check on the string: if this `is_normalized()` function returns 1, we know that the string should already be normalized and we return it unmodified. However, the example "\uafb8\u11a7" shows contradictory behavior: is_normalized() returns 1 for it, yet actual normalization changes it. We can see in the second example above that if, for an unrelated reason, we force is_normalized() to return 0 (by adding a non-normalized character elsewhere in the string), then the "\uafb8\u11a7" part is changed.
This is a bit unexpected, but I don't know if it is officially correct behavior or if the problem is a bug in `is_normalized()`.
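The fast path can be checked directly with a small sketch. This holds on any Python version, because "\uafb8\u11a7" genuinely is in NFC form (U+11A7 has a T index of 0 and must not be composed per UAX #15):

```python
from unicodedata import normalize

s = "---\uafb8\u11a7---"
# The NFC quick-check answers "yes" for this string, so normalize()
# returns it unchanged without running the full algorithm
assert normalize("NFC", s) == s
```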
| msg264698 - (view) | Author: Armin Rigo (arigo) | Date: 2016-05-03 09:01 |
Note: the examples can also be written in this clearer way on Python 3:
```python
>>> from unicodedata import normalize
>>> print(ascii(normalize("NFC", "---\uafb8\u11a7---")))
'---\uafb8\u11a7---'
>>> print(ascii(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1")))
'---\uafb8---\u393a'
```
| msg264704 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 09:25 |
Extract of unicodedata_UCD_normalize_impl():
```c
if (strcmp(form, "NFC") == 0) {
    if (is_normalized(self, input, 1, 0)) {
        Py_INCREF(input);
        return input;
    }
    return nfc_nfkc(self, input, 0);
}
```
is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also false for "\uafb8\u11a7\U0002f8a1").
unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged because is_normalized() is true.
unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": U+afb8 is decomposed to {U+1101, U+116e}.
unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) returns "\uafb8": it is the result of Hangul Composition. {U+1101, U+116e, U+11a7} is composed to {U+afb8}.
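These steps can be replayed directly. Note that the NFC assertion below reflects the behavior after the fix from issue29456 (Python 3.8+), where the trailing U+11a7 correctly survives recomposition; the code discussed in this message dropped it:

```python
from unicodedata import normalize

# NFD splits the LV syllable U+AFB8 into its L and V jamo;
# U+11A7 has no decomposition and stays as-is
assert normalize("NFD", "\uafb8\u11a7") == "\u1101\u116e\u11a7"

# Recomposition fuses L+V back into U+AFB8; U+11A7 (T index 0,
# "no trailing consonant") must be left alone per UAX #15
assert normalize("NFC", "\u1101\u116e\u11a7") == "\uafb8\u11a7"
```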
It may be an issue in the "quickcheck" property of the Python Unicode database. Format of this field:
```c
/* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
   as described in http://unicode.org/reports/tr15/#Annex8. */
quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
```
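Since Python 3.8 this quick-check is exposed as unicodedata.is_normalized(), which makes the values involved here easy to inspect:

```python
import unicodedata

# "\uafb8\u11a7" passes the NFC quick-check: U+AFB8 is a precomposed
# syllable and U+11A7's NFC quick-check property is "Yes"
assert unicodedata.is_normalized("NFC", "\uafb8\u11a7")

# U+2F8A1 is a CJK compatibility ideograph with a singleton canonical
# decomposition to U+393A, so its NFC quick-check property is "No"
assert not unicodedata.is_normalized("NFC", "\U0002f8a1")
```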
| msg264706 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 09:35 |
I tested http://minaret.info/test/normalize.msp

(1) 꾸ᆧ (afb8 11a7) --NFC or NFKC--> 꾸ᆧ (afb8, 11a7) === same as Python
    꾸ᆧ (afb8 11a7) --NFD or NFKD--> 꾸ᆧ (1101 116e, 11a7) === same as Python

(2) 꾸ᆧ (1101 116e 11a7) --NFC or NFKC--> 꾸 (afb8) === same as Python
    꾸ᆧ (1101 116e 11a7) --NFD or NFKD--> 꾸ᆧ (1101 116e, 11a7) === same as Python

(3) 꾸ᆧ㤺 (afb8 11a7 2f8a1) --NFC or NFKC--> 꾸ᆧ㤺 (afb8, 11a7, 393a) === DIFFERENT from Python, which eats the U+11a7 character
    꾸ᆧ㤺 (afb8 11a7 2f8a1) --NFD or NFKD--> 꾸ᆧ㤺 (1101 116e, 11a7, 393a) === same as Python
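The NFD/NFKD rows, which agree between the two implementations, can be verified on any Python version (the bug only affected the recomposition step):

```python
from unicodedata import normalize

# U+AFB8 decomposes to L+V jamo, U+11A7 stays, U+2F8A1 maps to U+393A
assert normalize("NFD", "\uafb8\u11a7\U0002f8a1") == "\u1101\u116e\u11a7\u393a"
```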
| msg264707 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 10:01 |
Extract of nfc_nfkc():
```c
/* Hangul Composition. We don't need to check for <LV,T>
   pairs, since we always have decomposed data. */
code = PyUnicode_READ(kind, data, i);
if (LBase <= code && code < (LBase+LCount) &&
    i + 1 < len &&
    VBase <= PyUnicode_READ(kind, data, i+1) &&
    PyUnicode_READ(kind, data, i+1) <= (VBase+VCount)) {
    int LIndex, VIndex;
    LIndex = code - LBase;
    VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
    code = SBase + (LIndex*VCount+VIndex)*TCount;
    i+=2;
    if (i < len &&
        TBase <= PyUnicode_READ(kind, data, i) &&
        PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
        code += PyUnicode_READ(kind, data, i)-TBase;
        i++;
    }
    output[o++] = code;
    continue;
}
```
With the input string {U+1101, U+116e, U+11a7}, we get:

* LIndex = 1
* VIndex = 13

```
code = SBase + (LIndex*VCount+VIndex)*TCount + (ch3 - TBase)
     = 0xAC00 + (1 * 21 + 13) * 28 + 0
     = 0xafb8
```
Constants:
* LBase = 0x1100, LCount = 19
* VBase = 0x1161, VCount = 21
* TBase = 0x11A7, TCount = 28
* SBase = 0xAC00
The problem may be that we consume the 3rd character even though (ch3 - TBase) is equal to 0, so it contributes nothing to the composed code point.
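To make the failure mode concrete, here is a rough Python transcription of the quoted C loop (an illustrative sketch for jamo-only input, not the real implementation). Because U+11A7 equals TBase, the inner branch consumes it while adding 0 to the code point:

```python
SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
LCount, VCount, TCount = 19, 21, 28

def compose_hangul(chars):
    out, i = [], 0
    while i < len(chars):
        code = ord(chars[i])
        if (LBase <= code < LBase + LCount and i + 1 < len(chars)
                and VBase <= ord(chars[i + 1]) <= VBase + VCount):
            LIndex = code - LBase
            VIndex = ord(chars[i + 1]) - VBase
            code = SBase + (LIndex * VCount + VIndex) * TCount
            i += 2
            if (i < len(chars)
                    and TBase <= ord(chars[i]) <= TBase + TCount):
                code += ord(chars[i]) - TBase  # adds 0 when chars[i] is U+11A7
                i += 1
            out.append(chr(code))
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

# U+11A7 is silently swallowed: three characters in, one character out
assert compose_hangul("\u1101\u116e\u11a7") == "\uafb8"
```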
| msg264708 - (view) | Author: STINNER Victor (vstinner) | Date: 2016-05-03 10:02 |
Attached patch changes Hangul Composition. I'm not sure that it is correct.
| msg264711 - (view) | Author: Armin Rigo (arigo) | Date: 2016-05-03 10:29 |
See also https://bitbucket.org/pypy/pypy/issues/2289/incorrect-unicode-normalization . It seems that you reached the same conclusion as the OP in that issue: the problem is really that normalizing "\uafb8\u11a7" should not drop the second character. Both CPython and PyPy drop it, but CPython adds the is_normalized() check, so in some cases it returns the correct, unmodified result.
| msg314067 - (view) | Author: Ronan Lamy (Ronan.Lamy) | Date: 2018-03-18 23:24 |
Victor's patch is correct. I implemented the same fix in PyPy in https://bitbucket.org/pypy/pypy/commits/92b4fb5b9e58
| msg314076 - (view) | Author: Ma Lin (malin) | Date: 2018-03-19 02:45 |
> Victor's patch is correct.

I'm afraid you are wrong. Please see PR 1958 in issue29456; IMO this PR can be merged.
| msg319803 - (view) | Author: Ma Lin (malin) | Date: 2018-06-17 02:50 |
This issue can be closed: it was already fixed in issue29456. Also, PyPy's current code is correct.
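A sanity check on a fixed interpreter (Python 3.8+): both code paths now agree, and U+11a7 survives normalization:

```python
from unicodedata import normalize

# Quick-check fast path: the string is already NFC, returned unchanged
assert normalize("NFC", "---\uafb8\u11a7---") == "---\uafb8\u11a7---"
# Slow path (forced by U+2F8A1): U+11A7 is no longer dropped
assert normalize("NFC", "---\uafb8\u11a7---\U0002f8a1") == "---\uafb8\u11a7---\u393a"
```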
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:30 | admin | set | github: 71104 |
| 2018-06-17 17:10:10 | benjamin.peterson | set | status: open -> closed; resolution: fixed; stage: resolved |
| 2018-06-17 02:50:08 | malin | set | messages: + msg319803 |
| 2018-03-19 02:45:21 | malin | set | nosy: + malin; messages: + msg314076 |
| 2018-03-18 23:24:10 | Ronan.Lamy | set | nosy: + Ronan.Lamy; messages: + msg314067 |
| 2016-05-03 10:29:22 | arigo | set | messages: + msg264711 |
| 2016-05-03 10:02:11 | vstinner | set | files: + hangul_composition.patch; keywords: + patch; messages: + msg264708 |
| 2016-05-03 10:01:41 | vstinner | set | messages: + msg264707 |
| 2016-05-03 09:36:18 | vstinner | set | title: Inconsistency in unicodedata.normalize()? -> unicodedata.normalize(): bug in Hangul Composition |
| 2016-05-03 09:35:40 | vstinner | set | messages: + msg264706 |
| 2016-05-03 09:25:45 | vstinner | set | messages: + msg264704 |
| 2016-05-03 09:01:37 | arigo | set | messages: + msg264698 |
| 2016-05-03 08:48:27 | arigo | create | |
