Issue 33881: dataclasses should use NFKC to find duplicate members

Issue 33881: dataclasses should use NFKC to find duplicate members

Issue33881

Created on 2018-06-16 18:02 by eric.smith, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 7916	closed	valer, 2018-06-25 21:58
PR 20837	closed	thatiparthy, 2020-06-12 18:49

Messages (8)
msg319766 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2018-06-16 18:02
See issue 33880 for the same issue with namedtuple. This shows up on dataclasses only through make_dataclass. This is an expected ValueError: >>> make_dataclass('a', ['a', 'b', 'c', 'a']) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 1123, in make_dataclass raise TypeError(f'Field name duplicated: {name!r}') TypeError: Field name duplicated: 'a' But this is a SyntaxError: >>> make_dataclass('a', ['\u00b5', '\u03bc']) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 1133, in make_dataclass unsafe_hash=unsafe_hash, frozen=frozen) File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 958, in dataclass return wrap(_cls) File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 950, in wrap return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen) File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 871, in _process_class else 'self', File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 490, in _init_fn return_type=None) File "/cygdrive/c/home/eric/.local/lib/python3.6/site-packages/dataclasses.py", line 356, in _create_fn exec(txt, globals, locals) File "<string>", line 1 SyntaxError: duplicate argument 'μ' in function definition
msg320446 - (view)	Author: Valeriya Sinevich (valer) *	Date: 2018-06-25 22:01
I sent a PR and measured how it affected the performance by creating a dataclass with 10000 members. python.bat -m timeit -s "from dataclasses import make_dataclass" -s "arg_list = [chr(k) * i for k in range(97, 123) for i in range(1, 500)]" "make_dataclass('a', arg_list)" The performance didn't change, both before and after the changes it takes around 14.5s.
msg320454 - (view)	Author: Steve Dower (steve.dower) *	Date: 2018-06-25 23:21
The benchmark may not be triggering that much work. NFKC normalization only applies for characters outside of the basic Latin characters (0-255). I ran the below benchmarks and saw a huge difference. Granted, it's a very degenerate case with collections this big, but it appears to be linear with len(NAMES), suggesting that the normalization is the expensive part. >>> CHRS=[c for c in (chr(i) for i in range(65535)) if c.isidentifier()] >>> def makename(): ... return ''.join(random.choice(CHRS) for _ in range(10)) ... >>> NAMES = [makename() for _ in range(10000)] >>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000) 38.04007526000004 >>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))', globals=globals(), number=100000) 820.2586788580002 I wonder if it's better to catch the SyntaxError and do the check there? That way we don't really have a performance impact, since it's only going to show up in exceptional cases anyway.
msg320455 - (view)	Author: Steve Dower (steve.dower) *	Date: 2018-06-25 23:23
> outside of the basic Latin characters (0-255) This should be (0-127). The Latin characters in 128-255 are considered extended ones, and may be decomposed by normalization.
msg371356 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2020-06-12 12:00
The pull request that was opened for this has been closed due to inactivity, so this issue is available for anyone to work on.
msg371424 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-06-12 22:19
I recommend leaving the code as-is. AFAICT, the situation almost never arises in practice, and if it did, the existing SyntaxError is clear. Given that the rest of that language uses a SyntaxError, there isn't a net benefit for switching to TypeError: >>> def f(µ=1, μ=2): pass SyntaxError: duplicate argument 'μ' in function definition >>> def f(**kwds): return kwds >>> f(µ=1, μ=2) SyntaxError: keyword argument repeated
msg371435 - (view)	Author: Srinivas Reddy Thatiparthy(శ్రీనివాస్ రెడ్డి తాటిపర్తి) (thatiparthy) *	Date: 2020-06-13 04:28
I am convinced by Raymond’s argument. Hence closing the PR.
msg371463 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-06-13 16:42
I agree with Raymond. There's no real harm being caused here.

History
Date	User	Action	Args
2022-04-11 14:59:01	admin	set	github: 78062
2020-06-13 16:42:00	eric.smith	set	status: open -> closed resolution: wont fix messages: + msg371463 stage: patch review -> resolved
2020-06-13 04:28:47	thatiparthy	set	messages: + msg371435
2020-06-12 22:19:45	rhettinger	set	nosy: + rhettinger messages: + msg371424
2020-06-12 18:49:29	thatiparthy	set	nosy: + thatiparthy pull_requests: + pull_request20031
2020-06-12 12:00:51	cheryl.sabella	set	nosy: + cheryl.sabella messages: + msg371356 versions: + Python 3.8, Python 3.9, Python 3.10, - Python 3.7
2018-06-25 23:23:09	steve.dower	set	messages: + msg320455
2018-06-25 23:21:33	steve.dower	set	messages: + msg320454
2018-06-25 22:01:48	valer	set	nosy: + valer, steve.dower messages: + msg320446
2018-06-25 21:58:52	valer	set	keywords: + patch stage: patch review pull_requests: + pull_request7519
2018-06-16 19:08:50	eric.smith	set	title: dataclasses should use NFKD to find duplicate members -> dataclasses should use NFKC to find duplicate members
2018-06-16 18:02:08	eric.smith	create