Duplicate entry in 'Objects/unicodetype_db.h' · Issue #91399 · python/cpython
This one is so tiny that I'm not really sure we want to merge it…
=== Problem ===
[Objects/unicodetype_db.h](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h) starts in the following way:

    /* a list of unique character type descriptors */
    const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {
        {0, 0, 0, 0, 0, 0},
        {0, 0, 0, 0, 0, 0},
        {0, 0, 0, 0, 0, 32},
        {0, 0, 0, 0, 0, 48},
        …

The 1st record ({0, 0, 0, 0, 0, 0}) is duplicated.
This is not a problem, since the 1st occurrence is never used, but if we want to remove it, this is the ticket for it.
=== Detailed description ===
[Objects/unicodetype_db.h](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h) is generated by [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) (I removed irrelevant lines):

    def makeunicodetype(unicode, trace):
        dummy = (0, 0, 0, 0, 0, 0)
        table = [dummy]         # (1)
        cache = {0: dummy}      # (2)
        for char in unicode.chars:
            # Things…
            item = (upper, lower, title, decimal, digit, flags)
            i = cache.get(item) # (3)
            if i is None:
                cache[item] = i = len(table)
                table.append(item)
            index[char] = i

- (1) - list which contains unique character properties (as (upper, lower, title, decimal, digit, flags) tuples)
- (2) - mapping from character properties to index in table - improperly initialized as a mapping from index to character properties
- (3) - we check if the current tuple is in cache
=== Result ===
The first time we get to a character that has (0, 0, 0, 0, 0, 0) properties (which is code point 0 - NULL), we check if it is in cache. It is not (the only entry there maps index 0 to (0, 0, 0, 0, 0, 0), which is the other way around), so we append a second copy of this entry to table and add it to cache.
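To make this easier to check, here is a minimal standalone sketch of the same lookup logic. It is not the real generator: the `items` list is a hypothetical sample I made up, and `index` is a plain list here instead of being keyed by code point.

```python
dummy = (0, 0, 0, 0, 0, 0)
table = [dummy]
cache = {0: dummy}  # keyed by index, as in line (2) of the generator

# Hypothetical sample of property tuples, in the order they might be seen.
items = [(0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 32), (0, 0, 0, 0, 0, 0)]

index = []
for item in items:
    i = cache.get(item)  # looks up a tuple, but the only key so far is the int 0
    if i is None:
        cache[item] = i = len(table)
        table.append(item)
    index.append(i)

print(table)  # the all-zero tuple shows up at positions 0 and 1
print(index)  # position 0 is never referenced
```

With this sample the all-zero tuple is appended once more at position 1, and every later lookup of it resolves to 1, so position 0 stays unreferenced.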
=== Fix ===
In line (2) we should have: cache = {dummy: 0}. Obviously, after doing so we have to rerun makeunicodedata.py, which is why this simple change modifies a lot of lines.
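A rough way to compare both initializations side by side, again as a simplified sketch rather than the real script. The `dedupe` helper, its `fixed` flag and the `sample` list are all hypothetical names used only for illustration.

```python
def dedupe(items, fixed):
    """Simplified stand-in for the deduplication loop in makeunicodetype."""
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]
    # fixed=True uses the proposed {properties: index} mapping,
    # fixed=False uses the current {index: properties} mapping.
    cache = {dummy: 0} if fixed else {0: dummy}
    index = []
    for item in items:
        i = cache.get(item)
        if i is None:
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)
    return table, index


# Hypothetical sample of property tuples.
sample = [(0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 32), (0, 0, 0, 0, 0, 0)]

old_table, old_index = dedupe(sample, fixed=False)
new_table, new_index = dedupe(sample, fixed=True)
print(len(old_table), old_index)
print(len(new_table), new_index)
```

In this sketch the fixed variant produces a table that is one entry shorter, and the all-zero tuple resolves to position 0 instead of a second copy.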
I will submit a PR on GitHub in just a sec…