Duplicate entry in 'Objects/unicodetype_db.h' · Issue #91399 · python/cpython

This one is so tiny that I'm not really sure we want to merge it…

=== Problem ===

[Objects/unicodetype_db.h](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h) starts as follows:

/* a list of unique character type descriptors */
const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 32},
    {0, 0, 0, 0, 0, 48},
    …

The first record ({0, 0, 0, 0, 0, 0}) is duplicated.
This is harmless, since the first occurrence is never used, but if we want to remove it, this is the ticket for that.

=== Detailed description ===

[Objects/unicodetype_db.h](https://github.com/python/cpython/blob/main/Objects/unicodetype_db.h) is generated by [Tools/unicode/makeunicodedata.py](https://github.com/python/cpython/blob/main/Tools/unicode/makeunicodedata.py) (I removed irrelevant lines):

def makeunicodetype(unicode, trace):
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy] # (1)
    cache = {0: dummy} # (2)

    for char in unicode.chars:
        # Things…

        item = (upper, lower, title, decimal, digit, flags)

        i = cache.get(item) # (3)
        if i is None:
            cache[item] = i = len(table)
            table.append(item)

        index[char] = i
  • (1) - list of unique character properties (as (upper, lower, title, decimal, digit, flags) tuples)
  • (2) - meant to be a mapping from character properties to their index in table, but improperly initialized as a mapping from index to character properties
  • (3) - we check whether the current tuple is already in cache

=== Result ===

The first time we reach a character whose properties are (0, 0, 0, 0, 0, 0) (which is code point 0 - NULL), we check whether that tuple is in cache. It is not: the only entry maps index 0 to (0, 0, 0, 0, 0, 0), which is the other way around. So we append a second copy of the tuple to table and add it to cache.
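A minimal sketch of the behaviour described above, reduced to just the caching loop. The function name `build_table_buggy` and the four sample `records` are made up for illustration; they stand in for the per-character tuples that makeunicodedata.py computes, and `index` is a plain list here rather than whatever the real script uses.

```python
DUMMY = (0, 0, 0, 0, 0, 0)


def build_table_buggy(records):
    """Deduplicate records the way the quoted loop does, bug included."""
    table = [DUMMY]
    cache = {0: DUMMY}  # bug: maps index -> record instead of record -> index
    index = []

    for item in records:
        i = cache.get(item)  # never finds DUMMY, because the key is 0
        if i is None:
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)

    return table, index


# Hypothetical input: the first record has all-zero properties, like NULL.
records = [DUMMY, (0, 0, 0, 0, 0, 32), DUMMY, (0, 0, 0, 0, 0, 48)]
table, index = build_table_buggy(records)

for row in table:
    print(row)
print(index)
```

With this input, `table` holds the all-zero tuple twice and no element of `index` is 0, which matches the observation that the first occurrence is never used.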

=== Fix ===

On line (2) we should have: cache = {dummy: 0}. After making that change we have to rerun makeunicodedata.py - this is why this simple fix modifies a lot of lines.
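The effect of the corrected initialization can be sketched in isolation. As before, this is only a reduced model of the loop: `build_table_fixed` and the sample `records` are hypothetical names and data, not part of makeunicodedata.py.

```python
DUMMY = (0, 0, 0, 0, 0, 0)


def build_table_fixed(records):
    """Same loop as in the issue, with cache keyed by record."""
    table = [DUMMY]
    cache = {DUMMY: 0}  # record -> index, so the dummy entry is found
    index = []

    for item in records:
        i = cache.get(item)
        if i is None:
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)

    return table, index


# Hypothetical input: the first record has all-zero properties, like NULL.
records = [DUMMY, (0, 0, 0, 0, 0, 32), DUMMY, (0, 0, 0, 0, 0, 48)]
table, index = build_table_fixed(records)

for row in table:
    print(row)
print(index)
```

In this sketch the all-zero tuple appears once, and every other record lands one slot earlier than it would with the duplicate present. If the generated header stores these indices, that shift would be consistent with the regenerated file differing on many lines.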

I will submit a PR on GitHub in just a sec…