I stand by that comment: IsWhiteSpace should use the Unicode White_Space property. Since FS/GS/RS/US are not in the White_Space property, it's correct that the int conversion fails. It's incorrect that .isspace() gives true.
There are really several bugs here:
- .isspace doesn't use the White_List property
- int conversion ultimately uses Py_ISSPACE, which conceptually could deviate from the Unicode properties (as it is byte-based). This is not really an issue, since they indeed match.
I propose to fix this by parsing PropList.txt, and generating _PyUnicode_IsWhitespace based on the White_Space property. For efficiency, it should also generate a fast-lookup array for the ASCII case, or just use _Py_ctype_table (with a comment that this table needs to match PropList White_Space). _Py_ascii_whitespace should go.
Contributions are welcome. |