gh-93033: Use wmemchr in find_char and replace_1char_inplace by goldsteinn · Pull Request #93034 · python/cpython

Performance is generally comparable in the "good" case, where memchr
doesn't return any collisions (false matches on the low byte), and
clearly faster when there are collisions.
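
To make "collision" concrete, here is a minimal Python model of a byte-wise scan for the needle's low byte over 4-byte code points. It is only an illustration under stated assumptions, not the C code in stringlib: `low_byte_scan` is a hypothetical name, `bytes.find` stands in for memchr, and details of the real implementation are left out.

```python
def low_byte_scan(text, ch):
    """Simplified model of a memchr-style search over UCS4 data.

    Returns (index, false_matches). A false match is a byte equal to the
    needle's low byte that sits inside a code point other than the needle.
    """
    buf = text.encode("utf-32-le")        # 4 bytes per code point, no BOM
    low = bytes([ord(ch) & 0xFF])         # low byte of the needle
    false_matches = 0
    pos = 0
    while True:
        hit = buf.find(low, pos)          # stand-in for memchr
        if hit < 0:
            return -1, false_matches
        index = hit // 4                  # align down to the code point
        if text[index] == ch:
            return index, false_matches
        false_matches += 1                # collision: only a byte matched
        pos = (index + 1) * 4             # resume after the false match


# Same haystack shape as the "Long" benchmarks.
block = "".join(chr(0x10200 + i) for i in range(16))
haystack = block * 200 + "\U00018200"

print(low_byte_scan(haystack, "\U00018210"))  # (-1, 0)
print(low_byte_scan(haystack, "\U00018208"))  # (-1, 200)
```

In this model, searching for U+18210 never stops on a false match, while U+18208 collides with U+10208 once per 16-character block. A search that compares whole characters, as wmemchr does, has no such intermediate stops.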

Whether wchar_t is signed or unsigned shouldn't matter here, BUT
wmemchr (along with just about all the other wide-char string
functions) can, and often does (on x86_64, for example), assume that
its input is aligned to sizeof(wchar_t). If that is not the case for
Py_UCS{2|4} data, then this patch is broken.
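
As a rough sanity check of the alignment concern above, the sketch below uses ctypes to compare the size and alignment of the platform's wchar_t with 16-bit and 32-bit unsigned integers, which is how CPython's headers define Py_UCS2 and Py_UCS4. `same_layout_as_wchar` is a hypothetical helper. This only compares type-level layout; it says nothing about the address of any particular string buffer.

```python
import ctypes


def same_layout_as_wchar(ctype):
    """True if `ctype` has the size and alignment of the platform wchar_t."""
    return (
        ctypes.sizeof(ctype) == ctypes.sizeof(ctypes.c_wchar)
        and ctypes.alignment(ctype) == ctypes.alignment(ctypes.c_wchar)
    )


# Assumption: c_uint16 / c_uint32 mirror Py_UCS2 / Py_UCS4 on this platform.
for name, ctype in (("Py_UCS2", ctypes.c_uint16), ("Py_UCS4", ctypes.c_uint32)):
    print(name, "matches wchar_t layout:", same_layout_as_wchar(ctype))
```

Only the character type whose layout matches wchar_t would be a candidate for wmemchr on a given platform.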

Also, I think the way I implemented #define STRINGLIB_FAST_MEMCHR for
ucs{2|4}lib breaks strict aliasing. If this is an issue but the patch
is otherwise fine, any suggestions for how to fix it?

$> ./python -m test -j4
...
== Tests result: SUCCESS ==

406 tests OK.

30 tests skipped:
    test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll
    test_idle test_ioctl test_kqueue test_launcher test_msilib
    test_nis test_ossaudiodev test_readline test_smtpnet
    test_socketserver test_sqlite3 test_startfile test_tcl test_tix
    test_tk test_ttk_guionly test_ttk_textonly test_turtle
    test_urllib2net test_urllibnet test_winconsoleio test_winreg
    test_winsound test_xmlrpc_net test_zipfile64
./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision
No wmemchr  : 1000 loops, best of 100: 127 nsec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision
No wmemchr  : 1000 loops, best of 100: 1.29 usec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision
No wmemchr  : 1000 loops, best of 100: 136 nsec per loop
With wmemchr: 1000 loops, best of 100: 130 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision
No wmemchr  : 1000 loops, best of 100: 1.35 usec per loop
With wmemchr: 1000 loops, best of 100: 131 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision
No wmemchr  : 1000 loops, best of 100: 50.2 nsec per loop
With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision
No wmemchr  : 1000 loops, best of 100: 69.1 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision
No wmemchr  : 1000 loops, best of 100: 53.6 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision
No wmemchr  : 1000 loops, best of 100: 69 nsec per loop
With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop
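
For a quick read of the numbers, the ratios can be computed directly from the timings above. This is just arithmetic on the reported values, converted to nanoseconds; the dictionary labels are shorthand for the trailing `##` comments on each command.

```python
# (without wmemchr, with wmemchr) in nanoseconds, copied from the timings above
timings_ns = {
    "Long, no match, no collision": (127.0, 123.0),
    "Long, no match, high collision": (1290.0, 123.0),
    "Long, match, no collision": (136.0, 130.0),
    "Long, match, high collision": (1350.0, 131.0),
    "Short, no match, no collision": (50.2, 52.9),
    "Short, no match, high collision": (69.1, 53.7),
    "Short, match, no collision": (53.6, 53.6),
    "Short, match, high collision": (69.0, 50.9),
}

# Ratio > 1 means the wmemchr build is faster.
speedups = {label: before / after for label, (before, after) in timings_ns.items()}

for label, ratio in speedups.items():
    print(f"{label}: {ratio:.2f}x")
```

The long high-collision cases come out a bit over 10x, the short high-collision cases at roughly 1.3x to 1.4x, and the no-collision cases stay close to 1x.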