[C API] Add an efficient public PyUnicodeWriter API
Feature or enhancement
Creating a Python string object efficiently is complicated. Python has a private _PyUnicodeWriter API for this. It is used by these projects:
Affected projects (5):
- Cython (3.0.9)
- asyncpg (0.29.0)
- catboost (1.2.3)
- frozendict (2.4.0)
- immutables (0.20)
I propose making the API public to promote it and to help C extension maintainers write more efficient code for creating Python string objects.
API:
    typedef struct PyUnicodeWriter PyUnicodeWriter;

    PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(void);
    PyAPI_FUNC(void) PyUnicodeWriter_Discard(PyUnicodeWriter *writer);
    PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);

    PyAPI_FUNC(void) PyUnicodeWriter_SetOverallocate(
        PyUnicodeWriter *writer,
        int overallocate);

    PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
        PyUnicodeWriter *writer,
        Py_UCS4 ch);
    PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
        PyUnicodeWriter *writer,
        const char *str,  // decoded from UTF-8
        Py_ssize_t len);  // use strlen() if len < 0
    PyAPI_FUNC(int) PyUnicodeWriter_Format(
        PyUnicodeWriter *writer,
        const char *format,
        ...);

    // Write str(obj)
    PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
        PyUnicodeWriter *writer,
        PyObject *obj);

    // Write repr(obj)
    PyAPI_FUNC(int) PyUnicodeWriter_WriteRepr(
        PyUnicodeWriter *writer,
        PyObject *obj);

    // Write str[start:end]
    PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
        PyUnicodeWriter *writer,
        PyObject *str,
        Py_ssize_t start,
        Py_ssize_t end);
The internal writer buffer is overallocated by default. PyUnicodeWriter_Finish() truncates the buffer to the exact size if the buffer was overallocated.
Overallocation avoids the quadratic cost of resizing the buffer on every write when adding short strings in a loop. Use PyUnicodeWriter_SetOverallocate(writer, 0) to disable overallocation just before the last write.
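To illustrate why overallocation matters, here is a minimal sketch of a growable byte buffer that counts how often it has to call realloc(). It is not the writer's implementation: the sketch_buf type, the sketch_* function names, and the 25% growth factor are all assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; NOT CPython's writer. */
typedef struct {
    char *data;
    size_t size;       /* bytes in use */
    size_t capacity;   /* bytes allocated */
    size_t reallocs;   /* number of realloc() calls so far */
    int overallocate;  /* non-zero: reserve extra room when growing */
} sketch_buf;

static void sketch_init(sketch_buf *b, int overallocate)
{
    b->data = NULL;
    b->size = 0;
    b->capacity = 0;
    b->reallocs = 0;
    b->overallocate = overallocate;
}

/* Append len bytes. Returns 0 on success, -1 on allocation failure. */
static int sketch_append(sketch_buf *b, const char *s, size_t len)
{
    if (len == 0) {
        return 0;
    }
    size_t needed = b->size + len;
    if (needed > b->capacity) {
        size_t newcap = needed;
        if (b->overallocate) {
            /* Assumed growth policy for this sketch: 25% extra. */
            newcap += needed / 4;
        }
        char *p = realloc(b->data, newcap);
        if (p == NULL) {
            return -1;
        }
        b->data = p;
        b->capacity = newcap;
        b->reallocs++;
    }
    memcpy(b->data + b->size, s, len);
    b->size = needed;
    return 0;
}

/* Append "ab" n times and return the number of realloc() calls,
   or (size_t)-1 on allocation failure. */
static size_t sketch_count_reallocs(size_t n, int overallocate)
{
    sketch_buf b;
    sketch_init(&b, overallocate);
    for (size_t i = 0; i < n; i++) {
        if (sketch_append(&b, "ab", 2) < 0) {
            free(b.data);
            return (size_t)-1;
        }
    }
    size_t count = b.reallocs;
    free(b.data);
    return count;
}
```

With exact-fit growth, sketch_count_reallocs(1000, 0) reallocates on every append, and each reallocation may copy the whole buffer. With overallocation enabled, the capacity grows geometrically, so the number of reallocations grows only logarithmically with the final size.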
The writer takes care of the internal buffer kind: Py_UCS1 (latin1), Py_UCS2 (BMP) or Py_UCS4 (full Unicode Character Set). It also implements an optimization if a single write is made using PyUnicodeWriter_WriteStr(): it returns the string unchanged without any copy.
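As a rough illustration of what choosing the buffer kind involves, the sketch below picks the narrowest character width that can hold a sequence of code points. The function name sketch_min_kind is hypothetical, and the code only shows the classification step; the real writer also has to widen its buffer when a later write contains a wider character.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the smallest per-character width in bytes (1, 2 or 4) that can
   store every code point in cp[0..n-1]. Hypothetical helper, not part of
   the proposed API. */
static int sketch_min_kind(const uint32_t *cp, size_t n)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++) {
        if (cp[i] > maxchar) {
            maxchar = cp[i];
        }
    }
    if (maxchar <= 0xFF) {
        return 1;  /* fits Py_UCS1 (latin1) */
    }
    if (maxchar <= 0xFFFF) {
        return 2;  /* fits Py_UCS2 (BMP) */
    }
    return 4;      /* needs Py_UCS4 */
}
```

A caller that builds strings by hand would have to track this maximum itself and convert the buffer when it changes; the proposed writer hides that bookkeeping.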
Example of usage (simplified code from Python/unionobject.c):
    static PyObject *
    union_repr(PyObject *self)
    {
        unionobject *alias = (unionobject *)self;
        Py_ssize_t len = PyTuple_GET_SIZE(alias->args);

        PyUnicodeWriter *writer = PyUnicodeWriter_Create();
        if (writer == NULL) {
            return NULL;
        }

        for (Py_ssize_t i = 0; i < len; i++) {
            if (i > 0 && PyUnicodeWriter_WriteUTF8(writer, " | ", 3) < 0) {
                goto error;
            }
            PyObject *p = PyTuple_GET_ITEM(alias->args, i);
            if (PyUnicodeWriter_WriteRepr(writer, p) < 0) {
                goto error;
            }
        }
        return PyUnicodeWriter_Finish(writer);

    error:
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
Linked PRs
- gh-119182: Add PyUnicodeWriter C API #119184
- gh-119396: Optimize PyUnicode_FromFormat() UTF-8 decoder #119398
- gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248
- gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307
- gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() #120639
- gh-119182: Optimize PyUnicode_FromFormat() #120796
- gh-119182: Use public PyUnicodeWriter API in union_repr() #120797
- gh-119182: Use public PyUnicodeWriter API in ga_repr() #120799
- gh-119182: Use public PyUnicodeWriter in contextvar_tp_repr() #120809
- gh-119182: Rewrite PyUnicodeWriter tests in Python #120845
- gh-119182: Add PyUnicodeWriter_WriteUCS4() function #120849
- gh-119182: Use PyUnicodeWriter_WriteWideChar() #120851
- gh-119182: Add checks to PyUnicodeWriter APIs #120870
- gh-119182: Complete PyUnicodeWriter documentation #127607
- gh-119182: Use public PyUnicodeWriter in wrap_strftime() #129206
- gh-119182: Use public PyUnicodeWriter in time_strftime() #129207
- gh-119182: Use public PyUnicodeWriter in ast_unparse.c #129208
- gh-119182: Use public PyUnicodeWriter in Python-ast.c #129209
- gh-119182: Use public PyUnicodeWriter in stringio.c #129243
- gh-119182: Use public PyUnicodeWriter in _json.c #129249