mem: exclude unused spaCy pipeline components to reduce model memory by KRRT7 · Pull Request #4296 · Unstructured-IO/unstructured

badGarnet

Only tok2vec, tagger, and sentence splitting are used (by pos_tag and
sent_tokenize). Exclude ner, parser, lemmatizer, and attribute_ruler when
loading en_core_web_sm, and add a lightweight sentencizer to replace the
dependency parser for sentence boundary detection.

Saves ~12 MiB of model weights per process.

@KRRT7

Per review feedback, removing the parser and using the sentencizer causes
sentence-splitting regressions. Keep the parser loaded and exclude only
ner, lemmatizer, and attribute_ruler.

@KRRT7
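
For illustration, the final configuration described in the commit above could be sketched roughly as follows. This is a minimal sketch, not the code from the PR: the constant `EXCLUDED_COMPONENTS` and the helper `load_trimmed_pipeline` are hypothetical names, and it assumes spaCy v3 with the `en_core_web_sm` package installed. It relies on the `exclude` argument of `spacy.load`, which skips loading the named components entirely.

```python
# Hypothetical names; a sketch of the approach, not the PR's actual code.

# Components the second commit message says can be dropped. The parser is
# deliberately NOT listed, because sentence splitting depends on it.
EXCLUDED_COMPONENTS = ("ner", "lemmatizer", "attribute_ruler")


def load_trimmed_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy model without the components listed above.

    Assumes spaCy v3 and that `model_name` is installed.
    """
    # Imported lazily so that importing this module does not pull in spaCy.
    import spacy

    # `exclude` prevents the components from being loaded at all, unlike
    # `disable`, which loads them but leaves them out of the active pipeline.
    return spacy.load(model_name, exclude=list(EXCLUDED_COMPONENTS))
```

Calling `load_trimmed_pipeline().pipe_names` should then no longer list the excluded components, while the parser stays available for sentence boundaries through `doc.sents`.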

cragwolfe

Merged via the queue into Unstructured-IO:main with commit a3172f8

Mar 31, 2026

53 of 54 checks passed