mem: exclude unused spaCy pipeline components to reduce model memory by KRRT7 · Pull Request #4296 · Unstructured-IO/unstructured
Only tok2vec, the tagger, and sentence splitting are used (by pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, and attribute_ruler when loading en_core_web_sm, and add the lightweight sentencizer to replace the dependency parser for sentence boundary detection. This saves ~12 MiB of model weights per process.
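The loading change described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `load_light_model`, `UNUSED_COMPONENTS`, and the injectable `loader` parameter are hypothetical names introduced here. It assumes spaCy v3, where `spacy.load` accepts `exclude` and the rule-based sentence splitter is registered as `"sentencizer"`.

```python
from typing import Any, Callable, Optional

# Components the description lists as unused by pos_tag and sent_tokenize.
UNUSED_COMPONENTS = ["ner", "parser", "lemmatizer", "attribute_ruler"]


def load_light_model(
    name: str = "en_core_web_sm",
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a pipeline without the unused components (hypothetical helper)."""
    if loader is None:
        # Imported lazily so this sketch can be exercised without spaCy installed.
        import spacy

        loader = spacy.load
    # `exclude` skips the listed components at load time, so their weights
    # are never read into memory (unlike disabling them after loading).
    nlp = loader(name, exclude=UNUSED_COMPONENTS)
    # Without the parser, doc.sents needs another source of sentence
    # boundaries; the sentencizer sets them from punctuation rules.
    if "sentencizer" not in nlp.pipe_names:
        nlp.add_pipe("sentencizer")
    return nlp
```

With spaCy and en_core_web_sm installed, `nlp = load_light_model()` should return a pipeline whose `nlp.pipe_names` contains only tok2vec, tagger, and sentencizer, and `nlp(text).sents` then iterates over the punctuation-based sentence spans.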
Merged via the queue into Unstructured-IO:main with commit a3172f8
53 of 54 checks passed