fix: pdfminer drops extractable text by qued · Pull Request #4310 · Unstructured-IO/unstructured
added 4 commits
March 31, 2026 16:33
qued and others added 3 commits
March 31, 2026 16:52
Track actual mapping count across both cidrange and cidchar paths instead of using len(code2cid), which undercounts multi-byte mappings stored in nested dicts. Also apply the cap to begincidchar entries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
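To illustrate why len(code2cid) undercounts, here is a minimal sketch that uses plain dicts in place of the real CMap object. The helper names (add_code2cid, load_mappings) and the limit value are hypothetical and only mimic the nested layout the commit message describes; they are not the PR's code.

```python
from typing import Dict, Iterable, Tuple


def add_code2cid(table: Dict, code: bytes, cid: int) -> None:
    """Store a mapping in a nested dict keyed byte by byte (assumed layout)."""
    node = table
    for byte in code[:-1]:
        # Each leading byte opens (or reuses) one nesting level.
        node = node.setdefault(byte, {})
    node[code[-1]] = cid


def load_mappings(
    entries: Iterable[Tuple[bytes, int]], limit: int
) -> Tuple[Dict, int]:
    """Insert entries while keeping an explicit count that enforces the cap."""
    table: Dict = {}
    count = 0
    for code, cid in entries:
        if not code:
            continue  # skip empty codes; nothing to key on
        if count >= limit:
            break  # cap applies to every entry, whatever path produced it
        add_code2cid(table, code, cid)
        count += 1
    return table, count


# Three two-byte codes that share the same first byte.
entries = [(b"\x01\x01", 1), (b"\x01\x02", 2), (b"\x01\x03", 3)]
table, count = load_mappings(entries, limit=10)
print(len(table), count)
```

The top-level dict holds a single key (the shared first byte), so its length says nothing about how many mappings were stored; the explicit counter does.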
qued and others added 4 commits
March 31, 2026 17:11
qued and others added 6 commits
March 31, 2026 21:15
- Reject decompressed CMap streams over 1MB before any regex runs, bounding parse cost against malicious inputs
- Extract /WMode from the stream so vertical writing mode is preserved

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
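The two points above can be sketched with the standard library alone. The function name read_wmode, the constant MAX_CMAP_BYTES, and the exact regex are assumptions for illustration; the PR's actual pattern and error handling may differ.

```python
import re

# Assumed cap matching the 1MB figure in the commit message.
MAX_CMAP_BYTES = 1024 * 1024

# /WMode is 0 (horizontal) or 1 (vertical) in a CMap stream.
_WMODE_RE = re.compile(rb"/WMode\s+([01])\b")


def read_wmode(data: bytes) -> int:
    """Return the writing mode found in decompressed CMap data.

    The size check runs first, so no regex ever scans an oversized payload.
    """
    if len(data) > MAX_CMAP_BYTES:
        raise ValueError("decompressed CMap stream exceeds size limit")
    match = _WMODE_RE.search(data)
    # A CMap without /WMode is horizontal, i.e. mode 0.
    return int(match.group(1)) if match else 0


print(read_wmode(b"/CMapName /Example def\n/WMode 1 def\n"))
```

Checking the length before any pattern matching keeps the parse cost proportional to a bounded input, which is the point of rejecting the stream up front.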
Replace post-construction font.cmap mutation with a CustomPDFCIDFont subclass that overrides get_cmap_from_spec, so all constructor-time state (vertical mode, widths, displacements) derives from the correct CMap.

Add _decode_pdfstream_with_limit to decode embedded CMap streams with a hard output size limit without calling get_data() or mutating the PDFStream object. Uses incremental zlib decompression that bails early on oversized payloads.

Replace the previous CustomPDFResourceManager (which swapped font.cmap after construction) with one that routes CID font subtypes through CustomPDFCIDFont during construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
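The incremental decompression idea can be sketched with zlib.decompressobj and its max_length argument. This is not the PR's _decode_pdfstream_with_limit: the name inflate_with_limit and the exception class are hypothetical, and the sketch works on raw bytes rather than a PDFStream.

```python
import zlib


class DecompressedSizeExceeded(ValueError):
    """Raised when the inflated output would exceed the allowed size."""


def inflate_with_limit(raw: bytes, limit: int) -> bytes:
    """Inflate zlib data, stopping as soon as output exceeds `limit` bytes."""
    decompressor = zlib.decompressobj()
    out = bytearray()
    data = raw
    while data and not decompressor.eof:
        # Ask for one byte more than the remaining budget: receiving it
        # proves the payload is oversized without inflating the rest.
        budget = limit - len(out) + 1
        out += decompressor.decompress(data, budget)
        if len(out) > limit:
            raise DecompressedSizeExceeded(f"output exceeds {limit} bytes")
        # Input that was not consumed because of the output cap.
        data = decompressor.unconsumed_tail
    return bytes(out)


small = zlib.compress(b"begincmap endcmap")
print(inflate_with_limit(small, limit=1024))
```

Because max_length bounds each decompress call, a highly compressible payload is rejected after producing at most limit + 1 bytes of output, so memory use stays bounded regardless of the compression ratio.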
qued deleted the fix/pdfminer-drops-extractable-text branch