fix: pdfminer drops extractable text by qued · Pull Request #4310 · Unstructured-IO/unstructured

added 4 commits

March 31, 2026 16:33

cursor[bot]

qued and others added 3 commits

March 31, 2026 16:52
Add _MAX_CODE2CID_MAPPINGS (131072) limit to _parse_embedded_cmap_stream
to prevent malicious PDFs with enormous begincidrange spans from causing
excessive memory/CPU usage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
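
For reviewers, the cap described in this commit could look roughly like the sketch below. Only `_MAX_CODE2CID_MAPPINGS` and its value come from the commit message; the regex, the `expand_cidranges` helper, and the flat `dict` layout are illustrative assumptions, not the code in this PR.

```python
import re

_MAX_CODE2CID_MAPPINGS = 131072  # value quoted in the commit message

# Assumed shape of one entry in a begincidrange block: <start> <end> cid
_CIDRANGE_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(\d+)")


def expand_cidranges(block: str, limit: int = _MAX_CODE2CID_MAPPINGS) -> dict:
    """Expand cidrange triples into a flat code -> CID dict, stopping at `limit`.

    Hypothetical helper; the real parser may store mappings differently.
    """
    code2cid = {}
    for start_hex, end_hex, cid_str in _CIDRANGE_RE.findall(block):
        start, end, cid = int(start_hex, 16), int(end_hex, 16), int(cid_str)
        range_size = end - start + 1
        if len(code2cid) + range_size > limit:
            # Check the size before expanding, so a huge span is never allocated.
            break
        for offset in range(range_size):
            code2cid[start + offset] = cid + offset
    return code2cid
```

The size check runs before the inner loop, so a span such as `<0000> <FFFFFF> 0` is rejected without iterating over it.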

cursor[bot]

Track actual mapping count across both cidrange and cidchar paths
instead of using len(code2cid) which undercounts multi-byte mappings
stored in nested dicts. Also apply the cap to begincidchar entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
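
To illustrate why `len(code2cid)` undercounts, here is a minimal sketch that assumes multi-byte codes are stored as dicts nested by byte, as the commit message describes. `add_mapping` and `load_mappings` are hypothetical names used only for this example.

```python
def add_mapping(code2cid: dict, code: bytes, cid: int) -> None:
    """Store a code as dicts nested by byte; the last byte maps to the CID."""
    node = code2cid
    for byte in code[:-1]:
        node = node.setdefault(byte, {})
    node[code[-1]] = cid


def load_mappings(entries, limit: int):
    """Insert (code, cid) pairs until `limit` mappings have been stored.

    An explicit counter is used because len(code2cid) only counts distinct
    first bytes, not the leaves underneath them.
    """
    code2cid = {}
    count = 0
    for code, cid in entries:
        if count >= limit:
            break
        add_mapping(code2cid, code, cid)
        count += 1
    return code2cid, count
```

With five two-byte codes that share the lead byte `0x81` and a limit of 3, the counter stops at 3 while `len(code2cid)` stays at 1, which is the gap this commit closes.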

cursor[bot]

qued and others added 4 commits

March 31, 2026 17:11
- Skip begincidrange entries where end < start to prevent negative
  range_size from decrementing the DoS counter
- Use resolve1() on Encoding value to handle indirect PDFObjRef
  references before the isinstance(PDFStream) check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
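
The first bullet can be sketched as follows. This covers only the reversed-range guard, not the `resolve1()` change, and `count_range_mappings` is a hypothetical helper rather than code from this PR.

```python
def count_range_mappings(ranges, limit: int):
    """Return (accepted_ranges, mapping_count) for (start, end) pairs.

    Ranges with end < start are skipped: their range_size would be zero or
    negative, and adding it would hand budget back to a malicious file.
    """
    accepted = []
    count = 0
    for start, end in ranges:
        if end < start:
            continue  # malformed range; never let it touch the counter
        range_size = end - start + 1
        if count + range_size > limit:
            break
        count += range_size
        accepted.append((start, end))
    return accepted, count
```

Without the guard, a range like `(100, 0)` would contribute `-99` to the counter and let later ranges exceed the cap.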

badGarnet

qued and others added 6 commits

March 31, 2026 21:15
- Reject decompressed CMap streams over 1MB before any regex runs,
  bounding parse cost against malicious inputs
- Extract /WMode from the stream so vertical writing mode is preserved

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
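
A rough sketch of both bullets, assuming the decompressed stream is available as `bytes`. The constant name, the choice of 1024 * 1024 for "1MB", the regex, and `read_wmode` are assumptions made for illustration.

```python
import re

_MAX_CMAP_BYTES = 1024 * 1024  # assumed reading of "1MB"; hypothetical name

# CMap programs declare the writing mode as "/WMode <n> def".
_WMODE_RE = re.compile(rb"/WMode\s+(\d+)\s+def")


def read_wmode(data: bytes, max_bytes: int = _MAX_CMAP_BYTES):
    """Return the /WMode value, 0 if absent, or None if the stream is too big."""
    if len(data) > max_bytes:
        # Size check comes first, so no regex ever scans an oversized payload.
        return None
    match = _WMODE_RE.search(data)
    return int(match.group(1)) if match else 0
```

Returning `None` for oversized input keeps "rejected" distinct from "horizontal by default" (`0`).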

Replace post-construction font.cmap mutation with a CustomPDFCIDFont
subclass that overrides get_cmap_from_spec, so all constructor-time
state (vertical mode, widths, displacements) derives from the correct
CMap.

Add _decode_pdfstream_with_limit to decode embedded CMap streams with
a hard output size limit without calling get_data() or mutating the
PDFStream object. Uses incremental zlib decompression that bails early
on oversized payloads.
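
A minimal sketch of incremental decompression with a hard output cap, using only the standard `zlib` module. `decompress_with_limit` is a hypothetical stand-in for `_decode_pdfstream_with_limit`; it works on raw bytes and assumes a single Flate-encoded payload with no other filters or predictors.

```python
import zlib


def decompress_with_limit(raw: bytes, max_output: int, chunk_size: int = 64 * 1024):
    """Inflate `raw`, returning None as soon as output exceeds `max_output`."""
    decomp = zlib.decompressobj()
    out = bytearray()
    pending = raw
    while pending and not decomp.eof:
        # Allow one byte past the cap so an overflow can be detected.
        room = max_output - len(out) + 1
        out += decomp.decompress(pending, min(chunk_size, room))
        if len(out) > max_output:
            return None  # bail early; the rest is never inflated
        # Input that was not consumed because of max_length is kept here.
        pending = decomp.unconsumed_tail
    # Collect anything still buffered, then re-check the cap.
    out += decomp.flush()
    if len(out) > max_output:
        return None
    return bytes(out)
```

Because each `decompress` call is bounded by `max_length`, a highly compressible payload is abandoned shortly after it crosses the cap instead of being inflated in full. Corrupt input still raises `zlib.error`, which a caller would need to handle.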

Replace the previous CustomPDFResourceManager (which swapped font.cmap
after construction) with one that routes CID font subtypes through
CustomPDFCIDFont during construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
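
The design change in this commit can be illustrated with stand-in classes. `BaseFont`, `EmbeddedCMapFont`, and the dict-based CMap below are hypothetical and are not pdfminer classes; they only show why overriding the `get_cmap_from_spec` hook differs from swapping `font.cmap` afterwards.

```python
class BaseFont:
    """Stand-in for a font whose constructor derives state from its CMap."""

    def __init__(self, spec: dict):
        self.cmap = self.get_cmap_from_spec(spec)
        # Computed once, at construction time, from whatever CMap was chosen.
        self.vertical = self.cmap["wmode"] == 1

    def get_cmap_from_spec(self, spec: dict) -> dict:
        return {"name": spec.get("Encoding", "Identity-H"), "wmode": 0}


class EmbeddedCMapFont(BaseFont):
    """Overrides the hook so the constructor sees the embedded CMap."""

    def get_cmap_from_spec(self, spec: dict) -> dict:
        embedded = spec.get("EmbeddedCMap")
        if embedded is not None:
            return embedded
        return super().get_cmap_from_spec(spec)


spec = {"EmbeddedCMap": {"name": "custom", "wmode": 1}}

# Old approach: mutate after construction; derived state is already stale.
mutated = BaseFont(spec)
mutated.cmap = spec["EmbeddedCMap"]

# New approach: the subclass picks the right CMap during construction.
font = EmbeddedCMapFont(spec)
```

Here `mutated.vertical` stays `False` even though its `cmap` now reports vertical mode, while `font.vertical` is `True`. That stale-state problem is what routing CID fonts through a subclass during construction avoids.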

cragwolfe

qued deleted the fix/pdfminer-drops-extractable-text branch

April 1, 2026 19:20