fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode by KRRT7 · Pull Request #4266 · Unstructured-IO/unstructured

and others added 7 commits

February 26, 2026 07:35
Runtime improvement: the change reduces average call time from ~174μs to ~128μs (≈35% faster overall), with the biggest wins on workloads that iterate many items (e.g., the 1,000-item test improved ~49%).

What changed
- Combined two checks into one short‑circuiting conditional:
  - Old: check hasattr(item, "rendermode") first, continue if present, then isinstance(item, LTChar) before assignment.
  - New: if isinstance(item, LTChar) and not hasattr(item, "rendermode"): assign.
- Removed the explicit continue and the separate hasattr branch; the logic is identical from a correctness perspective.
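The two shapes can be sketched side by side. This is an illustrative sketch only: LTChar and LTAnno here are local stand-ins, not pdfminer's classes, so the example is self-contained.

```python
class LTChar:
    """Stand-in for pdfminer.layout.LTChar (illustration only)."""

class LTAnno:
    """Stand-in for a non-char layout item."""

def patch_old(objs, render_mode):
    # Old shape: hasattr runs for every item, even non-LTChar ones.
    for item in objs:
        if hasattr(item, "rendermode"):
            continue
        if isinstance(item, LTChar):
            item.rendermode = render_mode

def patch_new(objs, render_mode):
    # New shape: isinstance short-circuits, so hasattr only runs
    # for LTChar items that might actually need patching.
    for item in objs:
        if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
            item.rendermode = render_mode
```

Both variants produce the same result; only the number of hasattr calls per loop differs.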

Why this speeds things up
- Fewer attribute lookups per loop iteration. In Python, attribute access (hasattr performs a full attribute lookup) is relatively expensive. The original code executed hasattr(item, "rendermode") for every item, even for non-LTChar objects. By testing isinstance(item, LTChar) first, the hasattr call is skipped for non-LTChar items (the common case), saving an attribute lookup per non-LTChar item.
- Short‑circuit evaluation reduces bytecode and branching overhead (no separate continue branch).
- These savings compound in the hot path: this method is invoked from do_TJ and do_Tj while parsing text, so it runs many times per page. The profiler and tests show the loop-level savings lead to measurable end-to-end runtime improvement.
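The per-item saving can be observed with a quick timeit sketch. Absolute numbers vary by machine, and the classes are again local stand-ins; the list is deliberately dominated by non-LTChar items, the common case described above.

```python
import timeit

class LTChar: pass   # stand-in for pdfminer's LTChar
class LTAnno: pass   # stand-in for a non-char layout item

# Mixed-type list: mostly non-LTChar entries, as in typical _objs lists.
objs = [LTAnno() for _ in range(900)] + [LTChar() for _ in range(100)]

def old_style():
    for item in objs:
        if hasattr(item, "rendermode"):   # runs for all 1,000 items
            continue
        if isinstance(item, LTChar):
            pass  # assignment elided so repeated runs stay comparable

def new_style():
    for item in objs:
        # hasattr now runs only for the 100 LTChar items
        if isinstance(item, LTChar) and not hasattr(item, "rendermode"):
            pass

t_old = timeit.timeit(old_style, number=1_000)
t_new = timeit.timeit(new_style, number=1_000)
print(f"old: {t_old:.4f}s  new: {t_new:.4f}s")
```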

Profiler & test evidence
- Overall profiler total time dropped from 0.001776s → 0.001371s.
- The large-scale test (1000 LTChar objects) went from 113μs → 76.2μs (~49.5% faster), demonstrating the optimization scales with number of items.
- Many unit/test cases also show smaller but consistent improvements (see annotated_tests). A single small regression was observed in one micro-benchmark (+~5% in a very narrow case where items are always LTChar and already patched), which is an acceptable trade-off given the overall runtime/throughput gains.

Impact on workloads
- Best for PDFs with many text items or mixed-type _objs lists (many non-LTChar items): savings are greatest because we avoid unnecessary hasattr calls for non-LTChar entries.
- Safe to merge: behavior is preserved (pre-existing rendermode attributes are still honored), and dependencies/visibility are unchanged.

Summary
- Primary benefit: reduced runtime (35% overall, large improvements on heavier inputs).
- Cause: eliminated redundant attribute checks via a single short‑circuiting conditional, lowering per-item overhead in a hot path.
- Trade-off: negligible; one tiny micro-benchmark showed a small regression, but the overall throughput and real-workload performance improved significantly.
Track last-patched index so each do_TJ/do_Tj call only processes
newly-added LTChar objects instead of re-scanning the entire _objs list.
Resets automatically when cur_item changes (new page or figure).
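A minimal sketch of that bookkeeping, with illustrative names (the real method lives on CustomPDFPageInterpreter and patches LTChar objects; the review discussion below notes a caveat when cur_item reverts after a figure):

```python
class IndexTrackingPatcher:
    """Sketch of last-patched-index tracking; names are illustrative."""

    def __init__(self):
        self._last_item = None        # identity of the item last patched into
        self._last_patched_idx = 0    # how far into _objs we have patched

    def patch(self, cur_item, render_mode):
        # Reset when cur_item changes (e.g. a new page or figure starts).
        if cur_item is not self._last_item:
            self._last_item = cur_item
            self._last_patched_idx = 0
        objs = getattr(cur_item, "_objs", [])
        # Only items appended since the previous call are touched.
        for obj in objs[self._last_patched_idx:]:
            obj.render_mode = render_mode
        self._last_patched_idx = len(objs)
```

Each call is O(newly appended items) rather than O(total items), which removes the quadratic re-scan.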
Runtime improvement (primary): The optimized function runs ~7% faster overall (181 μs → 168 μs). That runtime reduction is the reason this change was accepted.

What changed (concrete optimizations)
- Single-step cur_item lookup: replaced "hasattr(self.device, 'cur_item') and self.device.cur_item" with cur_item = getattr(self.device, 'cur_item', None) and an early return. This reduces two attribute lookups into one and short-circuits the common no-cur-item case.
- Single getattr for _objs: replaced the conditional expression cur_item._objs if hasattr(cur_item, "_objs") else [] with objs = getattr(cur_item, "_objs", []). That avoids an extra hasattr call and repeated attribute access.
- Avoid repeated len/index work: replaced index-based loop for i in range(start, len(objs)): obj = objs[i] with direct iteration over the sub-list objs[start:] (for obj in objs[start:]). Also added an explicit if start < len(objs): guard so we don't build a slice when there's nothing to process.
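Put together, the three changes give a method body shaped roughly like this. This is a sketch, not the shipped code: the class is a stand-in for CustomPDFPageInterpreter, and the attribute names (device, cur_item, _objs, _last_patched_idx) follow the description above.

```python
class PatcherSketch:
    def __init__(self, device):
        self.device = device
        self._last_patched_idx = 0

    def _patch_current_chars_with_render_mode(self, render_mode):
        # One getattr with a default replaces hasattr + attribute fetch,
        # and the common no-cur-item case returns early.
        cur_item = getattr(self.device, "cur_item", None)
        if not cur_item:
            return
        objs = getattr(cur_item, "_objs", [])   # again one lookup, not two
        start = self._last_patched_idx
        if start < len(objs):          # skip slicing when nothing is new
            for obj in objs[start:]:   # direct iteration, no objs[i]
                obj.render_mode = render_mode
            self._last_patched_idx = len(objs)
```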

Why these changes speed things up (mechanics)
- Fewer Python-level attribute lookups: getattr replaces two operations (hasattr + attribute fetch) with one, and caching cur_item and objs removes repeated attribute access inside the hot path. Attribute lookup in Python is comparatively expensive, so reducing them reduces per-call overhead.
- Reduced indexing overhead: using direct iteration avoids repeated integer arithmetic and two list index operations per iteration (compute index, __getitem__). That lowers per-iteration Python overhead inside the tight loop that checks and patches LTChar objects.
- Early-return common negative case: when device.cur_item is missing/falsy the function now returns earlier with less work, which helps where many interpreter calls have no cur_item.
- Guarding slice creation: creating objs[start:] can allocate a new list. The added start < len(objs) check prevents unnecessary allocations on the common empty/no-new-items case; when the slice is needed it's typically small (we only process newly appended items), so the copy cost is minimal compared to saved per-iteration overhead.

Why this matters in context (hot path)
- This method is invoked from do_TJ and do_Tj (text painting operators) in the interpreter. Those are hot paths during PDF text processing: the method can be called many times per page. Small per-call savings accumulate, so a ~7% per-call runtime improvement is meaningful.

Behavioral and workload impact (based on tests)
- Regression tests show the biggest wins on heavy workloads: large-scale patching and repeated-appends (1000-item and many-iteration cases) observe sizable reductions (e.g., ~16% and ~26% in annotated tests). Those are precisely the cases where per-iteration overhead dominates.
- Some tiny regressions in a few microbench tests (single small calls) are visible in the annotated tests (sub-microsecond differences or small percent slowdowns). These are minor and expected trade-offs for the net runtime improvement across typical workloads.
- Memory trade-off: objs[start:] creates a shallow copy of the sub-list. In typical usage this sub-list is small (only newly appended items) so the allocation cost is small and outweighed by per-iteration savings. If you have pathological cases where you repeatedly slice very large tails, that could increase temporary memory pressure — but tests and typical interpreter usage show net benefit.

Correctness
- The logic is preserved: render_mode is applied only to new LTChar objects and _last_patched_idx is updated the same way. Using getattr defaults keeps previous behavior for missing attributes.

Summary
- Net effect: fewer attribute lookups, less indexing overhead, and an early-exit optimize the common and hot cases encountered during PDF text interpretation, producing the measured 7% runtime speedup. The small memory/alloc cost of slicing is guarded and, in practice, outweighed by the reduced CPU overhead on the hot path (do_TJ/do_Tj).
…erpreter._patch_current_chars_with_render_mode-mm3lbz82

⚡️ Speed up method `CustomPDFPageInterpreter._patch_current_chars_with_render_mode` by 7%
Use idiomatic slice iteration instead of index-based loop, remove
redundant guard and trailing return.

qued

The _last_patched_idx approach overwrites previously patched chars
when cur_item reverts after a figure with text ops. Instead, each
do_TJ/do_Tj snapshots len(objs) before super() and only patches
from that index.
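A sketch of that snapshot approach. The base-class call is replaced by a stub here so the example is self-contained; the real override delegates to pdfminer's super().do_TJ, which appends LTChar objects to the current layout item.

```python
class Char:
    """Stand-in for an appended LTChar."""

class SnapshotInterpreter:
    """Sketch: snapshot len(objs) before delegating, then patch only
    the chars appended by this operator."""

    def __init__(self, device, render_mode):
        self.device = device
        self.render_mode = render_mode

    def _base_do_TJ(self, seq):
        # Stub for pdfminer's PDFPageInterpreter.do_TJ, which appends
        # LTChar objects to device.cur_item._objs while painting text.
        for _ in seq:
            self.device.cur_item._objs.append(Char())

    def do_TJ(self, seq):
        objs = self.device.cur_item._objs
        start = len(objs)            # snapshot before super() appends
        self._base_do_TJ(seq)
        for obj in objs[start:]:     # patch only this call's new chars
            obj.render_mode = self.render_mode
```

Because the snapshot is taken per call, a later call cannot re-patch (or overwrite) chars appended by an earlier one, even if cur_item reverts after a figure.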

@KRRT7

@qued

pdfminer's base do_Tj delegates to self.do_TJ([s]), which already
dispatches to the overridden do_TJ. The do_Tj override was patching
the same char range a second time.

Repro (add print traces to do_TJ/do_Tj, run against any PDF):

    from unstructured.partition.pdf_image.pdfminer_utils import open_pdfminer_pages_generator

    with open("example-docs/pdf/reliance.pdf", "rb") as f:
        for page, layout in open_pdfminer_pages_generator(f):
            break

Before this fix, every Tj op produces two patch calls with the same
start index:

    [TRACE] do_TJ patching from 9
    [TRACE] do_Tj patching from 9   <- redundant


…r-branch

Merge changes from upstream


qued approved these changes Feb 27, 2026

Merged via the queue into Unstructured-IO:main with commit aef3bc4

Feb 27, 2026

54 of 55 checks passed

@qued qued deleted the codeflash/optimize-CustomPDFPageInterpreter._patch_current_chars_with_render_mode-mm3h21a8 branch

February 27, 2026 17:19

@qued qued mentioned this pull request

Feb 27, 2026