Releases · Unstructured-IO/unstructured
0.22.16
Enhancements
- Formula markdown export (`element_to_md` / `elements_to_md`): New keyword-only `formula_markdown_style` (`"auto"`, `"display_math"`, `"plain"`; default `"auto"`). In `"auto"`, display math (`$$ ... $$`) is used only when the text looks like notation (heuristic score) and contains no `$` / `$$` (avoids breaking Markdown and noisy OCR captions). `"display_math"` wraps whenever safe (still falls back to plain if `$` would corrupt fences). `"plain"` emits text only. Optional `normalize_formula` (default `True`) maps common Unicode operators to LaTeX-like tokens; `normalize_formula` stays before keyword-only options so positional `encoding` / `no_group_by_page` callers are unchanged. Unicode `√` is never mapped to `\\sqrt{}`. Module constants: `FORMULA_MARKDOWN_AUTO`, `FORMULA_MARKDOWN_DISPLAY_MATH`, `FORMULA_MARKDOWN_PLAIN`.
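The three styles described above can be sketched roughly as follows. This is an illustration of the decision logic only, not the library's implementation: `render_formula`, `looks_like_notation`, the character set, and the threshold are all hypothetical stand-ins for the real heuristic score.

```python
import re

AUTO, DISPLAY_MATH, PLAIN = "auto", "display_math", "plain"

# Hypothetical notation heuristic: count characters that commonly appear in formulas.
_NOTATION_CHARS = re.compile(r"[=^_\\{}+\-*/<>∑∫≤≥±]")


def looks_like_notation(text: str, threshold: int = 2) -> bool:
    """Return True when the text contains at least `threshold` notation characters."""
    return len(_NOTATION_CHARS.findall(text)) >= threshold


def render_formula(text: str, *, style: str = AUTO) -> str:
    """Render formula text as Markdown according to the requested style."""
    if style not in (AUTO, DISPLAY_MATH, PLAIN):
        raise ValueError(f"unknown formula markdown style: {style!r}")
    text = text.strip()
    # "plain" never wraps; a "$" inside the text would corrupt the $$ fences.
    if style == PLAIN or not text or "$" in text:
        return text
    # "auto" wraps only when the text looks like notation.
    if style == AUTO and not looks_like_notation(text):
        return text
    return f"$$\n{text}\n$$"
```

Under these assumptions, `render_formula("E = mc^2")` is wrapped in `$$` fences, while an OCR caption such as `"Figure 3 caption"` or any text containing `$` is returned as plain text.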
0.22.15
Security
- security: fix(deps): upgrade vulnerable transitive dependencies [security]
0.22.14
Enhancements
- Deduplicate PDF rendering: Remove `_render_pdf_pages` and delegate to `unstructured-inference`'s `convert_pdf_to_image` (which already has lazy per-page rendering). Peak memory for `path_only=True` drops from O(n_pages) to O(1 page), a 97% reduction on a 100-page PDF. Bumps inference dep to `>=1.6.2`.
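The memory effect of lazy per-page rendering can be illustrated with a generator that holds only one rendered page at a time. This is a minimal sketch of the general pattern, not the code in `unstructured-inference`: `iter_page_image_paths` and the `render_page` callback are hypothetical names, and the "image" is just raw bytes.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def iter_page_image_paths(
    n_pages: int,
    render_page: Callable[[int], bytes],
    output_dir: Path,
) -> Iterator[Path]:
    """Render pages one at a time, write each to disk, and yield only its path."""
    for page_number in range(1, n_pages + 1):
        image_bytes = render_page(page_number)  # only this page is in memory
        path = output_dir / f"page_{page_number}.bin"
        path.write_bytes(image_bytes)
        del image_bytes  # drop the reference before rendering the next page
        yield path


def fake_render(page_number: int) -> bytes:
    """Hypothetical renderer standing in for a real PDF rasterizer."""
    return f"page {page_number}".encode()


with tempfile.TemporaryDirectory() as tmp:
    paths = list(iter_page_image_paths(3, fake_render, Path(tmp)))
    print(len(paths))
```

Because the caller receives paths rather than image objects, peak memory is bounded by a single page regardless of document length, which is the behavior the note describes for `path_only=True`.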
0.22.13
Enhancements
- Speed up `standardize_quotes`: Replace loop-based character replacement with a single `str.translate()` call using a pre-computed translation table. Also fixes a pre-existing bug where left smart quotes were never normalized due to duplicate dictionary keys.
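A minimal sketch of the translation-table approach is shown below. The function name `standardize_quotes_sketch` and the exact set of mapped characters are assumptions for illustration; the library's table may cover more characters. The `BUGGY` dictionary shows one way duplicate keys could silently drop an entry, which is an assumption about how the original bug arose.

```python
# Pre-computed once at import time; str.translate() then does a single pass.
QUOTE_TABLE = str.maketrans(
    {
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
        "\u201e": '"',  # double low-9 quotation mark
        "\u00ab": '"',  # left-pointing double angle quotation mark
        "\u00bb": '"',  # right-pointing double angle quotation mark
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark
        "\u201a": "'",  # single low-9 quotation mark
    }
)


def standardize_quotes_sketch(text: str) -> str:
    """Replace smart quotes with their ASCII equivalents in one pass."""
    return text.translate(QUOTE_TABLE)


# Duplicate keys in a dict literal are legal Python: the last value wins and
# the earlier entry disappears without any error.
BUGGY = {'"': "left double quote", '"': "right double quote"}

print(standardize_quotes_sketch("\u201cHi\u201d \u2018there\u2019"))
print(len(BUGGY))
```

Building the table once avoids a Python-level loop over every candidate character on each call, and using explicit `\u` escapes as keys makes accidental duplicates easier to spot in review.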
0.22.12
What's Changed
- mem: exclude unused spaCy pipeline components to reduce model memory by @KRRT7 in #4296
- fix: pdfminer drops extractable text by @qued in #4310
Full Changelog: 0.22.10...0.22.12
0.22.10
0.22.6
What's Changed
- fix(deps): Update security updates [SECURITY] by @utic-renovate[bot] in #4303
- fix: Self-contained script for version extraction in release CI by @vladimir-kivi-ds in #4304
Full Changelog: 0.22.4...0.22.6
0.22.4
What's Changed
- feat: add create_file_from_elements() to re-create document files from elements by @claytonlin1110 in #4259
- Bump dependencies by @PastelStorm in #4265
- fix: avoid O(N²) re-scanning in _patch_current_chars_with_render_mode by @KRRT7 in #4266
- add check if libmagic fails by @aadland6 in #4273
- Adds Form Element by @aadland6 in #4272
- feat: audio speech to text partition by @claytonlin1110 in #4264
- Add a check for complex pdfs by @aadland6 in #4268
- chore: disable fail-build on Anchore container scan by @lawrence-u10d in #4285
- feat: make telemetry off by default by @claytonlin1110 in #4281
- fix(deps): Update security vulnerability in pypdf to v6.9.1 [SECURITY] by @utic-renovate[bot] in #4248
- feat: Store routing in ElementMetadata by @vladimir-kivi-ds in #4293
- feat: custom Markdown extensions for partition_md by @claytonlin1110 in #4292
- feat: tablechunks can reconstruct table by @qued in #4291
New Contributors
- @KRRT7 made their first contribution in #4266
- @vladimir-kivi-ds made their first contribution in #4293
Full Changelog: 0.21.5...0.22.4
0.21.5
0.21.2
fix: self-install pinned spaCy model at runtime with SHA256 verificat…
0.21.1
0.21.0
Fixes
- Replace NLTK with spaCy to remediate CVE-2025-14009: NLTK's downloader uses `zipfile.extractall()` without path validation, enabling RCE via malicious packages (CVSS 10.0, no patch available). spaCy models install as pip packages, eliminating the vulnerable downloader entirely.
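For context, the kind of path validation the note says was missing can be sketched as a pre-extraction check on archive member names. This is not how the release addresses the CVE (it removes the downloader instead), and `safe_extract` is a hypothetical helper, not an NLTK or unstructured function. It requires Python 3.9+ for `Path.is_relative_to`.

```python
import zipfile
from pathlib import Path


def safe_extract(archive: zipfile.ZipFile, dest: Path) -> None:
    """Extract an archive only if every member resolves inside `dest`."""
    dest = dest.resolve()
    for member in archive.infolist():
        target = (dest / member.filename).resolve()
        # Reject absolute paths and ".." components that escape the destination.
        if not target.is_relative_to(dest):
            raise ValueError(f"unsafe path in archive: {member.filename!r}")
    archive.extractall(dest)
```

A check like this only guards against files landing outside the target directory. It does not make the contents of an untrusted archive safe to import or execute, which is why removing the download-and-unzip step altogether is the more robust remediation.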