feat: add create_file_from_elements() to re-create document files from elements by claytonlin1110 · Pull Request #4259 · Unstructured-IO/unstructured
@PastelStorm have a chance to review?
Review: create_file_from_elements()
Nice feature — having a unified "elements back to document" API is a good addition. A few things to address before merging:
High — HTML mode silently drops content with default settings
create_file_from_elements() defaults no_group_by_page=False, but the underlying HTML converter (_elements_to_html_tags_by_page) skips any element that lacks metadata.page_number:
# unstructured/partition/html/convert.py, lines 294-299 for element in unstructured_elements: page_number = element.metadata.page_number if page_number is None: logger.warning(f"Page number is not set for an element {element.id}. Skipping.") continue pages_dict[page_number].append(element)
For many partition sources (e.g. non-paginated documents, or elements that have been manually constructed/modified), this produces nearly empty HTML while the API promises to "re-create a document file." The new test sidesteps this entirely by passing no_group_by_page=True.
Suggestion: Default no_group_by_page=True in create_file_from_elements(), since a round-trip function should preserve all elements by default. If page grouping is desired, callers can opt in explicitly.
Medium — assert used for runtime type narrowing
content = elements_to_text(elements, filename=None, encoding=encoding) assert content is not None # we passed filename=None
assert statements are stripped under python -O. In optimized mode this silently lets None through despite the -> str return type. Either add a proper runtime guard:
if content is None: raise RuntimeError("elements_to_text returned None unexpectedly")
Or skip the Optional[str] fight altogether and call convert_to_text(elements) directly, which always returns str.
Medium — Test coverage gaps
- The HTML test forces
no_group_by_page=True, so the default (and riskier) code path is never exercised. A test with the default settings on elements that lackpage_numberwould catch the silent-drop behavior above. - The
textandhtmltests don't exercise thefilenamewrite path. Since the markdown test does, this looks like an oversight — one parameterized test across all three formats would cover it cleanly. - No coverage for
exclude_binary_image_datapassthrough orformatcase/whitespace normalization (e.g." Markdown "→"markdown").
Low — Format-specific params as top-level arguments
no_group_by_page only applies to HTML. exclude_binary_image_data is irrelevant for text. Surfacing every format-specific knob at the top level won't scale if more options are added. Consider **kwargs pass-through to the underlying converter, or at minimum document which params apply to which formats.
Low — format shadows the built-in
The parameter name format shadows Python's built-in format(). Not a bug here, but easy to avoid (e.g. output_format) and removes a common lint warning.
Low — Write-path duplication
create_file_from_elements() re-implements file writing (open(...).write(...)) rather than delegating to the format-specific functions that already support filename=. This is fine today, but increases drift risk if encoding or output semantics evolve in the individual converters.
What I checked
| Area | Result |
|---|---|
| Duplicated code | Minor (write-path), not severe |
| Dead code | None in this diff |
| Incorrect mock usage | No mocking introduced |
| Race conditions | None — straightforward single-file writes |
| Useless tests | Not useless, but the HTML test bypasses the critical default path |
Summary
The main thing to fix is the HTML default — a function called "create file from elements" shouldn't silently discard elements out of the box. The assert should also be replaced with a proper guard. The rest are smaller improvements that would be nice to clean up while you're in here.