feat: add create_file_from_elements() to re-create document files from elements by claytonlin1110 · Pull Request #4259 · Unstructured-IO/unstructured

@PastelStorm have a chance to review?

Review: create_file_from_elements()

Nice feature — having a unified "elements back to document" API is a good addition. A few things to address before merging:


High — HTML mode silently drops content with default settings

create_file_from_elements() defaults no_group_by_page=False, but the underlying HTML converter (_elements_to_html_tags_by_page) skips any element that lacks metadata.page_number:

# unstructured/partition/html/convert.py, lines 294-299
for element in unstructured_elements:
    page_number = element.metadata.page_number
    if page_number is None:
        logger.warning(f"Page number is not set for an element {element.id}. Skipping.")
        continue
    pages_dict[page_number].append(element)

For many partition sources (e.g. non-paginated documents, or elements that have been manually constructed/modified), this produces nearly empty HTML while the API promises to "re-create a document file." The new test sidesteps this entirely by passing no_group_by_page=True.

Suggestion: Default no_group_by_page=True in create_file_from_elements(), since a round-trip function should preserve all elements by default. If page grouping is desired, callers can opt in explicitly.


Medium — assert used for runtime type narrowing

content = elements_to_text(elements, filename=None, encoding=encoding)
assert content is not None  # we passed filename=None

assert statements are stripped under python -O. In optimized mode this silently lets None through despite the -> str return type. Either add a proper runtime guard:

if content is None:
    raise RuntimeError("elements_to_text returned None unexpectedly")

Or skip the Optional[str] fight altogether and call convert_to_text(elements) directly, which always returns str.


Medium — Test coverage gaps

  • The HTML test forces no_group_by_page=True, so the default (and riskier) code path is never exercised. A test with the default settings on elements that lack page_number would catch the silent-drop behavior above.
  • The text and html tests don't exercise the filename write path. Since the markdown test does, this looks like an oversight — one parameterized test across all three formats would cover it cleanly.
  • No coverage for exclude_binary_image_data passthrough or format case/whitespace normalization (e.g. " Markdown ""markdown").

Low — Format-specific params as top-level arguments

no_group_by_page only applies to HTML. exclude_binary_image_data is irrelevant for text. Surfacing every format-specific knob at the top level won't scale if more options are added. Consider **kwargs pass-through to the underlying converter, or at minimum document which params apply to which formats.


Low — format shadows the built-in

The parameter name format shadows Python's built-in format(). Not a bug here, but easy to avoid (e.g. output_format) and removes a common lint warning.


Low — Write-path duplication

create_file_from_elements() re-implements file writing (open(...).write(...)) rather than delegating to the format-specific functions that already support filename=. This is fine today, but increases drift risk if encoding or output semantics evolve in the individual converters.


What I checked

Area Result
Duplicated code Minor (write-path), not severe
Dead code None in this diff
Incorrect mock usage No mocking introduced
Race conditions None — straightforward single-file writes
Useless tests Not useless, but the HTML test bypasses the critical default path

Summary

The main thing to fix is the HTML default — a function called "create file from elements" shouldn't silently discard elements out of the box. The assert should also be replaced with a proper guard. The rest are smaller improvements that would be nice to clean up while you're in here.