feat: tablechunks can reconstruct table by qued · Pull Request #4291 · Unstructured-IO/unstructured

and others added 4 commits

March 23, 2026 18:39
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

In lxml, append() moves an element out of its current parent instead of
copying it, which disrupts the lazy row iterator. Wrapping the iterator in
list() materializes all rows before the loop so none are skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
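
The fix described in this commit can be sketched as follows, assuming lxml is installed. The helper `move_rows` and the sample tables are hypothetical and are not taken from the PR; the sketch only shows the rows being snapshotted with list() before any of them is re-parented.

```python
from lxml import etree


def move_rows(source, target):
    """Move every <tr> that is a direct child of `source` into `target`.

    Hypothetical helper for illustration only.
    """
    # Snapshot the rows first: append() re-parents each element, so the
    # tree being walked would otherwise change while the loop runs.
    rows = list(source.iterchildren("tr"))
    for row in rows:
        target.append(row)  # moves the row out of `source`
    return len(rows)


source = etree.fromstring(
    "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"
)
target = etree.fromstring("<table><tr><td>header</td></tr></table>")
moved = move_rows(source, target)
```

Because the list is built before the loop starts, the number of rows moved does not depend on how the underlying iterator reacts to the tree changing.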

qued and others added 3 commits

March 23, 2026 19:15
When a table has HTML but takes the text-only chunking path (because of a
small hard_max), the deep-copied metadata retained the full original
text_as_html. That would cause reconstruct_table_from_chunks to
duplicate rows. Text-only chunks now explicitly set text_as_html=None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
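
A minimal sketch of the idea in this commit, under stated assumptions: `Metadata` is a hypothetical stand-in for the library's metadata object and `text_only_chunk_metadata` is an invented helper name. It shows a deep copy carrying the original HTML along unless the field is cleared explicitly.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in, not the library's real metadata class."""

    text_as_html: Optional[str] = None
    page_number: Optional[int] = None


def text_only_chunk_metadata(original: Metadata) -> Metadata:
    """Copy metadata for a text-only chunk, dropping the full-table HTML."""
    meta = copy.deepcopy(original)
    # Without this line every text-only chunk would still carry the HTML of
    # the whole table, and a later merge would see the same rows repeatedly.
    meta.text_as_html = None
    return meta


table_meta = Metadata(text_as_html="<table><tr><td>a</td></tr></table>", page_number=3)
chunk_meta = text_only_chunk_metadata(table_meta)
```

The original metadata is left untouched, and unrelated fields such as the page number survive the copy.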
…hunks

Replace the parent_id linked list approach for table reconstruction with
explicit chunk sequencing metadata per ML-1020:
- table_id: shared UUID for all chunks from the same table
- chunk_index: 0-based position in the chunk sequence
- total_chunks: total number of chunks for the table

Update reconstruct_table_from_chunks to group by table_id and order by
chunk_index instead of walking parent_id chains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
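
The grouping and ordering step described above might look roughly like this. It is a sketch, not the PR's implementation: the `Chunk` dataclass and `group_table_chunks` are hypothetical names, and only the `table_id` and `chunk_index` fields come from the commit message.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical chunk record carrying the sequencing metadata."""

    text: str
    table_id: Optional[str] = None
    chunk_index: Optional[int] = None


def group_table_chunks(chunks: List[Chunk]) -> Dict[str, List[Chunk]]:
    """Group chunks by table_id and sort each group by chunk_index."""
    groups: Dict[str, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        if chunk.table_id is not None:  # chunks without an id are not split tables
            groups[chunk.table_id].append(chunk)
    return {
        table_id: sorted(group, key=lambda c: c.chunk_index)
        for table_id, group in groups.items()
    }


chunks = [
    Chunk("rows 3-4", table_id="t1", chunk_index=1),
    Chunk("a paragraph"),
    Chunk("rows 1-2", table_id="t1", chunk_index=0),
    Chunk("other table", table_id="t2", chunk_index=0),
]
grouped = group_table_chunks(chunks)
```

Unlike walking a parent_id chain, this approach does not depend on the chunks arriving in order or on every link in the chain being present.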

cragwolfe added a commit that referenced this pull request

Mar 25, 2026
Fixes high-severity table reconstruction corruption by merging only top-level rows from each chunk's table HTML, preventing nested table rows from being hoisted.

Adds regression coverage for nested-table HTML reconstruction.

Finding reference: #4291 (comment)

qued and others added 5 commits

March 25, 2026 17:42
Remove total_chunks per business guidance. Keep table_id and
chunk_index for table reconstruction. Add tests verifying chunk
sequencing metadata is set correctly on split tables and absent
on unsplit tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
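
The behavior these tests check can be illustrated with a small sketch. `TableChunk` and `label_chunks` are hypothetical names, and using a UUID string for `table_id` follows the earlier commit message rather than any confirmed implementation detail.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableChunk:
    """Hypothetical chunk record for illustration."""

    text: str
    table_id: Optional[str] = None
    chunk_index: Optional[int] = None


def label_chunks(pieces: List[str]) -> List[TableChunk]:
    """Attach sequencing metadata only when a table was actually split."""
    if len(pieces) <= 1:
        # Unsplit table: leave table_id and chunk_index unset.
        return [TableChunk(text=piece) for piece in pieces]
    table_id = str(uuid.uuid4())  # one shared id for all chunks of this table
    return [
        TableChunk(text=piece, table_id=table_id, chunk_index=index)
        for index, piece in enumerate(pieces)
    ]


split = label_chunks(["rows 1-2", "rows 3-4", "rows 5-6"])
unsplit = label_chunks(["whole table"])
```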

qued deleted the ml-1016/tablechunks-can-reconstruct-table branch

March 26, 2026 17:28

github-merge-queue bot pushed a commit that referenced this pull request

Mar 26, 2026
## Summary
- Fix `_merge_table_chunks()` to merge only top-level rows from each
chunk HTML table.
- Prevent nested table rows from being hoisted into the reconstructed
root table.
- Add regression coverage to verify nested table structure is preserved.
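
As a rough illustration of the top-level-only merge described above, here is a sketch using the standard library's ElementTree rather than whatever parser `_merge_table_chunks()` actually uses. The helpers `top_level_rows` and `merge_tables` are hypothetical, and the input is assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import List


def top_level_rows(table: ET.Element) -> List[ET.Element]:
    """Return rows that belong to `table` itself, not to nested tables."""
    rows = []
    for child in table:
        if child.tag == "tr":
            rows.append(child)
        elif child.tag in ("thead", "tbody", "tfoot"):
            # Rows inside a section wrapper still belong to this table.
            rows.extend(row for row in child if row.tag == "tr")
    return rows


def merge_tables(html_chunks: List[str]) -> ET.Element:
    """Merge chunk tables into one table using only their top-level rows."""
    merged = ET.Element("table")
    for html in html_chunks:
        merged.extend(top_level_rows(ET.fromstring(html)))
    return merged


chunk_a = "<table><tr><td>outer 1<table><tr><td>inner</td></tr></table></td></tr></table>"
chunk_b = "<table><tbody><tr><td>outer 2</td></tr></tbody></table>"
merged = merge_tables([chunk_a, chunk_b])
```

A descendant search such as `table.iter("tr")` would also return the inner row and hoist it into the root table, which is the corruption this change avoids.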

## Finding Reference
- #4291 (comment)

## Validation
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q
test_unstructured/chunking/test_base.py -k
"reconstruct_tables_from_a_mixed_element_list or
preserves_nested_table_structure" --maxfail=1`
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q
test_unstructured/chunking/test_base.py
test_unstructured/chunking/test_dispatch.py --maxfail=1`
- `unset VIRTUAL_ENV && uv run --no-sync python - <<'PY'
from unstructured.partition.text import partition_text

elements = partition_text(text="Codex initializer smoke test")
assert elements, "partition_text returned no elements"
print(f"partition_text smoke check passed ({len(elements)} elements)")
PY`
- `unset VIRTUAL_ENV && CI=false uv run --no-sync pytest -q
test_unstructured/partition/test_text.py --maxfail=1`

authored by codex