feat: Add metadata-only replace API to Table for REPLACE snapshot operations by qzyu999 · Pull Request #3131 · apache/iceberg-python
Closes #3130
Rationale for this change
In an open PR (#3124, part of #1092), the proposed replace() API accepts a PyArrow table (pa.Table), forcing the table engine to physically serialize data during a metadata transaction commit. This couples execution with the catalog, diverges from the behavior of Java Iceberg's native RewriteFiles builder, and produces a snapshot that is not registered under Operation.REPLACE.
This PR redesigns table.replace() and transaction.replace() to accept Iterable[DataFile] inputs. With physical data writing externalized (e.g., compaction via Ray), the new explicit metadata-only _RewriteFiles SnapshotProducer can swap snapshot pointers in the manifests natively, with DELETED entries inheriting their original sequence numbers to ensure time-travel equivalence.
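The bookkeeping described above can be illustrated with a small toy model. This is a sketch only, not the PR's `_RewriteFiles` code: `Entry`, `Status`, and `rewrite_entries` are hypothetical names, files are identified by path alone, and dropping already-deleted entries is an assumption of the toy. The status values follow the Iceberg manifest entry convention (0 existing, 1 added, 2 deleted).

```python
from dataclasses import dataclass, replace as dc_replace
from enum import Enum
from typing import Iterable, List


class Status(Enum):
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class Entry:
    """Toy manifest entry (hypothetical): a data file plus its bookkeeping."""
    path: str
    status: Status
    sequence_number: int


def rewrite_entries(
    current: Iterable[Entry],
    delete_paths: Iterable[str],
    add_paths: Iterable[str],
    new_sequence_number: int,
) -> List[Entry]:
    """Return the entries a metadata-only rewrite would record in this toy."""
    to_delete = set(delete_paths)
    out: List[Entry] = []
    for entry in current:
        if entry.status is Status.DELETED:
            continue  # assumption: old tombstones are not carried forward
        if entry.path in to_delete:
            # Mark the file DELETED but keep its original sequence number.
            out.append(dc_replace(entry, status=Status.DELETED))
        else:
            out.append(dc_replace(entry, status=Status.EXISTING))
    # Files written externally enter with the new snapshot's sequence number.
    out.extend(Entry(p, Status.ADDED, new_sequence_number) for p in add_paths)
    return out
```

No data is read or written anywhere in the function; only entry status changes, which is the sense in which the operation is metadata-only.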
Are these changes tested?
Yes.
Test coverage has been added to tests/table/test_replace.py. The suite validates:
- Context manager executions tracking valid history growth (len(table.history())).
- Snapshot summary bindings asserting strict Operation.REPLACE tags.
- Accurate evaluation of delta metrics (added/deleted file and record counts).
- Low-level serialization: bypasses the high-level discard filters via manifest.fetch_manifest_entry(discard_deleted=False) to assert that status=DELETED overrides preserve the original Avro sequence numbers.
- Idempotent edge cases where replace([], []) short-circuits the commit loop without mutating history.
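The delta metrics and the empty-input short-circuit checked above can be sketched as follows. This is an illustrative stand-in, not code from the PR or the test suite: `FileStats` and `replace_summary` are hypothetical names, and emitting every counter even when it is zero is an assumption. The keys are the standard Iceberg snapshot summary field names, whose values are strings.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for the parts of a DataFile used here."""
    path: str
    record_count: int


def replace_summary(
    files_to_delete: Iterable[FileStats],
    files_to_add: Iterable[FileStats],
) -> Optional[Dict[str, str]]:
    """Build a REPLACE-style summary, or None when there is nothing to commit."""
    deleted, added = list(files_to_delete), list(files_to_add)
    if not deleted and not added:
        return None  # short-circuit: no snapshot, so history is unchanged
    return {
        "operation": "replace",
        "added-data-files": str(len(added)),
        "deleted-data-files": str(len(deleted)),
        "added-records": str(sum(f.record_count for f in added)),
        "deleted-records": str(sum(f.record_count for f in deleted)),
    }
```

For a pure compaction, the added and deleted record counts should match, which is a cheap invariant a test can assert.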
Are there any user-facing changes?
Yes.
The method signatures of Table.replace() and Transaction.replace() have changed from those in the original PR #3124.
They no longer accept a PyArrow table (df: pa.Table). Instead, they take two arguments:
files_to_delete: Iterable[DataFile] and files_to_add: Iterable[DataFile], following the convention of the Java implementation.
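A minimal sketch of the new call shape, assuming the parameter names above and a `None` return (the return type is not stated here). `SupportsReplace`, `swap_compacted_files`, and `RecordingTable` are hypothetical helpers so the example runs without a catalog; a real caller would pass a pyiceberg Table and DataFile objects produced by an external writer.

```python
from typing import Any, Iterable, List, Protocol, Tuple


class SupportsReplace(Protocol):
    """Structural type for any object exposing the replace() described above."""

    def replace(
        self, files_to_delete: Iterable[Any], files_to_add: Iterable[Any]
    ) -> None: ...


def swap_compacted_files(
    table: SupportsReplace, old_files: Iterable[Any], new_files: Iterable[Any]
) -> None:
    """Commit a compaction whose output files were already written elsewhere."""
    # Keyword arguments make the delete/add direction explicit at the call site.
    table.replace(files_to_delete=list(old_files), files_to_add=list(new_files))


class RecordingTable:
    """Hypothetical test double that only records what it was asked to do."""

    def __init__(self) -> None:
        self.calls: List[Tuple[list, list]] = []

    def replace(self, files_to_delete: Iterable[Any], files_to_add: Iterable[Any]) -> None:
        self.calls.append((list(files_to_delete), list(files_to_add)))
```

Passing both arguments by keyword guards against accidentally swapping the two iterables, since they share a type.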
(Please add the changelog label)
AI Disclosure
AI was used to help understand the code base and draft code changes. All code changes have been thoroughly reviewed to ensure they are in line with a broader understanding of the codebase.
- Worth deeper review after AI-assistance:
  - The test_invalid_operation() in tests/table/test_snapshots.py previously used Operation.REPLACE as a value to test invalid operations, but with this change Operation.REPLACE becomes valid. In its place I put a dummy Operation.
  - The _RewriteFiles in pyiceberg/table/update/snapshot.py overrides the _deleted_entries and _existing_manifests functions. I sought to test thoroughly that this was done correctly. I think the test suite could be made more rigorous, and I am open to suggestions on how that could be done.