Optimization: Plugin MergeManifest into snapshot overwrite operation by gabeiglio · Pull Request #3103 · apache/iceberg-python
Rationale for this change
Following up on the optimizations made while benchmarking overwrites, I noticed that each full overwrite of a partition linearly increases the number of manifest files for that partition, even though only one of those manifests contains live data. As a result, each new overwrite takes a bit longer than the previous one.
For example, after 20 full overwrites of a single partition, the manifests table looks like this:
> select partition_summaries from default.table.manifests where partition_summaries[0]['lower_bound'] = 20250101;
[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
...
[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
Time taken: 0.137 seconds, Fetched 21 row(s)
With merge enabled for overwrites, the same 20 iterations leave:
> select partition_summaries from default.table.manifests where partition_summaries[0]['lower_bound'] = 20250101;
[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
[{"contains_null":false,"contains_nan":false,"lower_bound":"20250101","upper_bound":"20250101"}]
Time taken: 0.137 seconds, Fetched 2 row(s)
So there are two changes being made in this PR:
- Make _OverwriteFiles merge manifests before committing
- While merging, filter out manifests that contain no live data
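The effect of these two changes can be illustrated with a toy model. This is only a sketch and not the pyiceberg implementation: `Status`, `Entry`, `Manifest`, `overwrite`, and `simulate` are hypothetical names invented for this example, and the merge step is simplified to drop deleted entries and fold all remaining live entries into a single manifest. Exact manifest counts on a real table will differ (the query output above shows 2 with merging enabled).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass
class Entry:
    path: str
    status: Status


@dataclass
class Manifest:
    entries: list[Entry] = field(default_factory=list)

    def has_live_data(self) -> bool:
        # A manifest is "live" if at least one entry is not deleted.
        return any(e.status is not Status.DELETED for e in self.entries)


def overwrite(manifests: list[Manifest], new_path: str, merge: bool) -> list[Manifest]:
    """Toy full overwrite: delete everything, then add one new data file."""
    # Rewrite every existing manifest with all of its entries marked DELETED.
    rewritten = [
        Manifest([Entry(e.path, Status.DELETED) for e in m.entries])
        for m in manifests
    ]
    # The new data lands in its own manifest.
    rewritten.append(Manifest([Entry(new_path, Status.ADDED)]))
    if not merge:
        return rewritten

    # Simplified stand-in for the two changes listed above:
    # 1. filter out manifests that contain no live data,
    # 2. merge what remains into a single manifest before committing.
    live = [m for m in rewritten if m.has_live_data()]
    merged_entries = [
        e for m in live for e in m.entries if e.status is not Status.DELETED
    ]
    return [Manifest(merged_entries)]


def simulate(iterations: int, merge: bool) -> list[Manifest]:
    """Run `iterations` full overwrites against an initially empty table."""
    manifests: list[Manifest] = []
    for i in range(iterations):
        manifests = overwrite(manifests, f"data-{i}.parquet", merge)
    return manifests
```

In this model, `len(simulate(20, merge=False))` grows to 20 with only one manifest holding live data, while `len(simulate(20, merge=True))` stays at 1.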
Are these changes tested?
Created tests/integration/test_writes/test_manifest_merging.py with three integration tests. They cover overwrites and appends, with and without manifest merging enabled, and check both data correctness and the resulting number of manifests.
Are there any user-facing changes?
Users may see fewer manifests as a result of overwrite operations.