feat: use content defined chunking by kszucs · Pull Request #7589 · huggingface/datasets
Use content defined chunking by default when writing parquet files.
- set the parameters in io.parquet.ParquetDatasetReader
- set the parameters in arrow_writer.ParquetWriter
It requires a new pyarrow pin ">=21.0.0", which has now been released.
kszucs changed the title from "feat: use content defined chunking in io.parquet.ParquetDatasetReader" to "feat: use content defined chunking"
We should consider enabling page indexes by default when writing parquet files to enable page pruning readers like the next dataset viewer huggingface/dataset-viewer#3199
kszucs marked this pull request as ready for review
  # Batch size constants. For more info, see:
  # https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations
- DEFAULT_MAX_BATCH_SIZE = 1000
+ DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
This is the default Arrow row group size. If we choose too small a row group size, we cannot profit as much from CDC chunking.
We need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024
maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups
would it make sense to use the default row group size, and expect the readers will rely on the pages index to fetch only the required bits? Not sure if it exists in duckdb.
most frameworks read row group by row group, that's why we need them to be of reasonable size anyways
Where would the page indexes be stored? In a custom section of the Parquet file metadata? Is it standardized or ad hoc?
OK, I just RTFM:
write_page_index : bool, default False
Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.
I updated the PR to write row groups of 100MB (uncompressed) instead of relying on a bigger default number of rows per row group. Let me know what you think :)
This way it should provide good performance for CDC while letting the Viewer work correctly without OOM.
Have you tried it on some datasets? I think it's important to choose a good default, as it will impact all datasets and is costly to recompute if we need to change it. Not saying the value is bad (I don't know), but it would be good to have some validation before setting it.
I will also run the estimator to check the impact on deduplication. In general smaller row groups decrease the deduplication efficiency, so ideally we should use bigger row groups with page indexes.
BTW could we use smaller row groups only for refs/convert/parquet?
I reproduced the example in the blog post.
It uploads 60MB of new data for each shard of 210MB of filtered OpenHermes2.5 (after shuffling).
The baseline is 20MB for one single-row-group shard with pandas (unshuffled in the blog post), which is not impacted by file boundaries and row group boundaries.
(fyi increasing the row group size from 100MB to 200MB gives roughly the same result of 60MB of new data per shard of 210MB)
I just added write_page_index=True as well. It didn't affect dedupe perf on the OpenHermes example.
Merging now :) Let me know if there are other things to change before I can do a new release