feat: use content defined chunking by kszucs · Pull Request #7589 · huggingface/datasets

@kszucs

Use content defined chunking by default when writing parquet files.

  • set the parameters in io.parquet.ParquetDatasetReader
  • set the parameters in arrow_writer.ParquetWriter

This requires a new pyarrow pin, ">=21.0.0", which has now been released.

@kszucs kszucs changed the title from "feat: use content defined chunking in io.parquet.ParquetDatasetReader" to "feat: use content defined chunking"

May 29, 2025

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kszucs

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

@kszucs

We should consider enabling page indexes by default when writing parquet files, to support page-pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

@kszucs kszucs marked this pull request as ready for review

July 25, 2025 10:59

kszucs

# Batch size constants. For more info, see:
# https://github.com/apache/arrow/blob/master/docs/source/cpp/arrays.rst#size-limitations-and-recommendations
- DEFAULT_MAX_BATCH_SIZE = 1000
+ DEFAULT_MAX_BATCH_SIZE = 1024 * 1024


This is the default Arrow row group size. If we choose too small a row group size, we can't benefit as much from CDC chunking.

@kszucs

@lhoestq

Need to set DEFAULT_MAX_BATCH_SIZE = 1024 * 1024

maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups

@severo

maybe we'll need to auto-tweak the row group size to aim for a [30MB-300MB] interval, or we can end up with multiple GBs row groups

We should consider enabling page indexes by default when writing parquet files, to support page-pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

would it make sense to use the default row group size, and expect readers to rely on the page index to fetch only the required bits? Not sure if it exists in duckdb.

@lhoestq

would it make sense to use the default row group size, and expect readers to rely on the page index to fetch only the required bits? Not sure if it exists in duckdb.

most frameworks read row group by row group; that's why we need them to be of a reasonable size anyway

@severo

We should consider enabling page indexes by default when writing parquet files, to support page-pruning readers like the next dataset viewer huggingface/dataset-viewer#3199

where would the page indexes be stored? in the custom section in the Parquet file metadata? Is it standardized or ad hoc?

OK, I just RTFM:

write_page_index: bool, default False

Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read side by PyArrow.

@lhoestq

I updated the PR to write row groups of 100MB (uncompressed) instead of relying on a bigger default number of rows per row group. Let me know what you think :)

This way it should provide good performance for CDC while letting the Viewer work correctly without OOM.

@severo

Have you tried it on some datasets? I think it's important to choose a good default, since it will impact all datasets and recomputing would be costly if we need to change it later. I'm not saying the value is bad, I don't know, but it would be good to have some validation before setting it.

@kszucs

I will also run the estimator to check the impact on deduplication. In general smaller row groups decrease the deduplication efficiency, so ideally we should use bigger row groups with page indexes.

BTW could we use smaller row groups only for refs/convert/parquet?

@lhoestq

I reproduced the example in the blog post.

It uploads 60MB of new data for each 210MB shard of filtered OpenHermes2.5 (after shuffling).

The baseline is 20MB for one single-row-group shard with pandas (unshuffled in the blog post), which is not impacted by file boundaries and row group boundaries.

(fyi increasing the row group size from 100MB to 200MB gives roughly the same result of 60MB of new data per shard of 210MB)

@lhoestq

I just added write_page_index=True as well. It didn't affect dedupe perf on the OpenHermes example.

Merging now :) Let me know if there are other things to change before I can do a new release
