Releases · huggingface/datasets
4.6.1
4.6.0
Dataset Features
- Support Image, Video and Audio types in Lance datasets

  >>> from datasets import load_dataset
  >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
  >>> ds.features
  {'video_blob': Video(), 'video_path': Value('string'), 'caption': Value('string'), 'aesthetic_score': Value('float64'), 'motion_score': Value('float64'), 'temporal_consistency_score': Value('float64'), 'camera_motion': Value('string'), 'frame': Value('int64'), 'fps': Value('float64'), 'seconds': Value('float64'), 'embedding': List(Value('float32'), length=1024)}
- Push to hub now supports Video types

  >>> from datasets import Dataset, Video
  >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
  >>> ds = ds.cast_column("video", Video())
  >>> ds.push_to_hub("username/my-video-dataset")
- Write image/audio/video blobs as-is in Parquet (PLAIN encoding) in push_to_hub() by @lhoestq in #7976
  - This enables cross-format Xet deduplication for image/audio/video data, e.g. deduplicating videos between Lance, WebDataset, Parquet files and plain video files, which makes downloads and uploads to Hugging Face faster
  - E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload is much faster since the videos don't need to be re-uploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
  - See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
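This cross-format reuse works because Xet addresses content-defined chunks rather than whole files. The idea can be illustrated with a toy chunker (a simplified stand-in, not Xet's actual algorithm): chunk boundaries are derived from the bytes themselves, so the same payload produces the same interior chunks even when wrapped in different container formats.

```python
import hashlib

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x0F) -> list[bytes]:
    """Toy content-defined chunking: cut a chunk whenever the hash of the
    last `window` bytes matches a mask. Boundaries depend only on nearby
    content, so identical payloads chunk identically regardless of the
    bytes that surround them."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        digest = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

# The same media payload wrapped in two different container formats:
payload = bytes(range(256)) * 64
container_a = b"LANCE-HEADER" + payload + b"LANCE-FOOTER"
container_b = b"PARQUET-HEADER" + payload + b"PARQUET-FOOTER"

hashes_a = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(container_a)}
hashes_b = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(container_b)}

# Interior payload chunks are identical in both containers, so a
# chunk-addressed store only needs the few header/footer-adjacent chunks
# when the second container is uploaded.
print(f"{len(hashes_a & hashes_b)} shared chunks out of {len(hashes_b)}")
```

Moving a payload into a different container shifts its byte offset but not its chunking, which is why only the container-specific chunks need to be transferred.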
- Add IterableDataset.reshard() by @lhoestq in #7992
  Reshard the dataset if possible, i.e. split the current shards further into more shards.
  This increases the number of shards, and the resulting dataset has num_shards >= previous_num_shards.
  Equality may happen if no shard can be split further. The resharding mechanism depends on the dataset file format:
  - Parquet: shard per row group instead of per file
  - Other: not implemented yet (contributions are welcome!)

  >>> from datasets import load_dataset
  >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
  >>> ds
  IterableDataset({
      features: ['label', 'title', 'content'],
      num_shards: 4
  })
  >>> ds.reshard()
  IterableDataset({
      features: ['label', 'title', 'content'],
      num_shards: 3600
  })
What's Changed
- Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in #7919
- Revert "feat: avoid some copies in torch formatter (#7787)" by @lhoestq in #7961
- docs: fix grammar and add type hints in splits.py by @Edge-Explorer in #7960
- Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in #7955
- Add examples for Lance datasets by @prrao87 in #7950
- Support null in json string cols by @lhoestq in #7963
- handle blob lance by @lhoestq in #7964
- Count examples in lance by @lhoestq in #7969
- Use temp files in push_to_hub to save memory by @lhoestq in #7979
- Drop python 3.9 by @lhoestq in #7980
- Support pandas 3 by @lhoestq in #7981
- Remove unused data files optims by @lhoestq in #7985
- Remove pre-release workaround in CI for transformers v5 and huggingface_hub v1 by @hanouticelina in #7989
- very basic support for more hf urls by @lhoestq in #8003
- Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in #7995
- Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in #8000
- More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in #8009
- Support empty shard in from_generator by @lhoestq in #8023
- Allow import polars in map() by @lhoestq in #8024
New Contributors
- @omarfarhoud made their first contribution in #7919
- @Edge-Explorer made their first contribution in #7960
- @prathamk-tw made their first contribution in #7955
- @prrao87 made their first contribution in #7950
- @hanouticelina made their first contribution in #7989
- @jayzuccarelli made their first contribution in #7995
- @AnkitAhlawat7742 made their first contribution in #8000
Full Changelog: 4.5.0...4.6.0
4.5.0
Dataset Features
- Add lance format support by @eddyxu in #7913
  - Support for both Lance datasets (including metadata / manifests) and standalone .lance files
  - E.g. with lance-format/fineweb-edu:

    from datasets import load_dataset
    ds = load_dataset("lance-format/fineweb-edu", streaming=True)
    for example in ds["train"]:
        ...
What's Changed
- Raise early for invalid revision in load_dataset by @Scott-Simmons in #7929
- fix low but large example indexerror by @CloseChoice in #7912
- Fix method to retrieve attributes from file object by @lhoestq in #7938
- add _OverridableIOWrapper by @lhoestq in #7942
- Add _generate_shards by @lhoestq in #7943
New Contributors
- @eddyxu made their first contribution in #7913
- @Scott-Simmons made their first contribution in #7929
Full Changelog: 4.4.2...4.5.0
4.4.2
Bug fixes
- Fix embed storage nifti by @CloseChoice in #7853
- ArXiv -> HF Papers by @qgallouedec in #7855
- fix some broken links by @julien-c in #7859
- Nifti visualization support by @CloseChoice in #7874
- Replace papaya with niivue by @CloseChoice in #7878
- Fix 7846: add_column and add_item erroneously(?) require new_fingerprint parameter by @sajmaru in #7884
- fix(fingerprint): treat TMPDIR as strict API and fail (Issue #7877) by @ada-ggf25 in #7891
- encode nifti correctly when uploading lazily by @CloseChoice in #7892
- fix(nifti): enable lazy loading for Nifti1ImageWrapper by @The-Obstacle-Is-The-Way in #7887
Minor additions
- Add type overloads to load_dataset for better static type inference by @Aditya2755 in #7888
- Add inspect_ai eval logs support by @lhoestq in #7899
- Save input shard lengths by @lhoestq in #7897
- Don't save original_shard_lengths by default for backward compat by @lhoestq in #7906
New Contributors
- @sajmaru made their first contribution in #7884
- @Aditya2755 made their first contribution in #7888
- @ada-ggf25 made their first contribution in #7891
- @The-Obstacle-Is-The-Way made their first contribution in #7887
Full Changelog: 4.4.1...4.4.2
4.4.1
4.4.0
Dataset Features
- Add nifti support by @CloseChoice in #7815
  - Load medical imaging datasets from Hugging Face:

    from datasets import load_dataset
    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}

  - Load medical imaging datasets from your disk:

    from datasets import Dataset, Nifti
    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
- Documentation: https://huggingface.co/docs/datasets/nifti_dataset
- Add num_channels to Audio by @CloseChoice in #7840

  # samples have shape (num_channels, num_samples)
  ds = ds.cast_column("audio", Audio())                # default: use all channels
  ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
  ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono
What's Changed
- Fix random seed on shuffle and interleave_datasets by @CloseChoice in #7823
- fix ci compressionfs by @lhoestq in #7830
- fix: better args passthrough for _batch_setitems() by @sghng in #7817
- Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in #7833
- resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in #7831
- fix column with transform by @lhoestq in #7843
- support fsspec 2025.10.0 by @lhoestq in #7844
New Contributors
- @sghng made their first contribution in #7817
- @art-test-stack made their first contribution in #7833
Full Changelog: 4.3.0...4.4.0
4.3.0
4.2.0
Dataset Features
- Sample without replacement option when interleaving datasets by @radulescupetru in #7786

  ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")

- Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806

  ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
- Add parquet scan options and docs by @lhoestq in #7801
  - Docs on selecting columns and filtering data efficiently:

    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])

  - New argument to control buffering and caching when streaming:

    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
What's Changed
- Document HDF5 support by @klamike in #7740
- update tips in docs by @lhoestq in #7790
- feat: avoid some copies in torch formatter by @drbh in #7787
- Support huggingface_hub v0.x and v1.x by @Wauplin in #7783
- Define CI future by @lhoestq in #7799
- More Parquet streaming docs by @lhoestq in #7803
- Less api calls when resolving data_files by @lhoestq in #7805
- typo by @lhoestq in #7807
Full Changelog: 4.1.1...4.2.0
4.1.1
4.1.0
Dataset Features
- feat: use content defined chunking by @kszucs in #7589
  - Parquet datasets are now Optimized Parquet!
  - Internally uses use_content_defined_chunking=True when writing Parquet files
  - This enables fast deduped uploads to Hugging Face!

    # Now faster thanks to content-defined chunking
    ds.push_to_hub("username/dataset_name")

  - This optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content-defined chunking places Parquet page boundaries based on the content of the data, in order to detect duplicate data easily
  - With this change, the new default row group size for Parquet is 100MB
  - write_page_index=True is also used, to enable fast random access for the Dataset Viewer and tools that need it
- HDF5 support by @klamike in #7690
  - Load HDF5 datasets in one line of code:

    ds = load_dataset("username/dataset-with-hdf5-files")

  - Each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
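To make the row mapping concrete, here is a sketch using h5py directly (hypothetical file and field names; the nested-column naming in the comments is illustrative, not the library's exact output):

```python
import numpy as np
import h5py

# Write a small HDF5 file: one top-level field and one nested in a group.
with h5py.File("scans.h5", "w") as f:
    f.create_dataset("scan", data=np.arange(12, dtype=np.float32).reshape(4, 3))
    f.create_group("meta").create_dataset("patient_id", data=np.arange(4, dtype=np.int64))

# Following the rule above, the first dimension indexes rows: this file
# would load as 4 rows, e.g. a "scan" column of length-3 float lists
# and a nested "meta/patient_id" column of scalar ints.
with h5py.File("scans.h5", "r") as f:
    num_rows = f["scan"].shape[0]
    row_0 = {"scan": f["scan"][0].tolist(), "patient_id": int(f["meta/patient_id"][0])}

print(num_rows, row_0)
```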
Other improvements and bug fixes
- Convert to string when needed + faster .zstd by @lhoestq in #7683
- fix audio cast storage from array + sampling_rate by @lhoestq in #7684
- Fix misleading add_column() usage example in docstring by @ArjunJagdale in #7648
- Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in #7438
- Update fsspec max version to current release 2025.7.0 by @rootAvish in #7701
- Update dataset_dict push_to_hub by @lhoestq in #7711
- Retry intermediate commits too by @lhoestq in #7712
- num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in #7702
- Update cli.mdx to refer to the new "hf" CLI by @evalstate in #7713
- fix num_proc=1 ci test by @lhoestq in #7714
- Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in #7715
- typo by @lhoestq in #7716
- fix largelist repr by @lhoestq in #7735
- Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in #7730
- Fix type hint train_test_split by @qgallouedec in #7736
- fix(webdataset): don't .lower() field_name by @YassineYousfi in #7726
- Refactor HDF5 and preserve tree structure by @klamike in #7743
- docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in #7737
- Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in #7761
- Support pathlib.Path for feature input by @Joshua-Chin in #7755
- add support for pyarrow string view in features by @onursatici in #7718
- Fix typo in error message for cache directory deletion by @brchristian in #7749
- update torchcodec in ci by @lhoestq in #7764
- Bump dill to 0.4.0 by @Bomme in #7763
New Contributors
- @DavidRConnell made their first contribution in #7438
- @rootAvish made their first contribution in #7701
- @tanuj-rai made their first contribution in #7702
- @evalstate made their first contribution in #7713
- @brchristian made their first contribution in #7730
- @klamike made their first contribution in #7690
- @YassineYousfi made their first contribution in #7726
- @Sanjaykumar030 made their first contribution in #7737
- @kszucs made their first contribution in #7589
- @Joshua-Chin made their first contribution in #7755
- @onursatici made their first contribution in #7718
- @Bomme made their first contribution in #7763
Full Changelog: 4.0.0...4.1.0