Releases · huggingface/datasets

4.6.1

4.6.0

Dataset Features

  • Support Image, Video and Audio types in Lance datasets

    >>> from datasets import load_dataset
    >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
    >>> ds.features
    {'video_blob': Video(),
     'video_path': Value('string'),
     'caption': Value('string'),
     'aesthetic_score': Value('float64'),
     'motion_score': Value('float64'),
     'temporal_consistency_score': Value('float64'),
     'camera_motion': Value('string'),
     'frame': Value('int64'),
     'fps': Value('float64'),
     'seconds': Value('float64'),
     'embedding': List(Value('float32'), length=1024)}
  • Push to hub now supports Video types

    >>> from datasets import Dataset, Video
    >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
    >>> ds = ds.cast_column("video", Video())
    >>> ds.push_to_hub("username/my-video-dataset")
  • Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in #7976

    • this enables cross-format Xet deduplication for image, audio and video data: for example, videos can be deduplicated across Lance files, WebDataset archives, Parquet files and plain video files, which makes downloads and uploads to Hugging Face faster
    • E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload is much faster since the videos don't need to be re-uploaded. Under the hood, the Xet storage reuses the binary chunks of the videos in the Lance files for the videos in the Parquet files
    • See more info here: https://huggingface.co/docs/hub/en/xet/deduplication

  • Add IterableDataset.reshard() by @lhoestq in #7992

    Reshard the dataset if possible, i.e. split the current shards further into more shards.
    This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
    Equality may happen if no shard can be split further.

    The resharding mechanism depends on the dataset file format:

    • Parquet: shard per row group instead of per file
    • Other: not implemented yet (contributions are welcome!)
    >>> from datasets import load_dataset
    >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
    >>> ds
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 4
    })
    >>> ds.reshard()
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 3600
    })

What's Changed

New Contributors

Full Changelog: 4.5.0...4.6.0

4.5.0

Dataset Features

  • Add lance format support by @eddyxu in #7913

    • Support for both Lance dataset (including metadata / manifests) and standalone .lance files
    • e.g. with lance-format/fineweb-edu
    from datasets import load_dataset
    
    ds = load_dataset("lance-format/fineweb-edu", streaming=True)
    for example in ds["train"]:
        ...

What's Changed

New Contributors

Full Changelog: 4.4.2...4.5.0

4.4.2

Bug fixes

Minor additions

  • Add type overloads to load_dataset for better static type inference by @Aditya2755 in #7888
  • Add inspect_ai eval logs support by @lhoestq in #7899
  • Save input shard lengths by @lhoestq in #7897
  • Don't save original_shard_lengths by default for backward compat by @lhoestq in #7906

New Contributors

Full Changelog: 4.4.1...4.4.2

4.4.1

4.4.0

Dataset Features

  • Add nifti support by @CloseChoice in #7815

    • Load medical imaging datasets from Hugging Face:
    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
    • Load medical imaging datasets from your disk:
    from datasets import Dataset, Nifti
    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
  • Add num channels to audio by @CloseChoice in #7840

# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio())  # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono

What's Changed

New Contributors

Full Changelog: 4.3.0...4.4.0

4.3.0

4.2.0

Dataset Features

  • Sample without replacement option when interleaving datasets by @radulescupetru in #7786

    ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  • Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806

    ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  • Add parquet scan options and docs by @lhoestq in #7801

    • docs to select columns and filter data efficiently
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    • new argument to control buffering and caching when streaming
    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
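    The column selection and filter pushdown above can also be tried end-to-end against a local Parquet file; the file path and column names below are made up for illustration, and the snippet uses the generic "parquet" builder with local data_files instead of a Hub dataset id:

    ```python
    import os
    import tempfile

    import pyarrow as pa
    import pyarrow.parquet as pq
    from datasets import load_dataset

    # Write a small local Parquet file standing in for a dataset on the Hub
    path = os.path.join(tempfile.mkdtemp(), "demo.parquet")
    pq.write_table(pa.table({"col_0": [0, 0, 1], "col_1": ["a", "b", "c"]}), path)

    # filters= is forwarded to the Parquet reader, which skips non-matching data
    ds = load_dataset("parquet", data_files=path, split="train", filters=[("col_0", "==", 0)])
    ```

    Only the two rows with col_0 == 0 end up in ds.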

What's Changed

New Contributors

Full Changelog: 4.1.1...4.2.0

4.1.1

4.1.0

Dataset Features

  • feat: use content defined chunking by @kszucs in #7589

    • internally uses use_content_defined_chunking=True when writing Parquet files
    • this enables fast deduped uploads to Hugging Face !
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    • this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
    • with this change, the new default row group size for Parquet is set to 100MB
    • write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
  • Concurrent push_to_hub by @lhoestq in #7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in #7710

  • HDF5 support by @klamike in #7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
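    As a rough, self-contained sketch of that parsing rule (the file layout is invented, and it assumes the packaged "hdf5" builder accepts local data_files; requires h5py):

    ```python
    import os
    import tempfile

    import h5py
    import numpy as np
    from datasets import load_dataset

    # Build a small HDF5 file: each dataset inside the file becomes a column,
    # and the first dimension indexes the rows
    path = os.path.join(tempfile.mkdtemp(), "example.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("label", data=np.arange(4))
        f.create_dataset("embedding", data=np.zeros((4, 8), dtype=np.float32))

    ds = load_dataset("hdf5", data_files=path, split="train")
    ```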

Other improvements and bug fixes

New Contributors

Full Changelog: 4.0.0...4.1.0