Releases · huggingface/datasets

4.6.1

4.6.0

Dataset Features

  • Support Image, Video and Audio types in Lance datasets

    >>> from datasets import load_dataset
    >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
    >>> ds.features
    {'video_blob': Video(),
     'video_path': Value('string'),
     'caption': Value('string'),
     'aesthetic_score': Value('float64'),
     'motion_score': Value('float64'),
     'temporal_consistency_score': Value('float64'),
     'camera_motion': Value('string'),
     'frame': Value('int64'),
     'fps': Value('float64'),
     'seconds': Value('float64'),
     'embedding': List(Value('float32'), length=1024)}
  • Push to hub now supports Video types

    >>> from datasets import Dataset, Video
    >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
    >>> ds = ds.cast_column("video", Video())
    >>> ds.push_to_hub("username/my-video-dataset")
  • Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in #7976

    • this enables cross-format Xet deduplication for image, audio and video data: for example, videos can be deduplicated across Lance files, WebDataset archives, Parquet files and plain video files, which makes downloads and uploads to Hugging Face faster
    • E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload is much faster since the videos don't need to be re-uploaded. Under the hood, the Xet storage reuses the binary chunks of the videos in the Lance files for the videos in the Parquet files
    • See more info here: https://huggingface.co/docs/hub/en/xet/deduplication

  • Add IterableDataset.reshard() by @lhoestq in #7992

    Reshard the dataset if possible, i.e. split the current shards further into more shards.
    This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
    Equality may happen if no shard can be split further.

    The resharding mechanism depends on the dataset file format:

    • Parquet: shard per row group instead of per file
    • Other: not implemented yet (contributions are welcome!)
    >>> from datasets import load_dataset
    >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
    >>> ds
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 4
    })
    >>> ds.reshard()
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 3600
    })

What's Changed

New Contributors

Full Changelog: 4.5.0...4.6.0

4.5.0

Dataset Features

  • Add lance format support by @eddyxu in #7913

    • Support for both Lance dataset (including metadata / manifests) and standalone .lance files
    • e.g. with lance-format/fineweb-edu
    from datasets import load_dataset
    
    ds = load_dataset("lance-format/fineweb-edu", streaming=True)
    for example in ds["train"]:
        ...

What's Changed

New Contributors

Full Changelog: 4.4.2...4.5.0

4.4.2

Bug fixes

Minor additions

  • Add type overloads to load_dataset for better static type inference by @Aditya2755 in #7888
  • Add inspect_ai eval logs support by @lhoestq in #7899
  • Save input shard lengths by @lhoestq in #7897
  • Don't save original_shard_lengths by default for backward compat by @lhoestq in #7906

New Contributors

Full Changelog: 4.4.1...4.4.2

4.4.1

4.4.0

Dataset Features

  • Add nifti support by @CloseChoice in #7815

    • Load medical imaging datasets from Hugging Face:
    ds = load_dataset("username/my_nifti_dataset")
    ds["train"][0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
    • Load medical imaging datasets from your disk:
    from datasets import Dataset, Nifti
    files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
    ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
    ds[0]  # {"nifti": <nibabel.nifti1.Nifti1Image>}
  • Add num channels to audio by @CloseChoice in #7840

# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio())  # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono

What's Changed

New Contributors

Full Changelog: 4.3.0...4.4.0

4.3.0

4.2.0

Dataset Features

  • Sample without replacement option when interleaving datasets by @radulescupetru in #7786

    ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
  • Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in #7806

    ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
  • Add parquet scan options and docs by @lhoestq in #7801

    • docs to select columns and filter data efficiently
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    • new argument to control buffering and caching when streaming
    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20)
    )
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
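    The column selection and filter pushdown above can also be tried end-to-end against a local Parquet file; the file path and column names below are made up for illustration, and the snippet uses the generic "parquet" builder with local data_files instead of a Hub dataset id:

    ```python
    import os
    import tempfile

    import pyarrow as pa
    import pyarrow.parquet as pq
    from datasets import load_dataset

    # Write a small local Parquet file standing in for a dataset on the Hub
    path = os.path.join(tempfile.mkdtemp(), "demo.parquet")
    pq.write_table(pa.table({"col_0": [0, 0, 1], "col_1": ["a", "b", "c"]}), path)

    # filters= is forwarded to the Parquet reader, which skips non-matching data
    ds = load_dataset("parquet", data_files=path, split="train", filters=[("col_0", "==", 0)])
    ```

    Only the two rows with col_0 == 0 end up in ds.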

What's Changed

New Contributors

Full Changelog: 4.1.1...4.2.0

4.1.1

4.1.0

Dataset Features

  • feat: use content defined chunking by @kszucs in #7589

    • internally uses use_content_defined_chunking=True when writing Parquet files
    • this enables fast deduped uploads to Hugging Face !
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    • this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
    • with this change, the new default row group size for Parquet is set to 100MB
    • write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
  • Concurrent push_to_hub by @lhoestq in #7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in #7710

  • HDF5 support by @klamike in #7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
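    As a rough, self-contained sketch of that parsing rule (the file layout is invented, and it assumes the packaged "hdf5" builder accepts local data_files; requires h5py):

    ```python
    import os
    import tempfile

    import h5py
    import numpy as np
    from datasets import load_dataset

    # Build a small HDF5 file: each dataset inside the file becomes a column,
    # and the first dimension indexes the rows
    path = os.path.join(tempfile.mkdtemp(), "example.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("label", data=np.arange(4))
        f.create_dataset("embedding", data=np.zeros((4, 8), dtype=np.float32))

    ds = load_dataset("hdf5", data_files=path, split="train")
    ```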

Other improvements and bug fixes

New Contributors

Full Changelog: 4.0.0...4.1.0