push_to_hub() for videos by lhoestq · Pull Request #7971 · huggingface/datasets

This is possible now that row group sizes are auto-determined based on the content size after #7589.

Videos are uploaded with PLAIN encoding in Parquet so that they can be seeked remotely, with random access to frames in #7976.
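For illustration, uploading a video dataset might look like the sketch below. It is only a sketch: the helper name `push_video_dataset` and its arguments are hypothetical, and it assumes a `datasets` version that includes this PR, local video files, and a logged-in Hub account.

```python
def push_video_dataset(video_paths, repo_id):
    """Build a dataset with a Video column and upload it to the Hub (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require `datasets`.
    from datasets import Dataset, Video

    # One row per local video file; the column initially holds plain path strings.
    ds = Dataset.from_dict({"video": list(video_paths)})
    # Casting to Video() marks the column as video data rather than strings.
    ds = ds.cast_column("video", Video())
    # Assumption: with this PR, push_to_hub() writes the videos into the Parquet shards.
    ds.push_to_hub(repo_id)
    return ds
```

Calling `push_video_dataset(["clip1.mp4", "clip2.mp4"], "username/my-video-dataset-in-parquet")` would then upload two placeholder clips; the file names here are made up.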

In the future it could be cool to have the same behavior as when videos are separate files, i.e. lazily load them instead of downloading them completely in streaming mode.

Right now there is this discrepancy:

  • load_dataset("username/my-folder-of-videos", streaming=True) -> videos are lazy loaded one by one when iterating, and only actually downloaded when accessing frames in torchcodec
  • load_dataset("username/my-video-dataset-in-parquet", streaming=True) -> videos are downloaded one by one when iterating, even if no frame is accessed in torchcodec
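A minimal way to observe the two behaviors above is to pull a single example in streaming mode and watch network traffic or timing. The helper name `first_streamed_example` is hypothetical, the `"train"` split is an assumption, and the repository names are the placeholders from the list.

```python
def first_streamed_example(repo_id, split="train"):
    """Return the first example of a dataset in streaming mode (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require `datasets`.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset; nothing is fetched until iteration.
    ds = load_dataset(repo_id, split=split, streaming=True)
    # Iterating yields one example at a time; no frame is accessed here.
    return next(iter(ds))
```

Per the description above, `first_streamed_example("username/my-folder-of-videos")` should return without downloading the video, while `first_streamed_example("username/my-video-dataset-in-parquet")` downloads the first video even though no frame is accessed.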

close #7493