push_to_hub() for videos by lhoestq · Pull Request #7971 · huggingface/datasets
This is possible now that row group sizes are determined automatically from the content size (after #7589).
Videos are uploaded with PLAIN encoding in Parquet so that they can be seeked remotely, with random access to frames (#7976).
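For reference, a minimal sketch of what this enables on the user side. The helper name `push_videos_to_hub`, the `video` column name, and the paths and repo id in the usage comment are assumptions for illustration only; `Dataset.from_dict`, `cast_column`, `Video` and `push_to_hub` are the existing `datasets` APIs.

```python
def push_videos_to_hub(video_paths, repo_id):
    """Build a dataset with a Video column from local files and upload it.

    Hypothetical helper, not part of `datasets`. Requires `datasets` with
    video support installed and a logged-in Hugging Face account.
    """
    # Imported lazily so the module can be loaded without `datasets` installed.
    from datasets import Dataset, Video

    # Start from plain file paths, then cast the column to the Video feature.
    ds = Dataset.from_dict({"video": list(video_paths)})
    ds = ds.cast_column("video", Video())

    # Uploads the dataset as Parquet shards to the given Hub repository.
    ds.push_to_hub(repo_id)
    return ds


# Example usage (placeholder paths and repo id):
# push_videos_to_hub(["clip_0.mp4", "clip_1.mp4"], "username/my-video-dataset-in-parquet")
```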
In the future it could be cool to have the same behavior as when videos are separate files, i.e. lazily load them instead of downloading them completely in streaming mode.
Right now there is this discrepancy:
- load_dataset("username/my-folder-of-videos", streaming=True) -> videos are lazy loaded one by one when iterating, and only actually downloaded when accessing frames in torchcodec
- load_dataset("username/my-video-dataset-in-parquet", streaming=True) -> videos are downloaded one by one when iterating, even if no frame is accessed in torchcodec
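A rough way to observe the two behaviors side by side, as a sketch only. The helper name `peek_first_row`, the `train` split, and the `video` column name are assumptions; the repo ids are the placeholders used above.

```python
def peek_first_row(repo_id, split="train"):
    """Stream a dataset and return its first row without touching any frame.

    Hypothetical helper for illustration. Requires `datasets` installed and
    network access to the Hub.
    """
    # Imported lazily so the module can be loaded without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split=split, streaming=True)
    # Iterating yields rows one by one; no frame is decoded at this point.
    return next(iter(ds))


# Example usage (placeholder repo ids):
# row = peek_first_row("username/my-folder-of-videos")           # video stays lazy
# row = peek_first_row("username/my-video-dataset-in-parquet")   # video bytes are downloaded
# Accessing a frame afterwards, assuming the column is named "video" and
# decodes to a torchcodec VideoDecoder:
# frame = row["video"].get_frame_at(0)
```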
close #7493