Add lance format support by eddyxu · Pull Request #7913 · huggingface/datasets

Add lance format as one of the packaged_modules.

import datasets

ds = datasets.load_dataset("org/lance_repo", split="train")

# Or

ds = datasets.load_dataset("./local/data.lance")

Cool! I notice the current implementation doesn't support streaming because of the symlink hack.

I believe you can do something like this instead:

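import lance
import pyarrow as pa

def _generate_tables(self, paths: list[str]):
    for path in paths:
        # Open the Lance dataset directly from its path
        ds = lance.dataset(path)
        for frag_idx, fragment in enumerate(ds.get_fragments()):
            # Read each fragment batch by batch instead of loading it whole
            for batch_idx, batch in enumerate(
                fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
            ):
                table = pa.Table.from_batches([batch])
                table = self._cast_table(table)
                # Key is assumed to be provided by the datasets builder module;
                # (fragment index, batch index) identifies each yielded table
                yield Key(frag_idx, batch_idx), table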

Note that path can be a local path, but also an hf:// URI.
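To make the key scheme concrete without needing lance installed, here is a minimal sketch that mimics the nested loop above with plain Python lists. The names `iter_batches` and `generate_keyed_batches` are hypothetical stand-ins, not part of lance or datasets; a "fragment" here is just a list of rows.

```python
from typing import Iterator

Row = dict
Fragment = list[Row]  # stand-in for a Lance fragment


def iter_batches(fragment: Fragment, batch_size: int) -> Iterator[list[Row]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(fragment), batch_size):
        yield fragment[start:start + batch_size]


def generate_keyed_batches(
    fragments: list[Fragment], batch_size: int
) -> Iterator[tuple[tuple[int, int], list[Row]]]:
    """Yield ((fragment index, batch index), batch) pairs.

    The batch index restarts at 0 for every fragment, so the pair
    is needed to identify a batch uniquely.
    """
    for frag_idx, fragment in enumerate(fragments):
        for batch_idx, batch in enumerate(iter_batches(fragment, batch_size)):
            yield (frag_idx, batch_idx), batch


# Two fragments with 5 and 3 rows, read in batches of 2
fragments = [
    [{"id": i} for i in range(5)],
    [{"id": i} for i in range(5, 8)],
]
keys = [key for key, _ in generate_keyed_batches(fragments, batch_size=2)]
print(keys)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

Because each key encodes both the fragment and the position inside it, a reader that knows the last key it consumed can tell which fragment to reopen.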

I took the liberty to make a few changes :)

Now I believe we should be good:

- both local and streaming modes work
- both Lance datasets and single Lance files work
- all files are now properly downloaded, since all data files and metadata files are included in config.data_files
- sharding is supported:
  - dataset: one shard = one fragment
  - single files: one shard = one file
- resuming a streaming dataset works thanks to Key()
- the two hacks are clearly marked, with TODOs to remove them when possible:
  1. remove the revision in HF URIs, since only "main" is supported
  2. write proper _version/* files, since lance doesn't work if they are symlinks
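For readers wondering what the first hack amounts to, here is a rough sketch, not the code from this PR. It assumes the `hf://datasets/<namespace>/<repo>[@<revision>]/<path>` layout used by huggingface_hub's file system, and `strip_revision` is a hypothetical helper name. It only handles namespaced dataset repos.

```python
import re

# Matches "hf://datasets/<namespace>/<repo>" followed by "@<revision>".
# Assumption: the revision contains no "/" (slashes are URL-encoded).
_REVISION = re.compile(r"^(hf://datasets/[^/@]+/[^/@]+)@[^/]+")


def strip_revision(uri: str) -> str:
    """Drop the "@<revision>" part of an hf:// dataset URI, if present.

    Anything that does not match (local paths, URIs without a
    revision) is returned unchanged.
    """
    return _REVISION.sub(r"\1", uri, count=1)


print(strip_revision("hf://datasets/org/lance_repo@main/data.lance"))
# hf://datasets/org/lance_repo/data.lance
```

Local paths pass through untouched, so the same helper could be applied to every entry in a list of paths regardless of where it points.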

I think this PR is ready, just let me know what you think before we merge 🚀

The next steps are:

- open a PR in this repository to document Lance support in datasets
- open a PR in https://github.com/huggingface/hub-docs to add pylance to the list of integrated libraries on HF, and add some documentation on how to use it with datasets on HF (here is an example PR)
- open a PR in https://github.com/huggingface/huggingface.js to add Lance as a supported dataset library on the HF website (here is an example PR)

Feel free to start some drafts (I noticed there are great examples in your HF account now!), I'll be happy to review :)

And once Lance is available in huggingface.js and the docs are ready, we'll be able to enable the Dataset Viewer and Lance code snippets on HF!