Add lance format support by eddyxu · Pull Request #7913 · huggingface/datasets

@eddyxu

Add lance format as one of the packaged_modules.

import datasets

ds = datasets.load_dataset("org/lance_repo", split="train")

# Or

ds = datasets.load_dataset("./local/data.lance")

@lhoestq

Cool ! I notice the current implementation doesn't support streaming because of the symlink hack.

I believe you can do something like this instead:

def _generate_tables(self, paths: list[str]):
    for path in paths:
        ds = lance.dataset(path)
        for frag_idx, fragment in enumerate(ds.get_fragments()):
            for batch_idx, batch in enumerate(
                fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
            ):
                table = pa.Table.from_batches([batch])
                table = self._cast_table(table)
                yield Key(frag_idx, batch_idx), table

Note that path can be a local path, but also an hf:// URI.
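The (frag_idx, batch_idx) key gives each yielded table a stable position in the stream. As a rough illustration of why that is useful for resuming, here is a pure-Python sketch that uses plain lists in place of Lance fragments. `Key` below is a local stand-in, and `generate_batches` and `resume_after` are hypothetical names for illustration; none of this is the actual `datasets` or Lance code.

```python
from typing import Iterator, List, NamedTuple, Optional, Tuple


class Key(NamedTuple):
    """Local stand-in for a (fragment index, batch index) key."""
    frag_idx: int
    batch_idx: int


def generate_batches(
    fragments: List[list],
    batch_size: int,
    resume_after: Optional[Key] = None,
) -> Iterator[Tuple[Key, list]]:
    """Yield (key, batch) pairs, skipping everything up to and including resume_after."""
    for frag_idx, rows in enumerate(fragments):
        for batch_idx, start in enumerate(range(0, len(rows), batch_size)):
            key = Key(frag_idx, batch_idx)
            # Tuples compare lexicographically, so this skips earlier fragments
            # and earlier batches within the same fragment.
            if resume_after is not None and key <= resume_after:
                continue
            yield key, rows[start:start + batch_size]


fragments = [[1, 2, 3], [4, 5]]  # two "fragments" with 3 and 2 rows
full = list(generate_batches(fragments, batch_size=2))
resumed = list(generate_batches(fragments, batch_size=2, resume_after=Key(0, 0)))
```

Because the key is derived only from the position in the dataset, a consumer that remembers the last key it saw can restart and receive exactly the batches it has not yet processed.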

@lhoestq

I took the liberty to make a few changes :)

Now I believe we should be good:

  • both local and streaming work fine
  • both dataset and single files work fine
  • all files are properly downloaded now that all data files and metadata files are included in config.data_files
  • sharding is supported:
    • dataset: one shard = one fragment
    • single files: one shard = one file
  • streaming dataset resuming works fine thanks to Key()
  • the two hacks are clearly marked, with TODOs to remove them when possible
    1. remove the revision in HF URIs since only "main" is supported
    2. write proper _versions/* files since lance doesn't work if they are symlinks
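As an illustration of the first hack, stripping the revision from an HF URI could look roughly like the sketch below. `strip_hf_revision` is a hypothetical helper, not the function used in this PR. It assumes the `hf://datasets/<org>/<repo>@<revision>/<path>` form and a revision without unencoded slashes, so something like `refs/pr/1` would need extra handling.

```python
import re

# Matches "hf://datasets/<org>/<repo>@<revision>" at the start of a URI.
# Assumption: the revision contains no "/" (e.g. "main" or a commit hash).
_HF_REVISION = re.compile(r"^(hf://datasets/[^/@]+/[^/@]+)@[^/]+")


def strip_hf_revision(uri: str) -> str:
    """Return the URI without its "@revision" part; other paths pass through unchanged."""
    return _HF_REVISION.sub(r"\1", uri, count=1)


with_revision = strip_hf_revision("hf://datasets/org/lance_repo@main/data.lance/data/0.lance")
without_revision = strip_hf_revision("hf://datasets/org/lance_repo/data.lance/data/0.lance")
local_path = strip_hf_revision("./local/data.lance")
```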

I think this PR is ready, just let me know what you think before we merge 🚀

The next steps are:

Feel free to start some drafts (I noticed there are great examples in your HF account now !), I'll be happy to review :)

And once Lance is available in huggingface.js and docs are ready we'll be ready to enable the Dataset Viewer and Lance code snippets on HF !

eddyxu

            yield Key(frag_idx, batch_idx), self._cast_table(table)
else:
    for file_idx, lance_file in enumerate(lance_files):
        for batch_idx, batch in enumerate(lance_file.read_all(batch_size=self.config.batch_size).to_batches()):

Should we support column pushdown here?

Just added it at LanceFileReader() initialization, since the argument is not available in read_all()

Actually, how does it work with multiple data files within the same fragment?

In Lance, one fragment can consist of one or more data files, where each data file covers a few columns. This is how we can add new features / columns cheaply without rewriting the dataset: by adding new data files to an existing fragment.

Maybe we can address it in follow-up tasks.
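To make the fragment / data file relationship concrete, here is a toy in-memory model. It is only a sketch of the idea described above: `DataFile` and `Fragment` are hypothetical classes written for this example, not Lance's API, and the file names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """One physical file holding a subset of a fragment's columns."""
    name: str
    columns: dict  # column name -> list of values, one per row


@dataclass
class Fragment:
    """A horizontal slice of rows, stored as one or more data files."""
    num_rows: int
    data_files: list = field(default_factory=list)

    def add_data_file(self, data_file: DataFile) -> None:
        # New columns must line up row-for-row with the existing ones.
        for name, values in data_file.columns.items():
            if len(values) != self.num_rows:
                raise ValueError(f"column {name!r} has {len(values)} rows, expected {self.num_rows}")
        self.data_files.append(data_file)

    def read(self, columns=None) -> dict:
        # Collect the requested columns from whichever data file holds them.
        out = {}
        for data_file in self.data_files:
            for name, values in data_file.columns.items():
                if columns is None or name in columns:
                    out[name] = values
        return out


base = DataFile("0.lance", {"id": [1, 2, 3], "text": ["a", "b", "c"]})
fragment = Fragment(num_rows=3)
fragment.add_data_file(base)

# Adding a feature later appends a second data file; the first one is not rewritten.
fragment.add_data_file(DataFile("1.lance", {"embedding": [[0.1], [0.2], [0.3]]}))
```

In this model, reading `0.lance` on its own would only return `id` and `text`; something outside the data files has to record which files together make up the fragment.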

in that case it's a dataset no ? since it requires a manifest or something to tell what the fragments are made of

LanceFileReader() is only used for single files, i.e. files that don't belong to a lance dataset directory or require manifest files
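One way to picture the distinction is a helper that splits a flat list of paths into dataset roots and standalone files. This is a hypothetical sketch, not the logic in this PR: `split_lance_paths` is an invented name, and it assumes that a Lance dataset directory can be recognised by a `_versions/` subdirectory holding its manifests.

```python
def split_lance_paths(paths, versions_dir="_versions"):
    """Split paths into (dataset_roots, single_files).

    Assumption: any directory containing "<versions_dir>/..." is a Lance dataset root,
    and every ".lance" file outside such a root is a standalone file.
    """
    marker = f"/{versions_dir}/"
    # Everything before "/_versions/" is treated as the dataset root.
    roots = sorted({p.partition(marker)[0] for p in paths if marker in p})
    single_files = [
        p for p in paths
        if p.endswith(".lance") and not any(p.startswith(root + "/") for root in roots)
    ]
    return roots, single_files


paths = [
    "train.lance/_versions/1.manifest",
    "train.lance/data/0.lance",
    "extra/part-0.lance",
]
dataset_roots, single_files = split_lance_paths(paths)
```

Here `train.lance/data/0.lance` is left to the dataset reader because its manifest describes it, while `extra/part-0.lance` has no manifest and would be read as a single file.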

I see. Let's 🚢 !!

@lhoestq