Add lance format support by eddyxu · Pull Request #7913 · huggingface/datasets
Add lance format as one of the packaged_modules.
import datasets

ds = datasets.load_dataset("org/lance_repo", split="train")
# Or
ds = datasets.load_dataset("./local/data.lance")
Cool ! I notice the current implementation doesn't support streaming because of the symlink hack.
I believe you can do something like this instead:
def _generate_tables(self, paths: list[str]):
    for path in paths:
        ds = lance.dataset(path)
        for frag_idx, fragment in enumerate(ds.get_fragments()):
            for batch_idx, batch in enumerate(
                fragment.to_batches(columns=self.config.columns, batch_size=self.config.batch_size)
            ):
                table = pa.Table.from_batches([batch])
                table = self._cast_table(table)
                yield Key(frag_idx, batch_idx), table
Note that path can be a local path or an hf:// URI.
I took the liberty to make a few changes :)
Now I believe we should be good:
- both local and streaming work fine
- both dataset and single files work fine
- all files are properly downloaded now that all data files and metadata files are included in config.data_files
- sharding is supported:
  - dataset: one shard = one fragment
  - single files: one shard = one file
- streaming dataset resuming works fine thanks to Key()
- the two hacks are clearly marked, with TODOs to remove them when possible:
  - remove the revision in HF URIs since only "main" is supported
  - write proper _version/* files since lance doesn't work if they are symlinks
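As a rough illustration of why a (shard, batch) key makes sharding and resuming straightforward, here is a pure-Python sketch. It is not the datasets implementation: `BatchKey`, `generate_batches`, `resume_from`, and the in-memory `shards` list are hypothetical names, and only the (frag_idx, batch_idx) key shape comes from the snippet earlier in this thread.

```python
from typing import Iterator, Optional

# Hypothetical stand-ins: a shard (one fragment or one file) is a list of batches,
# and a key is the pair (shard_idx, batch_idx).
Shard = list[list[int]]
BatchKey = tuple[int, int]


def generate_batches(shards: list[Shard]) -> Iterator[tuple[BatchKey, list[int]]]:
    """Yield ((shard_idx, batch_idx), batch) in a deterministic order."""
    for shard_idx, shard in enumerate(shards):
        for batch_idx, batch in enumerate(shard):
            yield (shard_idx, batch_idx), batch


def resume_from(
    shards: list[Shard], last_key: Optional[BatchKey]
) -> Iterator[tuple[BatchKey, list[int]]]:
    """Skip everything up to and including last_key, then keep yielding."""
    for key, batch in generate_batches(shards):
        # Tuples compare lexicographically, so this orders by shard, then batch.
        if last_key is not None and key <= last_key:
            continue
        yield key, batch


shards = [
    [[1, 2], [3, 4]],  # shard 0: two batches
    [[5, 6]],          # shard 1: one batch
]

all_keys = [key for key, _ in generate_batches(shards)]
resumed = list(resume_from(shards, last_key=(0, 0)))
```

This toy version still iterates over already-consumed batches before skipping them; a real loader would presumably skip whole shards using the first element of the key.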
I think this PR is ready, just let me know what you think before we merge 🚀
The next steps are:
- open a PR in this repository to document Lance support in datasets
- open a PR in https://github.com/huggingface/hub-docs to add pylance to the list of integrated libraries on HF, and have some documentation on how to use it with datasets on HF (here is an example PR)
- open a PR in https://github.com/huggingface/huggingface.js to add Lance as a supported dataset library on the HF website (here is an example PR)
Feel free to start some drafts (I noticed there are great examples in your HF account now !), I'll be happy to review :)
And once Lance is available in huggingface.js and docs are ready we'll be ready to enable the Dataset Viewer and Lance code snippets on HF !
                yield Key(frag_idx, batch_idx), self._cast_table(table)
        else:
            for file_idx, lance_file in enumerate(lance_files):
                for batch_idx, batch in enumerate(lance_file.read_all(batch_size=self.config.batch_size).to_batches()):
should we support columns pushdown here?
Just added it at LanceFileReader() initialization, since the argument is not available in read_all()
Actually, how does it work with multiple data files within same fragment?
In Lance, one fragment can consist of one or more data files, where each data file covers a few columns. This is how we can add new features / columns cheaply without rewriting the dataset: by adding new data files to an existing fragment.
Maybe we can address it as a follow-up task.
in that case it's a dataset no ? since it requires a manifest or something to tell what the fragments are made of
LanceFileReader() is only used for single files, i.e. that don't belong to a lance dataset directory or require manifest files
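To make the distinction concrete, here is a toy model of a fragment whose columns are spread over several data files. It does not reflect Lance's actual on-disk format or API: `ToyDataFile`, `ToyFragment`, `read_columns`, and the file paths are all hypothetical. The point is only that something manifest-like has to record which files make up a fragment, whereas a standalone file needs no such mapping.

```python
from dataclasses import dataclass


@dataclass
class ToyDataFile:
    # Hypothetical: one data file holding a subset of the columns for the same rows.
    path: str
    columns: dict[str, list]


@dataclass
class ToyFragment:
    # Hypothetical manifest entry: the data files that together form one fragment.
    data_files: list[ToyDataFile]

    def read_columns(self, names: list[str]) -> dict[str, list]:
        """Return the requested columns, taking each from whichever file holds it."""
        out = {}
        for name in names:
            for data_file in self.data_files:
                if name in data_file.columns:
                    out[name] = data_file.columns[name]
                    break
            else:
                # No data file in this fragment contains the column.
                raise KeyError(f"column {name!r} not found in fragment")
        return out


base = ToyDataFile("data/0.lance", {"id": [1, 2, 3], "text": ["a", "b", "c"]})
# Adding a column later means adding a new data file, not rewriting the old one.
extra = ToyDataFile("data/1.lance", {"embedding": [[0.1], [0.2], [0.3]]})
fragment = ToyFragment(data_files=[base, extra])

selected = fragment.read_columns(["id", "embedding"])
```

Reading `base` on its own would work like the single-file case, but it would never see the `embedding` column, which is why the multi-file case belongs to the dataset code path.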
I see. Let's 🚢 !!