Speed up past/future split by shchur · Pull Request #83 · autogluon/fev
Issue #, if available:
Currently the operations based on `Dataset.filter` and `Dataset.map` are quite slow. Just running the following code takes ~20 minutes and generates 10+ GB of intermediate files in `~/.cache/huggingface/datasets`:
```python
bench = fev.Benchmark.from_yaml(
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/fev_bench/tasks.yaml"
)
for task in bench.tasks:
    for window in task.iter_windows():
        window.get_input_data()
```
These are not the only bottlenecks; there are also slow `map`-based operations in the metrics that I will address in a separate PR.
Description of changes:
- Perform length-based filtering & past/future splits completely in memory using pyarrow operations, without saving any intermediate results to disk. This results in a large speedup: iterating over all windows in `fev-bench` takes ~4 minutes (down from 20+) and writes nothing to disk.
- The main logic is inspired by the efficient slicing algorithm from `TimeSeriesDataFrame` in AutoGluon, which essentially performs `df.groupby("item_id").nth(slice(start, end))` on flat numpy arrays.
- I validated that the values in the datasets are identical (`np.allclose`) by sampling 1/7th of all evaluation windows in fev-bench and comparing the values on the `main` / PR branch.
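For reviewers unfamiliar with the AutoGluon trick referenced above, here is a rough sketch of how a vectorized per-group slice can be done on flat numpy arrays without a Python loop. This is an illustration only, not the code in this PR; the function name `groupby_nth_slice` and its signature are hypothetical.

```python
import numpy as np


def groupby_nth_slice(values: np.ndarray, group_sizes: np.ndarray,
                      start: int, end: int) -> np.ndarray:
    """Vectorized analogue of df.groupby("item_id").nth(slice(start, end)).

    `values` holds all rows concatenated in group order; `group_sizes[i]` is
    the number of rows in group i. `start`/`end` follow Python slice
    semantics, so negative values count from the end of each group.
    """
    # Offset of each group's first row in the flat array
    group_starts = np.concatenate([[0], np.cumsum(group_sizes)[:-1]])
    # Resolve (possibly negative) slice bounds per group, clipped to group length
    lo = np.where(start >= 0, np.minimum(start, group_sizes),
                  np.maximum(group_sizes + start, 0))
    hi = np.where(end >= 0, np.minimum(end, group_sizes),
                  np.maximum(group_sizes + end, 0))
    lengths = np.maximum(hi - lo, 0)
    # Within-slice offsets 0, 1, ..., length-1 for each group, concatenated
    slice_starts = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    offsets = np.arange(lengths.sum()) - np.repeat(slice_starts, lengths)
    # Absolute indices into the flat array
    indices = np.repeat(group_starts + lo, lengths) + offsets
    return values[indices]


# Example: two items with 3 and 2 observations; slice(0, -1) keeps the
# "past" portion (everything except the last row) of each item.
values = np.array([10, 11, 12, 20, 21])
sizes = np.array([3, 2])
past = groupby_nth_slice(values, sizes, 0, -1)  # → [10, 11, 20]
```

Because the result is computed purely from index arithmetic (`cumsum`, `repeat`, fancy indexing), it avoids both per-group Python iteration and any intermediate on-disk materialization.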
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.