Speed up past/future split by shchur · Pull Request #83 · autogluon/fev

Issue #, if available:

Currently, operations based on Dataset.filter and Dataset.map are quite slow. Just running the following code takes ~20 minutes and generates 10+ GB of intermediate files in ~/.cache/huggingface/datasets:

bench = fev.Benchmark.from_yaml(
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/fev_bench/tasks.yaml"
)
for task in bench.tasks:
    for window in task.iter_windows():
        window.get_input_data()

These are not the only bottlenecks: there are also slow map-based operations in the metrics computation, which I will address in a separate PR.

Description of changes:

  • Perform length-based filtering & past/future splits entirely in memory using pyarrow operations, without writing any intermediate results to disk. This yields a large speedup: iterating over all windows in fev-bench now takes ~4 minutes (down from 20+) and produces no files on disk.
  • The main logic is inspired by the efficient slicing algorithm from TimeSeriesDataFrame in AutoGluon, which essentially performs df.groupby("item_id").nth(slice(start, end)) on flat numpy arrays.
  • I validated that the values in the datasets are identical (np.allclose) by sampling 1/7th of all evaluation windows in fev-bench and comparing the values on the main and PR branches.
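For readers unfamiliar with the trick mentioned above, here is a minimal sketch of how a vectorized groupby-nth slice can work on flat numpy arrays. This is an illustrative example, not the actual implementation in this PR: the function name slice_per_item is hypothetical, and it assumes item ids are stored contiguously (sorted by item) and that the slice bounds are non-negative.

```python
import numpy as np


def slice_per_item(item_ids, start, end):
    """Return flat row indices equivalent to
    df.groupby("item_id").nth(slice(start, end)),
    assuming item_ids contains contiguous runs per item.
    Hypothetical sketch; negative bounds are not handled."""
    item_ids = np.asarray(item_ids)
    # Detect the first row of each contiguous run of identical item ids
    is_new = np.r_[True, item_ids[1:] != item_ids[:-1]]
    run_starts = np.flatnonzero(is_new)
    lengths = np.diff(np.r_[run_starts, len(item_ids)])

    # Clip the requested slice bounds to each item's length
    lo = np.minimum(start, lengths)
    hi = lengths if end is None else np.minimum(end, lengths)
    keep = np.maximum(hi - lo, 0)  # rows kept per item

    # Expand to flat indices: for each run, run_starts + lo + [0 .. keep)
    offsets = np.arange(keep.sum()) - np.repeat(np.cumsum(keep) - keep, keep)
    return np.repeat(run_starts + lo, keep) + offsets


# Three items of lengths 3, 2 and 4; take rows 1:3 within each item
item_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
idx = slice_per_item(item_ids, 1, 3)
print(idx.tolist())  # [1, 2, 4, 6, 7]
```

The key point is that no Python-level loop over items is needed: run boundaries, per-item clipping, and index expansion are all single vectorized numpy operations, which is what makes this approach fast on datasets with many short series.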

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.