Finding data filenames with dask

Summary of a discussion with @bnlawrence, @sadielbartholomew and @davidhassell ...

There is a use case for knowing which files are in use by dask chunks (specifically, so that they may be retrieved from deep storage if they are not currently available).

"In use by dask" means those chunks that would be needed to provide the answer when the dask compute is run. A typical example is when a CFA variable's data is read into a dask array and then sliced: in general, some of the chunks have in essence been discarded.

Dask doesn't actually discard any original chunks; rather, it just stops referencing those that are not needed. It is presumed (but not confirmed) that at compute time the graph is analysed (by dask voodoo) from the top (last operation) down, from which it can ascertain which chunks are in play.

In the following example, 5 chunks are sliced down to 2. All 5 original chunks remain in the graph, but the top operation (getitem in this case) only references 2:

>>> import dask.array as da
>>> import numpy as np
>>> a = da.from_array(np.arange(9), chunks=2)
>>> dict(a.dask)
{('array-076524d02e31a08d351018b54f8bf5ef', 0): array([0, 1]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 1): array([2, 3]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 2): array([4, 5]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 3): array([6, 7]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 4): array([8])}
>>> b = a[1:4]
>>> dict(b.dask)
{('array-076524d02e31a08d351018b54f8bf5ef', 0): array([0, 1]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 1): array([2, 3]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 2): array([4, 5]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 3): array([6, 7]),
 ('array-076524d02e31a08d351018b54f8bf5ef', 4): array([8]),
 ('getitem-584428e015c7c29906d4a44fac83eeb5',
  0): (<function dask.array.chunk.getitem(obj, index)>, ('array-076524d02e31a08d351018b54f8bf5ef',
   0), (slice(1, 2, 1),)),
 ('getitem-584428e015c7c29906d4a44fac83eeb5',
  1): (<function dask.array.chunk.getitem(obj, index)>, ('array-076524d02e31a08d351018b54f8bf5ef',
   1), (slice(None, None, None),))}
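
The same observation can be made programmatically. The following is a minimal sketch, not a settled design: it assumes that `dask.optimization.cull` (which prunes a graph down to the tasks reachable from a given set of keys) is an acceptable stand-in for whatever dask does internally at compute time, and it treats "reachable keys with no dependencies" as the original chunks in play.

```python
import dask.array as da
import numpy as np
from dask.optimization import cull

a = da.from_array(np.arange(9), chunks=2)
b = a[1:4]

# Materialise the graph as a plain dict, as in the session above.
graph = dict(b.dask)

# Keep only the tasks reachable from the output keys of b.
culled, dependencies = cull(graph, b.__dask_keys__())

# Assumption: an original chunk is a reachable key that depends on nothing.
chunks_in_use = sorted(key for key, deps in dependencies.items() if not deps)

print(len(graph), len(culled))
print(chunks_in_use)
```

Here `chunks_in_use` should hold only the keys for chunks 0 and 1 of the original array, while `graph` still contains all 5 original chunks.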

We think that it should not be a problem to run this graph analysis ourselves, outside of a compute, to work out which original chunks are in play, and from these we should be able to pull out any file names. Result (maybe!).
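
As a rough illustration of what such an analysis might look like, here is a dask-free sketch that walks a dict-based graph from the output keys down and collects file names from the source chunks it reaches. Everything named here is hypothetical: `FileChunk` stands in for whatever file-backed object ends up in the graph, `filenames_in_use` is an illustrative helper rather than an existing API, and the hand-written graph only mimics the shape of the example above (tasks as tuples whose first element is a callable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChunk:
    """Hypothetical file-backed chunk (not a dask or cf class)."""

    filename: str


def getitem(obj, index):
    """Placeholder for the slicing function stored in each task."""
    return obj[index]


def is_key(graph, obj):
    """True if obj is a key of the graph; unhashable objects never are."""
    try:
        return obj in graph
    except TypeError:  # e.g. a tuple holding a slice on older Pythons
        return False


def dependencies(graph, task):
    """Return the graph keys referenced directly by a tuple task."""
    if not (isinstance(task, tuple) and task and callable(task[0])):
        return []  # a literal value, so nothing further to follow
    return [arg for arg in task[1:] if is_key(graph, arg)]


def filenames_in_use(graph, output_keys):
    """Walk the graph from the output keys down, collecting file names."""
    found, seen = set(), set()
    stack = list(output_keys)
    while stack:
        key = stack.pop()
        if key in seen:
            continue
        seen.add(key)
        node = graph[key]
        if isinstance(node, FileChunk):
            found.add(node.filename)
        else:
            stack.extend(dependencies(graph, node))
    return found


# Five source chunks, of which the two getitem tasks reference only two.
graph = {("array", i): FileChunk(f"file_{i}.nc") for i in range(5)}
graph[("getitem", 0)] = (getitem, ("array", 0), (slice(1, 2, 1),))
graph[("getitem", 1)] = (getitem, ("array", 1), (slice(None),))

in_use = filenames_in_use(graph, [("getitem", 0), ("getitem", 1)])
print(sorted(in_use))  # ['file_0.nc', 'file_1.nc']
```

Nothing is computed or read from disk here; only the graph structure is inspected, which is the property we would want for deciding what to retrieve from deep storage.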

We do not plan to do this for v3.14 (the first daskified release), but investigating it thereafter will be a priority.