LAMA to Dask migration: `Data.stats` by sadielbartholomew · Pull Request #432 · NCAS-CMS/cf-python
Migrates the Data.stats method towards #182.
Since stats is a compound method, in that it is in essence just reporting the outputs from various other cf stats/collapsing methods, each of which have already been 'daskified' hence utilise dask under-the-hood, the simplest way to migrate with full performance (bar working out and making use of any sub-calculation operations which might be shared between any of the statistics or something like that which would surely be over the top for our purposes?) is (I believe) to run each statistic calculation in parallel, which in Dask world for this context is done with delayed functions and a final compute.
This PR implements this. As indicated by the resultant Dask task graph shown below, all statistics are calculated separately, not in serial, so stats should take only as long as the most intensive calculation rather than the sum of all calculation times.
(I hope I haven't misunderstood the assignment here, so to speak, by taking such an approach centered on Dask's delayed.)
Graph
With the code as-is on the branch/PR here and now, but with a little tweak so we can access and save the Dask task graph, namely I did:
diff --git a/cf/data/data.py b/cf/data/data.py index e91d6df03..2da53c70f 100644 --- a/cf/data/data.py +++ b/cf/data/data.py @@ -15,6 +15,7 @@ from dask.array.core import normalize_chunks from dask.base import is_dask_collection, tokenize from dask.core import flatten from dask.highlevelgraph import HighLevelGraph +from dask import delayed, compute, visualize from ..cfdatetime import dt as cf_dt from ..constants import masked as cf_masked @@ -9376,16 +9377,20 @@ class Data(DataClassDeprecationsMixin, Container, cfdm.Data): - return compute(out)[0] + return out
the task graph generated interactively via:
In [1]: import cf ...: import numpy as np ...: import dask In [2]: a = cf.example_field(1) In [3]: b = a.data.stats() In [4]: dask.visualize(b, filename='stats-graph-final.png') Out[4]: <IPython.core.display.Image object>
is:
which looks right to me. The task graph here doesn't indicate but obviously all results get pulled together into the dict output at compute-time at the end via querying the compute output tuple).
