Histogram by LukeMathWalker · Pull Request #8 · rust-ndarray/ndarray-stats
Thanks for working on this. Unfortunately, I'm not really convinced with the approach.
With
BinsNdwe can express collections of bins that are more general than the ones supported by NumPy.
What's the use-case for this? Maybe I'm biased by how NumPy does things, but when I think of a histogram, I generally think of the bins as subdividing a space by splitting each axis at specific edges. So, in the 2-D case, I would expect to divide the space like this:
+---+-------+-+
| | | |
+---+-------+-+
| | | |
| | | |
| | | |
| | | |
+---+-------+-+
but generally not this:
+---+-------+-+
| | | |
| +-------+-+
| | |
| | |
| | |
| | |
+---+-------+-+
or overlapping bins. I suppose I can think of some plausible uses for weird bins, but I think it makes more sense to make the common case (rectangular bins defined by splitting the axes at edges) fast and simple to use.
In other words, I like NumPy's edge-based representation. So, for example, for the 3-D case, we'd need edges for three axes, so we could represent this as edges = Vec<Array1<A>> where edges[0] is the array of edges for the first axis, edges[1] is the array of edges for the second axis, and edges[2] is the array of edges for the third axis. The total number of n-D bins represented this way is (edges[0].len() - 1) * (edges[1].len() - 1) * (edges[2].len() - 1).
I like this NumPy-style representation better than BinsNd for a few reasons:
-
The histogram counts can be represented as an n-D array. The
BinsNdapproach works best with a 1-D representation likeHashMap(the current PR),BTreeMaporVec, but this loses the shape information. For example, iterating along a particular axis is difficult. Even iterating over the all bins in a sensible order is difficult withHashMap. (I think you could handle the iteration over all bins in a good order by switching to aBTreeMapand carefully defining the ordering ofBinsNdor switching to aVecthe same length as theVecof bins, but it still would not be as convenient or flexible as an n-D array.) -
It's a lot easier to plot a histogram given the edges along each axis instead of
BinsNd. For example, consider the problem of determining the locations of the ticks along axis 1. With the NumPy-style representation, this is simplyedges[1]. With theBinsNdrepresentation it's not obvious. While we don't have any good plotting libraries in Rust at the moment, I think the tasks involved in plotting a histogram are fairly representative of the types of things you might want to do with a histogram. -
The NumPy-style representation only allows axes to be split. It does not allow for "merged" bins (the second picture I provided above) or overlapping bins. This is more restrictive, but it also makes working with the NumPy-style representation simpler because you don't have to worry about those special cases.
-
The size in memory of
BinsNdisO(n^ndim * ndim)wherenis the number of bins along each axis (assuming the number of bins along each axis is the same). (There aren^ndimtotal bins, and each one has sizendim.) In contrast, the size in memory of the NumPy-style representation isO(n*ndim). (There arendimaxes, and each one hasn+1edges.) This is a substantial difference, even for a moderately-sized case liken = 30, ndim = 3. I suppose you could argue that the size in memory of the counts isO(n^ndim), so why worry about the bins having a similar big-O, but I think it still makes sense to keep the memory required for the bins small. -
Possible performance differences (keep in mind these are just educated guesses without benchmarks):
-
The number of
Vecs in theBinsNdrepresentation is1 + n^ndim. (1for the outerVecandn^ndimVecs for all of the bins.) The large number ofVecs is problematic if the memory allocator doesn't do a good job keeping them close together in memory. (If theVecs are scattered across memory, caching doesn't work very well. Fortunately, IIUC jemalloc is pretty good at keeping allocations close together, so this may not actually be an issue in practice.) It would be possible to avoid this issue for fixed-dimensionBinNdby using an alternative representation forBinNd(e.g.{ lower_bounds: D, upper_bounds: D }). The NumPy-style representation doesn't have this issue at all, though; the number ofVecs is1 + ndim. (1for the outerVec, and one for each axis.) -
Interestingly enough, if the bins are sorted, both representations have the same big-O cost of finding the correct bin with binary search (
O(log(n^ndim)) = O(ndim * log(n))), but I would guess that the NumPy-style representation would be faster because it's more cache-friendly.
-
So, what I'd suggest is something like this:
/// Wrapper around `Array1` that makes sure the elements are in ascending order. struct Edges<A: Ord> { edges: Array1<A>, } impl<A: Ord> From<Array1<A>> for Edges<A> { fn from(mut edges: Array1<A>) -> Self { // sort the array in-place Edges { edges } } } impl<A: Ord> Edges<A> { fn view(&self) -> ArrayView1<A> { self.edges.view() } /// Returns the index of the bin containing the given value, /// or `None` if none of the bins contain the value. fn bin_index(&self, value: &A) -> Option<usize> { // binary search for the correct bin } /// Returns the range of the bin containing the given value. fn bin_range(&self, value: &A) -> Option<Range<A>> where A: Clone, { let i = self.bin_index(value); Range { start: self.edges[i].clone(), end: self.edges[i + 1].clone() } } } struct HistogramCounts<A: Ord> { counts: ArrayD<usize>, edges: Vec<Edges<A>>, } struct HistogramDensity<A: Ord> { density: ArrayD<A>, edges: Vec<Edges<A>>, } impl<A: Ord> HistogramCounts<A> { pub fn new(edges: Vec<Edges<A>>) -> Self { let counts = ArrayD::zeros(edges.iter().map(|e| e.len() - 1).collect::<Vec<_>>()); HistogramCounts { counts, edges } } pub fn add_observation(observation: ArrayView1<A>) -> Result<(), NotFound> { let bin = observation .iter() .zip(&self.edges) .map(|(v, e)| e.bin_index(v).ok_or(NotFound)) .collect::<Result<Vec<_>, _>>()?; self.counts[bin] += 1; } }
A few possible modifications:
-
An
EdgesNdwrapper type forVec<Edges<A>>might be nice to provide methods for finding the n-D index of the bin corresponding to a value and other similar operations. Alternatively, we could provide this functionality on theHistogram*types. -
We could combine
HistogramCountsandHistogramDensityinto a single type with a parameter for the element type of thecounts/densityarray. I'm pretty ambivalent about this, because I think aHistogramtrait implemented forHistogramCountsandHistogramDensitywould work just as well as a commonHistogramstruct. -
We could add the dimensionality to the type of
Edges/Histogram*. I think this is probably worth doing. -
Instead of
EdgescontainingArray1<A>, it could containVec<A>so that we could take advantage of the standard library's sorting and searching methods. (ndarraydoesn't have an equivalent to the standard library's.unstable_sort()or.binary_search().) -
I've used a requirement of
A: Ordfor simplicity, which would require usingN64(noisy-float's "not NaN" wrapper type) instead off64, for example. We might come up with some other way to allowPartialOrd. This is similar to the problem we had with the quantile methods.
What are your thoughts?