Histogram by LukeMathWalker · Pull Request #8

Histogram by LukeMathWalker · Pull Request #8 · rust-ndarray/ndarray-stats

Thanks for working on this. Unfortunately, I'm not really convinced with the approach.

With BinsNd we can express collections of bins that are more general than the ones supported by NumPy.

What's the use-case for this? Maybe I'm biased by how NumPy does things, but when I think of a histogram, I generally think of the bins as subdividing a space by splitting each axis at specific edges. So, in the 2-D case, I would expect to divide the space like this:

+---+-------+-+
|   |       | |
+---+-------+-+
|   |       | |
|   |       | |
|   |       | |
|   |       | |
+---+-------+-+

but generally not this:

+---+-------+-+
|   |       | |
|   +-------+-+
|   |         |
|   |         |
|   |         |
|   |         |
+---+-------+-+

or overlapping bins. I suppose I can think of some plausible uses for weird bins, but I think it makes more sense to make the common case (rectangular bins defined by splitting the axes at edges) fast and simple to use.

In other words, I like NumPy's edge-based representation. So, for example, for the 3-D case, we'd need edges for three axes, so we could represent this as edges = Vec<Array1<A>> where edges[0] is the array of edges for the first axis, edges[1] is the array of edges for the second axis, and edges[2] is the array of edges for the third axis. The total number of n-D bins represented this way is (edges[0].len() - 1) * (edges[1].len() - 1) * (edges[2].len() - 1).

I like this NumPy-style representation better than BinsNd for a few reasons:

The histogram counts can be represented as an n-D array. The BinsNd approach works best with a 1-D representation like HashMap (the current PR), BTreeMap or Vec, but this loses the shape information. For example, iterating along a particular axis is difficult. Even iterating over the all bins in a sensible order is difficult with HashMap. (I think you could handle the iteration over all bins in a good order by switching to a BTreeMap and carefully defining the ordering of BinsNd or switching to a Vec the same length as the Vec of bins, but it still would not be as convenient or flexible as an n-D array.)
It's a lot easier to plot a histogram given the edges along each axis instead of BinsNd. For example, consider the problem of determining the locations of the ticks along axis 1. With the NumPy-style representation, this is simply edges[1]. With the BinsNd representation it's not obvious. While we don't have any good plotting libraries in Rust at the moment, I think the tasks involved in plotting a histogram are fairly representative of the types of things you might want to do with a histogram.
The NumPy-style representation only allows axes to be split. It does not allow for "merged" bins (the second picture I provided above) or overlapping bins. This is more restrictive, but it also makes working with the NumPy-style representation simpler because you don't have to worry about those special cases.
The size in memory of BinsNd is O(n^ndim * ndim) where n is the number of bins along each axis (assuming the number of bins along each axis is the same). (There are n^ndim total bins, and each one has size ndim.) In contrast, the size in memory of the NumPy-style representation is O(n*ndim). (There are ndim axes, and each one has n+1 edges.) This is a substantial difference, even for a moderately-sized case like n = 30, ndim = 3. I suppose you could argue that the size in memory of the counts is O(n^ndim), so why worry about the bins having a similar big-O, but I think it still makes sense to keep the memory required for the bins small.
Possible performance differences (keep in mind these are just educated guesses without benchmarks):
- The number of Vecs in the BinsNd representation is 1 + n^ndim. (1 for the outer Vec and n^ndim Vecs for all of the bins.) The large number of Vecs is problematic if the memory allocator doesn't do a good job keeping them close together in memory. (If the Vecs are scattered across memory, caching doesn't work very well. Fortunately, IIUC jemalloc is pretty good at keeping allocations close together, so this may not actually be an issue in practice.) It would be possible to avoid this issue for fixed-dimension BinNd by using an alternative representation for BinNd (e.g. { lower_bounds: D, upper_bounds: D }). The NumPy-style representation doesn't have this issue at all, though; the number of Vecs is 1 + ndim. (1 for the outer Vec, and one for each axis.)
- Interestingly enough, if the bins are sorted, both representations have the same big-O cost of finding the correct bin with binary search (O(log(n^ndim)) = O(ndim * log(n))), but I would guess that the NumPy-style representation would be faster because it's more cache-friendly.

So, what I'd suggest is something like this:

/// Wrapper around `Array1` that makes sure the elements are in ascending order.
struct Edges<A: Ord> {
    edges: Array1<A>,
}

impl<A: Ord> From<Array1<A>> for Edges<A> {
    fn from(mut edges: Array1<A>) -> Self {
        // sort the array in-place
        Edges { edges }
    }
}

impl<A: Ord> Edges<A> {
    fn view(&self) -> ArrayView1<A> {
        self.edges.view()
    }

    /// Returns the index of the bin containing the given value,
    /// or `None` if none of the bins contain the value.
    fn bin_index(&self, value: &A) -> Option<usize> {
        // binary search for the correct bin
    }

    /// Returns the range of the bin containing the given value.
    fn bin_range(&self, value: &A) -> Option<Range<A>>
    where
        A: Clone,
    {
        let i = self.bin_index(value);
        Range { start: self.edges[i].clone(), end: self.edges[i + 1].clone() }
    }
}

struct HistogramCounts<A: Ord> {
    counts: ArrayD<usize>,
    edges: Vec<Edges<A>>,
}

struct HistogramDensity<A: Ord> {
    density: ArrayD<A>,
    edges: Vec<Edges<A>>,
}

impl<A: Ord> HistogramCounts<A> {
    pub fn new(edges: Vec<Edges<A>>) -> Self {
        let counts = ArrayD::zeros(edges.iter().map(|e| e.len() - 1).collect::<Vec<_>>());
        HistogramCounts { counts, edges }
    }

    pub fn add_observation(observation: ArrayView1<A>) -> Result<(), NotFound> {
        let bin = observation
            .iter()
            .zip(&self.edges)
            .map(|(v, e)| e.bin_index(v).ok_or(NotFound))
            .collect::<Result<Vec<_>, _>>()?;
        self.counts[bin] += 1;
    }
}

A few possible modifications:

An EdgesNd wrapper type for Vec<Edges<A>> might be nice to provide methods for finding the n-D index of the bin corresponding to a value and other similar operations. Alternatively, we could provide this functionality on the Histogram* types.
We could combine HistogramCounts and HistogramDensity into a single type with a parameter for the element type of the counts/density array. I'm pretty ambivalent about this, because I think a Histogram trait implemented for HistogramCounts and HistogramDensity would work just as well as a common Histogram struct.
We could add the dimensionality to the type of Edges/Histogram*. I think this is probably worth doing.
Instead of Edges containing Array1<A>, it could contain Vec<A> so that we could take advantage of the standard library's sorting and searching methods. (ndarray doesn't have an equivalent to the standard library's .unstable_sort() or .binary_search().)
I've used a requirement of A: Ord for simplicity, which would require using N64 (noisy-float's "not NaN" wrapper type) instead of f64, for example. We might come up with some other way to allow PartialOrd. This is similar to the problem we had with the quantile methods.

What are your thoughts?