Histogram (revisited) by LukeMathWalker · Pull Request #9 · rust-ndarray/ndarray-stats

I think this is a good approach.

A few thoughts:

  • Edges needs to be careful about repeated elements. There are two possible strategies:

    1. Edges can make sure all of the elements are unique, in addition to being in order. The constructor could do this by removing duplicates or by returning an error when there are duplicates. This would make the rest of the implementation simpler but would be less convenient for the user.

    2. Alternatively, it needs to be careful about identifying the correct bin. For example, the current implementation of .indexes() doesn't correctly handle repeated edges, since in the case of repeated elements, .binary_search() can return the index of any of those elements.

      Assuming this strategy isn't much more difficult to implement, I prefer it (to make things simple for the user). It's also worth noting that NumPy uses this strategy.

  • I'd recommend removing the ndim member from HistogramCounts and instead adding a .ndim() method that calls self.counts.ndim(). (This means there are fewer things to keep in-sync.)

    Edit: In the .ndim() method, you could also have a line debug_assert_eq!(counts.ndim(), bins.len());. (Even though this probably isn't necessary, I like using debug assertions like this because they're zero-cost in release mode and may catch bugs as things change in the future.)

  • In the future, it would be worth adding the dimension of counts to the type of HistogramCounts instead of always using ArrayD. This isn't necessary for a first implementation, though.

  • In Histogram #8, I suggested separate HistogramCounts and HistogramDensity types. On further reflection, I don't think a HistogramDensity type is worth adding. Instead, I'd suggest adding a .density() method on HistogramCounts that returns an array of densities. This does mean that if all you want is density, there's a conversion cost from counts -> densities once all the data have been counted, but I think the simplicity is worth the cost (and this approach may in-fact be cheaper anyway). (This also means that we can rename HistogramCounts to just Histogram.)