Histogram (revisited) by LukeMathWalker · Pull Request #9 · rust-ndarray/ndarray-stats
I think this is a good approach.
A few thoughts:
-
Edgesneeds to be careful about repeated elements. There are two possible strategies:-
Edgescan make sure all of the elements are unique, in addition to being in order. The constructor could do this by removing duplicates or by returning an error when there are duplicates. This would make the rest of the implementation simpler but would be less convenient for the user. -
Alternatively, it needs to be careful about identifying the correct bin. For example, the current implementation of
.indexes()doesn't correctly handle repeated edges, since in the case of repeated elements,.binary_search()can return the index of any of those elements.Assuming this strategy isn't much more difficult to implement, I prefer it (to make things simple for the user). It's also worth noting that NumPy uses this strategy.
-
-
I'd recommend removing the
ndimmember fromHistogramCountsand instead adding a.ndim()method that callsself.counts.ndim(). (This means there are fewer things to keep in-sync.)Edit: In the
.ndim()method, you could also have a linedebug_assert_eq!(counts.ndim(), bins.len());. (Even though this probably isn't necessary, I like using debug assertions like this because they're zero-cost in release mode and may catch bugs as things change in the future.) -
In the future, it would be worth adding the dimension of
countsto the type ofHistogramCountsinstead of always usingArrayD. This isn't necessary for a first implementation, though. -
In Histogram #8, I suggested separate
HistogramCountsandHistogramDensitytypes. On further reflection, I don't think aHistogramDensitytype is worth adding. Instead, I'd suggest adding a.density()method onHistogramCountsthat returns an array of densities. This does mean that if all you want is density, there's a conversion cost from counts -> densities once all the data have been counted, but I think the simplicity is worth the cost (and this approach may in-fact be cheaper anyway). (This also means that we can renameHistogramCountsto justHistogram.)