`Count` and `Mean` aggregates by blaginin · Pull Request #7201 · vortex-data/vortex

`Count` and `Mean` aggregates by blaginin · Pull Request #7201 · vortex-data/vortex

Signed-off-by: blaginin <dima@spiraldb.com>
Co-authored-by: Claude <noreply@anthropic.com>

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <dima@spiraldb.com>

		/// Return the count of non-null elements in an array.
		///
		/// See [`Count`] for details.
		pub fn count(array: &ArrayRef, ctx: &mut ExecutionCtx) -> VortexResult<u64> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is counting things in an array, should it return a usize?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to register a generic aggregate kernel to reduce count-non-null to be Array.validity().sum(), then we avoid decompressing all the data.

		}

		fn partial_struct_dtype() -> DType {
		DType::Struct(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you static LazyLock this entire dtype inside this function?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be decimal for decimals?

Comment on lines +53 to +56

		pub struct MeanPartial {
		sum: f64,
		count: u64,
		}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will compute with a pretty large error?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$$\mu_n = \sum_{i=0}^n x_i/n \iff \mu_{n+1} = \mu_n + \frac{x_{n+1} - \mu_n}{1+n}$$

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need only hold the partial mean and count

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can incrementally push the next value into this partial, but you cannot combine two partials afaik?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For combining you do. $$\mu_{AB} = \frac{n_A \mu_A + n_B \mu_B}{n_A + n_B}$$

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought in that formula there's even more division and hence more rounding errors?

I think there are some smart algorithms, but I haven't seen a simple implementation ever being a problem in df

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree. Let's do what DataFusion odes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF sum is T and ours is f64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error comes from numbers of different scales being combined, not from number of OPS.

Since we are storing these on disk we cannot "just change it later".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF only supports AVG for floats, decimals, and durations. And coerces all floats to f64.