Skip to content

Sketch of an idea for statistics#15

Closed
scottmcm wants to merge 1 commit into
rscarson:masterfrom
scottmcm:stats-aggregators
Closed

Sketch of an idea for statistics#15
scottmcm wants to merge 1 commit into
rscarson:masterfrom
scottmcm:stats-aggregators

Conversation

@scottmcm
Copy link
Copy Markdown
Contributor

@scottmcm scottmcm commented Apr 2, 2026

Your note about tuples in #12 (comment)

I have a lot of functions that just return more than one statistic as a tuple so that I don't have to recalculate them, but that's pretty clumsy

Made me think of taking that ad-hoc way but turning it extensible by having types that can be combined into tuples so you can use any combination of them.

So here's a sketch of that idea 🙂

Definitely a draft in that I haven't really thought through the implications of it for more than the two that I implemented, and it might be that this makes no sense for more complex things without well-known online versions. But maybe it'll inspire you to come up with a better version. (Feel free to just close if you go a different direction.)

@rscarson
Copy link
Copy Markdown
Owner

rscarson commented Apr 7, 2026

I thought a good first step here was to map out the paths through the stats metrics to see the scope of the problem.
The main issues are:

  • Allocations where we can avoid them
  • Needing to consume iterators to count

Here is what I found:

r_squared -> r_squared_with_n

stddev_and_mean -> mean

skewness_and_kurtosis -> stddev_and_mean

residual_normality -> skewness_and_kurtosis -> stddev_and_mean

robust_r_squared -> median_absolute_deviation
                    huber_loss

adjusted_r_squared -> r_squared_with_n

root_mean_squared_error -> mse_with_n

mean_squared_error -> mse_with_n

huber_log_likelihood -> robust_mse_with_n

robust_mse_with_n -> median_absolute_deviation
                     huber_loss

median_absolute_deviation -> median

median_squared_deviation -> median

A good chunk of these can be mediated using singlepass algorithms, like the stdev/mean, r squared, as well as the normality tests

Others are a result of me trying to be clever and reduce code reuse. Take this example:

pub fn adjusted_r_squared<T: Value>(
    y: impl Iterator<Item = T>,
    y_fit: impl Iterator<Item = T>,
    k: T,
) -> T {
    let (r2, n) = r_squared_with_n(y, y_fit);
    r2 - (T::one() - r2) * k / (n - k)
}

Here, I need to consume the iterators to get the count for the adjusted r squared, but I also need the r squared value.
I've already found a way to avoid allocating in the r squared calculation, but I still need to consume the iterators to get the count.

I think your accumulator is probably the cleanest way to handle this, since we can also use it to centralize the Kahan (or Welford for the stdev) I should be using for the sums to avoid numerical instability. But it would be focused on collecting intermediate values for a single metric, rather than for cross-metric use, which I've more or less eliminated now

For median, I will switch to using quickselect through .select_nth_unstable and take in a mut slice instead of an iterator, since we do need to allocate
So all the upstream stuff for robust will need to allocate somewhere

@rscarson
Copy link
Copy Markdown
Owner

rscarson commented Apr 7, 2026

What do you think of something like this?

use crate::value::Value;

pub struct KahanAccumulator<T: Value> {
    sum: T,
    kahan: T,
}
impl<T: Value> Default for KahanAccumulator<T> {
    fn default() -> Self {
        KahanAccumulator {
            sum: T::zero(),
            kahan: T::zero(),
        }
    }
}
impl<T: Value> KahanAccumulator<T> {
    pub fn add(&mut self, value: T) {
        let y = value - self.kahan;
        let t = self.sum + y;
        self.kahan = (t - self.sum) - y;
        self.sum = t;
    }

    pub fn sum(&self) -> T {
        self.sum
    }
}

pub struct RSquaredAccumulator<T: Value> {
    count: T,
    sum: KahanAccumulator<T>,
    sum_sq: KahanAccumulator<T>,
    sum_res_sq: KahanAccumulator<T>,
}
impl<T: Value> Default for RSquaredAccumulator<T> {
    fn default() -> Self {
        RSquaredAccumulator {
            count: T::zero(),
            sum: KahanAccumulator::default(),
            sum_sq: KahanAccumulator::default(),
            sum_res_sq: KahanAccumulator::default(),
        }
    }
}
impl<T: Value> RSquaredAccumulator<T> {
    pub fn add(&mut self, y: T, y_fit: T) {
        self.count += T::one();
        self.sum.add(y);
        self.sum_sq.add(y * y);
        self.sum_res_sq.add(Value::powi(y - y_fit, 2));
    }

    pub fn r_squared(&self) -> Option<T> {
        if self.count == T::zero() {
            return None;
        }

        let mean = self.sum.sum() / self.count;
        let tss = self.sum_sq.sum() - self.count * mean * mean;
        let r2 = T::one() - (self.sum_res_sq.sum() / tss);
        Some(r2)
    }

    pub fn count(&self) -> T {
        self.count
    }
}
impl<T: Value> std::iter::FromIterator<(T, T)> for RSquaredAccumulator<T> {
    fn from_iter<I: IntoIterator<Item = (T, T)>>(iter: I) -> Self {
        let mut acc = RSquaredAccumulator::default();
        for (y, y_fit) in iter {
            acc.add(y, y_fit);
        }
        acc
    }
}

It gives us a public facing implementation like this:

pub fn adjusted_r_squared<T: Value>(
    y: impl Iterator<Item = T>,
    y_fit: impl Iterator<Item = T>,
    k: T,
) -> Option<T> {
    let acc: RSquaredAccumulator<T> = y.zip(y_fit).collect();
    let r2 = acc.r_squared()?;
    let n = acc.count();
    if n <= k {
        return None; // Not enough data points to compute adjusted R²
    }

    Some(r2 - (T::one() - r2) * k / (n - k))
}

pub fn r_squared<T: Value>(
    y: impl Iterator<Item = T>,
    y_fit: impl Iterator<Item = T>,
) -> Option<T> {
    let acc: RSquaredAccumulator<T> = y.zip(y_fit).collect();
    acc.r_squared()
}

@rscarson
Copy link
Copy Markdown
Owner

This is what I've come up with so far, it's a lot more maintainable and definitely have a lot fewer allocations

I'm still not really happy with the structure overall just because I couldn't think of a really clean way to turn this into a trait, ideally that could be unified so that I could make a parallel iterator version if the features turned on

https://github.com/rscarson/polyfit/blob/stats_rework/src%2Fstatistics%2Faccumulator.rs

@scottmcm
Copy link
Copy Markdown
Contributor Author

Yeah, the more I think about the the tuple trick I had in here doesn't really solve the core problem, since it doesn't have the "dependency graph" of getting things from other things.

What you have there with accumulator objects looks good to me. Especially since those can work well with Iterator::fold and such if needed -- allowing external drive instead of using the internal ones can make various things easier. Certainly having convenience things is good, but the flexibility to not need them is nice.

@scottmcm scottmcm closed this Apr 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants