Sketch of an idea for statistics#15
Conversation
|
I thought a good first step here was to map out the paths through the stats metrics to see the scope of the problem.
Here is what I found: A good chunk of these can be mediated using singlepass algorithms, like the stdev/mean, r squared, as well as the normality tests Others are a result of me trying to be clever and reduce code reuse. Take this example: pub fn adjusted_r_squared<T: Value>(
y: impl Iterator<Item = T>,
y_fit: impl Iterator<Item = T>,
k: T,
) -> T {
let (r2, n) = r_squared_with_n(y, y_fit);
r2 - (T::one() - r2) * k / (n - k)
}Here, I need to consume the iterators to get the count for the adjusted r squared, but I also need the r squared value. I think your accumulator is probably the cleanest way to handle this, since we can also use it to centralize the Kahan (or Welford for the stdev) I should be using for the sums to avoid numerical instability. But it would be focused on collecting intermediate values for a single metric, rather than for cross-metric use, which I've more or less eliminated now For median, I will switch to using quickselect through |
|
What do you think of something like this? use crate::value::Value;
pub struct KahanAccumulator<T: Value> {
sum: T,
kahan: T,
}
impl<T: Value> Default for KahanAccumulator<T> {
fn default() -> Self {
KahanAccumulator {
sum: T::zero(),
kahan: T::zero(),
}
}
}
impl<T: Value> KahanAccumulator<T> {
pub fn add(&mut self, value: T) {
let y = value - self.kahan;
let t = self.sum + y;
self.kahan = (t - self.sum) - y;
self.sum = t;
}
pub fn sum(&self) -> T {
self.sum
}
}
pub struct RSquaredAccumulator<T: Value> {
count: T,
sum: KahanAccumulator<T>,
sum_sq: KahanAccumulator<T>,
sum_res_sq: KahanAccumulator<T>,
}
impl<T: Value> Default for RSquaredAccumulator<T> {
fn default() -> Self {
RSquaredAccumulator {
count: T::zero(),
sum: KahanAccumulator::default(),
sum_sq: KahanAccumulator::default(),
sum_res_sq: KahanAccumulator::default(),
}
}
}
impl<T: Value> RSquaredAccumulator<T> {
pub fn add(&mut self, y: T, y_fit: T) {
self.count += T::one();
self.sum.add(y);
self.sum_sq.add(y * y);
self.sum_res_sq.add(Value::powi(y - y_fit, 2));
}
pub fn r_squared(&self) -> Option<T> {
if self.count == T::zero() {
return None;
}
let mean = self.sum.sum() / self.count;
let tss = self.sum_sq.sum() - self.count * mean * mean;
let r2 = T::one() - (self.sum_res_sq.sum() / tss);
Some(r2)
}
pub fn count(&self) -> T {
self.count
}
}
impl<T: Value> std::iter::FromIterator<(T, T)> for RSquaredAccumulator<T> {
fn from_iter<I: IntoIterator<Item = (T, T)>>(iter: I) -> Self {
let mut acc = RSquaredAccumulator::default();
for (y, y_fit) in iter {
acc.add(y, y_fit);
}
acc
}
}It gives us a public facing implementation like this: pub fn adjusted_r_squared<T: Value>(
y: impl Iterator<Item = T>,
y_fit: impl Iterator<Item = T>,
k: T,
) -> Option<T> {
let acc: RSquaredAccumulator<T> = y.zip(y_fit).collect();
let r2 = acc.r_squared()?;
let n = acc.count();
if n <= k {
return None; // Not enough data points to compute adjusted R²
}
Some(r2 - (T::one() - r2) * k / (n - k))
}
pub fn r_squared<T: Value>(
y: impl Iterator<Item = T>,
y_fit: impl Iterator<Item = T>,
) -> Option<T> {
let acc: RSquaredAccumulator<T> = y.zip(y_fit).collect();
acc.r_squared()
} |
|
This is what I've come up with so far, it's a lot more maintainable and definitely have a lot fewer allocations I'm still not really happy with the structure overall just because I couldn't think of a really clean way to turn this into a trait, ideally that could be unified so that I could make a parallel iterator version if the features turned on https://github.com/rscarson/polyfit/blob/stats_rework/src%2Fstatistics%2Faccumulator.rs |
|
Yeah, the more I think about the the tuple trick I had in here doesn't really solve the core problem, since it doesn't have the "dependency graph" of getting things from other things. What you have there with accumulator objects looks good to me. Especially since those can work well with |
Your note about tuples in #12 (comment)
Made me think of taking that ad-hoc way but turning it extensible by having types that can be combined into tuples so you can use any combination of them.
So here's a sketch of that idea 🙂
Definitely a draft in that I haven't really thought through the implications of it for more than the two that I implemented, and it might be that this makes no sense for more complex things without well-known online versions. But maybe it'll inspire you to come up with a better version. (Feel free to just close if you go a different direction.)