Skip to content

JimCooke/t-digest

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

tdigest

This is an implementation of Ted Dunning's t-digest in Rust.

The implementaion is based off Derrick Burns' C++ implementation.

This repository has moved to https://gitlab.com/jimcooke.rust.public/tdigest (When Microsoft buys the git engine I use, written by someone they tried so hard to eradicate in the '90s, it is time for me to move elsewehere). By the way, gitlab seriously rocks and, with hindsight, I would have used them from the start had I known.

Example

extern crate tdigest;

fn main() {
    let mut t = tdigest::Tdigest::new(1000.0);

    // Compute Quantiles
    println!("  50th percentile is {}", t.quantile(0.50));
    println!("  75th percentile is {}", t.quantile(0.75));
    println!("  90th percentile is {}", t.quantile(0.90));
    println!("  99th percentile is {}", t.quantile(0.99));

   // Compute CDFs
    println!("cdf(1) {}", t.cdf(1.0));
    println!("cdf(2) {}", t.cdf(2.0));
    println!("cdf(3) {}", t.cdf(3.0));
    println!("cdf(4) {}", t.cdf(4.0));
    println!("cdf(5) {}", t.cdf(5.0));
}

Testing

Use the makefile from the root directory of this repository

make test

TODO

  • Implement MEAN, STDEV, COUNT, TOTAL
  • Repeat test changes ... load digest by reading test files.
  • Dump centroids
  • Save centroids
  • Reload centroids
  • Read data file file
  • Incorporate advice given by Ted Dunning himself for testing
    • Check out the tests for the Java version.
    • Test very large (>100,000) data sets, uniform and very skewed distributions
    • Test very small (<100) data sets
    • Test sorted and reverse sorted (which can affect accuracy and size)
    • Test mixed continuous and discrete distributions
    • Test a set that a large number of samples at a single point and all other points uniform but on the same side of the mass point
    • Test discrete distributions
    • Look at quantiles versus computed values from the original data
    • Look at number of centroids in the results
    • Look at error relative to minimum distance to 0 or 1
    • Verify that all centroids meet the size and delta k criteria
    • Verify that the sum of centroid weights equals the number of samples seen
    • Verify all the accuracies for direct insertion as well as for merging of multiple datasets
  • Once the tests pass, create a pull request to link this project to Ted Dunning's t-digest for other Rustaceans to find
  • Release the code as a cargo crate

About

A Rust language implementation of Ted Dunning's data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors