studyalwaysbro/forecast-comparison-lab

forecast-compare

Academic forecast-comparison tools for paired loss series. The v0.2.0 API is centered on compare_models(), which runs pairwise tests, bootstrap intervals, and family-level procedures from one dictionary of per-observation losses.

Runtime dependencies are only numpy and scipy.

What It Does

Tool                     Purpose
-----------------------  -------
compare_models           One-call pairwise comparison workflow with IID, cluster, or stationary-bootstrap dependence handling
dm_test                  Diebold-Mariano equal predictive accuracy test with Harvey-Leybourne-Newbold small-sample correction
gw_test                  Giacomini-White conditional predictive ability test
stability_diagnostic     Andrews-style sup-F and CUSUM diagnostic for changes in the loss differential
stationary_bootstrap_ci  Politis-Romano stationary block-bootstrap CI and centered p-value
model_confidence_set     Hansen-Lunde-Nason Model Confidence Set
bh_fdr                   Benjamini-Hochberg FDR correction
loss                     Squared, absolute, quantile, log, and 0/1 losses with strict shape and finite-value validation
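
For orientation, the Diebold-Mariano statistic with the Harvey-Leybourne-Newbold correction can be sketched in plain Python. This is an illustrative reading of the published formulas, not the package's implementation: the helper name dm_hln_statistic, the horizon argument h, and the rectangular-kernel variance with h - 1 autocovariances are assumptions made for the example.

```python
import math


def dm_hln_statistic(loss_a, loss_b, h=1):
    """Return (statistic, degrees_of_freedom) for an HLN-corrected DM test.

    Hypothetical helper for illustration only. `h` is the forecast horizon;
    the long-run variance uses a rectangular kernel with h - 1 autocovariances.
    """
    if len(loss_a) != len(loss_b):
        raise ValueError("loss series must have the same length")
    d = [a - b for a, b in zip(loss_a, loss_b)]  # loss differential
    n = len(d)
    if n < 2 or not 1 <= h < n:
        raise ValueError("need n >= 2 and 1 <= h < n")
    d_bar = sum(d) / n

    def autocov(k):
        # Sample autocovariance of the differential at lag k (divisor n).
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")

    dm = d_bar / math.sqrt(long_run_var / n)
    # Harvey-Leybourne-Newbold small-sample correction factor.
    correction = math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return dm * correction, n - 1


stat, dof = dm_hln_statistic([1.0, 2.0, 4.0, 3.0, 5.0], [0.5, 2.5, 3.0, 2.0, 4.5])
print(round(stat, 3), dof)  # 1.826 4
```

For h = 1 the corrected statistic coincides with the ordinary paired t-statistic on the loss differential. A two-sided p-value would then come from a Student t distribution with n - 1 degrees of freedom, for example via scipy.stats.t.sf.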

What This Package Does That No Other One Does

stability_diagnostic() treats forecast comparison as a time-indexed problem: it estimates whether the mean loss differential changed, where the largest break occurs, and whether CUSUM evidence agrees. This is meant for research workflows where the question is not only whether two models differ on average, but when that difference appears or disappears.

The diagnostic reports:

  • an Andrews-style trimmed sup_f_stat and break_index
  • a stationary-bootstrap p-value for the no-break null
  • a CUSUM statistic and bootstrap p-value
  • pre/post-break regime means for loss_a - loss_b
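
The break search behind a trimmed sup-F statistic can be sketched as follows. This is a minimal illustration for a single mean shift with homoskedastic errors. The function name sup_f_mean_shift, the 15% trimming default, and the return layout are assumptions, and the bootstrap p-values and CUSUM statistic reported by the package are not reproduced here.

```python
def sup_f_mean_shift(diff, trim=0.15):
    """Scan candidate break points for the largest mean-shift F statistic.

    Hypothetical helper. Returns (sup_f, break_index, pre_mean, post_mean),
    where break_index is the first observation of the post-break regime.
    """
    n = len(diff)
    lo = max(2, int(trim * n))           # earliest candidate break
    hi = min(n - 2, n - int(trim * n))   # latest candidate break
    if lo > hi:
        raise ValueError("series too short for the requested trimming")

    total_mean = sum(diff) / n
    ssr_restricted = sum((x - total_mean) ** 2 for x in diff)  # no-break model

    best = None
    for k in range(lo, hi + 1):
        pre, post = diff[:k], diff[k:]
        pre_mean = sum(pre) / len(pre)
        post_mean = sum(post) / len(post)
        ssr = (sum((x - pre_mean) ** 2 for x in pre)
               + sum((x - post_mean) ** 2 for x in post))
        # F statistic for one restriction (equal means) with n - 2 residual df.
        f_stat = float("inf") if ssr <= 0 else (ssr_restricted - ssr) / (ssr / (n - 2))
        if best is None or f_stat > best[0]:
            best = (f_stat, k, pre_mean, post_mean)
    return best


# Loss differential that jumps from about 0 to about 2 at index 10.
diff = [0.0, 0.2, -0.1, 0.1, -0.2, 0.0, 0.1, -0.1, 0.0, 0.1,
        2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]
sup_f, break_index, pre_mean, post_mean = sup_f_mean_shift(diff)
print(break_index)  # 10
```

Because the maximum is taken over many candidate break points, the sup-F statistic does not follow a standard F distribution, which is why a bootstrap (or Andrews-type critical values) is needed to judge significance.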

Install

pip install forecast-compare
pip install "forecast-compare[examples]"

Requires Python 3.10+.

Headline Example: AirPassengers

examples/m4_or_air_passengers.py ships a small AirPassengers CSV and runs rolling one-step forecasts from three simple time-series models.

python examples/m4_or_air_passengers.py

Core usage:

from forecast_compare import compare_models, stability_diagnostic
from forecast_compare.loss import squared_error

losses = {
    "seasonal_naive": squared_error(actual, seasonal_naive_forecast),
    "drift": squared_error(actual, drift_forecast),
    "exp_smooth": squared_error(actual, exp_smooth_forecast),
}

report = compare_models(
    losses,
    dependence="stationary",
    block_length="auto",
    family_method="maxT_stepdown",
    n_bootstrap=2_000,
    seed=7,
)
print(report.summary())

stability = stability_diagnostic(
    losses["seasonal_naive"],
    losses["exp_smooth"],
    n_bootstrap=1_000,
    seed=7,
)
print(stability.summary())

Example output:

model_a         model_b     mean_diff  dm_p       boot_p  adjusted_p  significant
--------------  ----------  ---------  ---------  ------  ----------  -----------
seasonal_naive  drift       -11.1937   0.9611     0.9595  0.9715      False
seasonal_naive  exp_smooth  -1250.4    0.001138   0.01    0.0095      True
drift           exp_smooth  -1239.21   0.0007634  0.0005  0.008       True

The secondary IID example is examples/cross_sectional_compare.py, using the scikit-learn diabetes dataset with a 40% test split.

Family-Level Procedures

compare_models(..., family_method=...) supports:

  • "none": report raw pairwise p-values
  • "bh_fdr": Benjamini-Hochberg adjusted p-values
  • "maxT_stepdown": Westfall-Young maxT stepdown using a joint bootstrap null
  • "mcs": Model Confidence Set returned in report.mcs_set
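
As a reference point for the "bh_fdr" option, the Benjamini-Hochberg step-up adjustment can be written in a few lines. This is a generic sketch of the published procedure. The function name bh_adjust is hypothetical and is not the package's bh_fdr signature.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print([round(p, 4) for p in bh_adjust([0.01, 0.04, 0.03, 0.20])])
# [0.04, 0.0533, 0.0533, 0.2]
```

Rejecting every hypothesis whose adjusted p-value is at most q controls the false discovery rate at level q under independence or positive dependence of the test statistics.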

Use dependence="stationary" for serially dependent time-series loss differentials, dependence="cluster" when resampling labeled groups, and dependence="iid" for cross-sectional paired observations.
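
To make the "stationary" option concrete, the sketch below shows how Politis-Romano resampling builds each bootstrap series from blocks with geometrically distributed lengths, then uses it for a percentile interval and a centered p-value for the mean loss differential. It is a simplified pure-Python illustration, not the package's stationary_bootstrap_ci. The function names, the fixed mean_block_length default, and the (1 + count) / (B + 1) p-value convention are assumptions.

```python
import random


def stationary_bootstrap_indices(n, mean_block_length, rng):
    """Draw n resampling indices with geometric block lengths and wrap-around."""
    if n < 1 or mean_block_length < 1:
        raise ValueError("need n >= 1 and mean_block_length >= 1")
    p_new_block = 1.0 / mean_block_length
    indices = [rng.randrange(n)]
    while len(indices) < n:
        if rng.random() < p_new_block:
            indices.append(rng.randrange(n))        # start a new block
        else:
            indices.append((indices[-1] + 1) % n)   # continue the current block
    return indices


def bootstrap_mean(diff, mean_block_length=5.0, n_bootstrap=1000, level=0.95, seed=0):
    """Return (lower, upper, p_value) for the mean of a loss differential."""
    rng = random.Random(seed)
    n = len(diff)
    d_bar = sum(diff) / n
    means = sorted(
        sum(diff[i] for i in stationary_bootstrap_indices(n, mean_block_length, rng)) / n
        for _ in range(n_bootstrap)
    )
    alpha = 1.0 - level
    lower = means[int((alpha / 2) * (n_bootstrap - 1))]      # percentile interval
    upper = means[int((1 - alpha / 2) * (n_bootstrap - 1))]
    # Centered p-value: recenter the bootstrap means at d_bar to mimic the null.
    extreme = sum(abs(m - d_bar) >= abs(d_bar) for m in means)
    return lower, upper, (1 + extreme) / (n_bootstrap + 1)
```

Setting mean_block_length to 1 starts a new block at every step, which reduces the scheme to ordinary IID resampling. Longer blocks preserve more of the serial dependence in the loss differential.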

Missing Values And Shapes

Loss functions require matching shapes and finite values; they do not use NumPy broadcasting. Statistical routines default to missing="raise". dm_test, paired_bootstrap_ci, and cluster_bootstrap_ci also support missing="drop" for paired deletion.
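
The difference between the two missing-value policies can be illustrated with a small helper. This is a hypothetical sketch, not the package's code. The name apply_missing_policy is invented for the example, and it assumes that NaN is the missing-value marker.

```python
import math


def apply_missing_policy(loss_a, loss_b, missing="raise"):
    """Return the two loss series after applying a missing-value policy.

    "raise" rejects any NaN; "drop" removes every position where either
    series is NaN, so the remaining observations stay paired.
    """
    if missing not in ("raise", "drop"):
        raise ValueError("missing must be 'raise' or 'drop'")
    if len(loss_a) != len(loss_b):
        raise ValueError("loss series must have the same length")

    keep = [not (math.isnan(a) or math.isnan(b)) for a, b in zip(loss_a, loss_b)]
    if missing == "raise" and not all(keep):
        raise ValueError("missing values found in the loss series")

    kept_a = [a for a, k in zip(loss_a, keep) if k]
    kept_b = [b for b, k in zip(loss_b, keep) if k]
    return kept_a, kept_b


nan = float("nan")
print(apply_missing_policy([1.0, nan, 3.0, 4.0], [0.5, 2.0, nan, 3.5], missing="drop"))
# ([1.0, 4.0], [0.5, 3.5])
```

Dropping pairs changes the spacing between observations, so for serially dependent losses it is worth checking how many positions were removed before relying on a time-series test.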

Benchmarks

Reproducible Monte-Carlo size and power studies live in benchmarks/. They check that dm_test, stability_diagnostic, and model_confidence_set show the size and power behavior predicted by the underlying papers. Result tables are committed to the repository.

Reproduce locally with:

python benchmarks/dm_size_power.py
python benchmarks/fsd_size_power.py
python benchmarks/mcs_coverage.py
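
The idea behind a size and power study can be sketched in a few lines. This is not the content of the benchmark scripts. It is a simplified stand-in that simulates IID Gaussian loss differentials, applies a paired t-statistic with a normal critical value, and counts rejections. The function name rejection_rate and all defaults are assumptions.

```python
import math
import random


def rejection_rate(n_obs=100, n_sims=2000, mean_shift=0.0, critical_value=1.96, seed=0):
    """Estimate how often a two-sided test rejects equal predictive accuracy.

    mean_shift = 0 simulates the null (empirical size); a nonzero value
    simulates an alternative (empirical power).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        d = [rng.gauss(mean_shift, 1.0) for _ in range(n_obs)]  # loss differential
        d_bar = sum(d) / n_obs
        var = sum((x - d_bar) ** 2 for x in d) / (n_obs - 1)
        t_stat = d_bar / math.sqrt(var / n_obs)
        rejections += abs(t_stat) > critical_value
    return rejections / n_sims


size = rejection_rate(mean_shift=0.0)   # should be close to the nominal 5%
power = rejection_rate(mean_shift=0.5)  # should be close to 1 at this effect size
print(size, power)
```

A well-calibrated test has empirical size near its nominal level under the null and power that rises with the effect size and the sample length.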

Roadmap

v0.3.0 candidates:

  • plotting helpers for comparison reports and stability diagnostics
  • Bayesian forecast-comparison summaries
  • Hansen SPA / related benchmark procedures

Citations

If you use this package in academic work, cite the underlying methods:

  • Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica, 61(4), 821-856.
  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.
  • Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253-263.
  • Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545-1578.
  • Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The model confidence set. Econometrica, 79(2), 453-497.
  • Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281-291.
  • Newey, W. K., & West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703-708.
  • Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303-1313.
  • Politis, D. N., & White, H. (2004). Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23(1), 53-70.

License

MIT. Copyright 2026 forecast-compare contributors.
