Academic forecast-comparison tools for paired loss series. The v0.2.0 API is
centered on compare_models(), which runs pairwise tests, bootstrap intervals,
and family-level procedures from one dictionary of per-observation losses.
Runtime dependencies are only numpy and scipy.
| Tool | Purpose |
|---|---|
| compare_models | One-call pairwise comparison workflow with IID, cluster, or stationary-bootstrap dependence handling |
| dm_test | Diebold-Mariano equal predictive accuracy test with Harvey-Leybourne-Newbold small-sample correction |
| gw_test | Giacomini-White conditional predictive ability test |
| stability_diagnostic | Andrews-style sup-F and CUSUM diagnostic for changes in the loss differential |
| stationary_bootstrap_ci | Politis-Romano stationary block-bootstrap CI and centered p-value |
| model_confidence_set | Hansen-Lunde-Nason Model Confidence Set |
| bh_fdr | Benjamini-Hochberg FDR correction |
| loss | Squared, absolute, quantile, log, and 0/1 losses with strict shape and finite-value validation |
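
To make the dm_test row concrete, the following is a minimal sketch of a Diebold-Mariano test with the Harvey-Leybourne-Newbold correction, written with numpy and scipy only. It is not the package's implementation: the function name `dm_test_sketch`, its signature, and the rectangular-kernel long-run variance (autocovariances up to lag `h - 1`) are assumptions made for illustration, and the loss series are synthetic.

```python
import numpy as np
from scipy import stats


def dm_test_sketch(loss_a, loss_b, h=1):
    """Illustrative DM test with the HLN small-sample correction (hypothetical helper)."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    n = d.size
    d_bar = d.mean()
    centered = d - d_bar
    # Long-run variance of d: rectangular kernel, autocovariances up to lag h - 1.
    lrv = centered @ centered / n
    for lag in range(1, h):
        lrv += 2.0 * (centered[lag:] @ centered[:-lag]) / n
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")
    dm_stat = d_bar / np.sqrt(lrv / n)
    # HLN correction factor; the corrected statistic is compared to Student t, n - 1 df.
    correction = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    hln_stat = correction * dm_stat
    p_value = 2.0 * stats.t.sf(abs(hln_stat), df=n - 1)
    return hln_stat, p_value


# Synthetic squared-error-like losses, for demonstration only.
rng = np.random.default_rng(0)
loss_a = rng.normal(1.0, 1.0, size=200) ** 2
loss_b = rng.normal(1.0, 1.0, size=200) ** 2
stat, p = dm_test_sketch(loss_a, loss_b)
print(f"HLN-corrected DM statistic: {stat:.4f}, p-value: {p:.4f}")
```

For one-step forecasts (`h=1`) the corrected statistic in this sketch coincides with a one-sample t-test on the loss differential, which is a convenient sanity check.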
stability_diagnostic() treats forecast comparison as a time-indexed problem:
it estimates whether the mean loss differential changed, where the largest
break occurs, and whether CUSUM evidence agrees. This is meant for research
workflows where the question is not only whether two models differ on average,
but when that difference appears or disappears.
The diagnostic reports:
- an Andrews-style trimmed sup_f_stat and break_index
- a stationary-bootstrap p-value for the no-break null
- a CUSUM statistic and bootstrap p-value
- pre/post-break regime means for loss_a - loss_b
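
The quantities in the list above can be sketched as follows. This is an illustrative outline under stated assumptions, not the package's code: the helper names `sup_f_break` and `cusum_stat`, the 15% trimming fraction, and the synthetic loss differential are all hypothetical, and the bootstrap p-values that stability_diagnostic() reports are not reproduced here.

```python
import numpy as np


def sup_f_break(d, trim=0.15):
    """Scan trimmed candidate break points for a one-time shift in the mean of d."""
    d = np.asarray(d, dtype=float)
    n = d.size
    lo = max(int(np.floor(trim * n)), 2)
    hi = n - lo
    ssr_null = np.sum((d - d.mean()) ** 2)  # no-break model: a single mean
    best_f, best_k = -np.inf, None
    for k in range(lo, hi + 1):
        pre, post = d[:k], d[k:]
        ssr_alt = np.sum((pre - pre.mean()) ** 2) + np.sum((post - post.mean()) ** 2)
        # F statistic for one restriction (equal means) with n - 2 residual df.
        f_stat = (ssr_null - ssr_alt) / (ssr_alt / (n - 2))
        if f_stat > best_f:
            best_f, best_k = f_stat, k
    return best_f, best_k, d[:best_k].mean(), d[best_k:].mean()


def cusum_stat(d):
    """Maximum absolute scaled partial sum of the demeaned series."""
    d = np.asarray(d, dtype=float)
    partial = np.cumsum(d - d.mean())
    return np.max(np.abs(partial)) / (d.std(ddof=1) * np.sqrt(d.size))


# Synthetic loss differential whose mean shifts from 0 to 5 at index 120.
rng = np.random.default_rng(7)
t = np.arange(200)
d = np.where(t < 120, 0.0, 5.0) + rng.normal(0.0, 0.5, size=200)

sup_f, break_index, pre_mean, post_mean = sup_f_break(d)
cusum = cusum_stat(d)
print(f"sup-F = {sup_f:.1f} at index {break_index}")
print(f"regime means: pre = {pre_mean:.2f}, post = {post_mean:.2f}, CUSUM = {cusum:.2f}")
```

In this sketch the break index is the first observation of the post-break regime. Under serial dependence the sup-F statistic does not follow a standard F distribution, which is why a bootstrap null is the natural way to obtain p-values.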

Install with pip:

    pip install forecast-compare
    pip install "forecast-compare[examples]"

Requires Python 3.10+.
examples/m4_or_air_passengers.py ships a small AirPassengers CSV and runs
rolling one-step forecasts from three simple time-series models.

    python examples/m4_or_air_passengers.py

Core usage:

    from forecast_compare import compare_models, stability_diagnostic
    from forecast_compare.loss import squared_error

    # actual and the *_forecast arrays are same-shape, finite arrays supplied by the caller.
    losses = {
        "seasonal_naive": squared_error(actual, seasonal_naive_forecast),
        "drift": squared_error(actual, drift_forecast),
        "exp_smooth": squared_error(actual, exp_smooth_forecast),
    }

    # Pairwise comparison with stationary-bootstrap dependence handling.
    report = compare_models(
        losses,
        dependence="stationary",
        block_length="auto",
        family_method="maxT_stepdown",
        n_bootstrap=2_000,
        seed=7,
    )
    print(report.summary())

    # Break diagnostic for the seasonal_naive - exp_smooth loss differential.
    stability = stability_diagnostic(
        losses["seasonal_naive"],
        losses["exp_smooth"],
        n_bootstrap=1_000,
        seed=7,
    )
    print(stability.summary())

Example output:

    model_a         model_b     mean_diff  dm_p       boot_p  adjusted_p  significant
    --------------  ----------  ---------  ---------  ------  ----------  -----------
    seasonal_naive  drift       -11.1937   0.9611     0.9595  0.9715      False
    seasonal_naive  exp_smooth  -1250.4    0.001138   0.01    0.0095      True
    drift           exp_smooth  -1239.21   0.0007634  0.0005  0.008       True

The secondary IID example is examples/cross_sectional_compare.py, using the
scikit-learn diabetes dataset with a 40% test split.
compare_models(..., family_method=...) supports:
- "none": report raw pairwise p-values
- "bh_fdr": Benjamini-Hochberg adjusted p-values
- "maxT_stepdown": Westfall-Young maxT stepdown using a joint bootstrap null
- "mcs": Model Confidence Set returned in report.mcs_set
Use dependence="stationary" for serially dependent time-series loss
differentials, dependence="cluster" when resampling labeled groups, and
dependence="iid" for cross-sectional paired observations.
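
As a rough illustration of what stationary resampling means for a serially dependent loss differential, the sketch below draws Politis-Romano style resamples (geometric block lengths with circular wrap-around) and forms a percentile interval for the mean. The helper names, the fixed mean block length, and the synthetic AR(1) series are assumptions for this example; the package's own routine and its "auto" block-length selection are not reproduced here.

```python
import numpy as np


def stationary_bootstrap_indices(n, mean_block_length, rng):
    """Indices for one stationary-bootstrap resample of a length-n series."""
    p = 1.0 / mean_block_length           # probability of starting a new block
    starts = rng.integers(0, n, size=n)   # candidate block start positions
    new_block = rng.random(n) < p
    new_block[0] = True                   # the first draw always starts a block
    idx = np.empty(n, dtype=np.int64)
    for t in range(n):
        # Either jump to a fresh random start or continue the current block,
        # wrapping around the end of the series.
        idx[t] = starts[t] if new_block[t] else (idx[t - 1] + 1) % n
    return idx


def stationary_bootstrap_mean_ci(d, mean_block_length=10.0, n_bootstrap=1000,
                                 alpha=0.05, seed=0):
    """Percentile interval for the mean of d from stationary-bootstrap resamples."""
    d = np.asarray(d, dtype=float)
    rng = np.random.default_rng(seed)
    means = np.array([
        d[stationary_bootstrap_indices(d.size, mean_block_length, rng)].mean()
        for _ in range(n_bootstrap)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi


# Synthetic AR(1) loss differential, for demonstration only.
rng = np.random.default_rng(1)
noise = rng.normal(size=150)
d = np.empty(150)
d[0] = noise[0]
for t in range(1, 150):
    d[t] = 0.5 * d[t - 1] + noise[t]

lo, hi = stationary_bootstrap_mean_ci(d, mean_block_length=8.0, n_bootstrap=500)
print(f"95% interval for the mean differential: [{lo:.3f}, {hi:.3f}]")
```

Setting the mean block length to 1 makes every draw start a new block, which reduces this sketch to ordinary IID resampling.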
Loss functions require matching shapes and finite values; they do not use NumPy
broadcasting. Statistical routines default to missing="raise". dm_test,
paired_bootstrap_ci, and cluster_bootstrap_ci also support
missing="drop" for paired deletion.
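
The validation and paired-deletion behaviour described above can be sketched like this. The helpers `squared_error_strict` and `drop_missing_pairs` are hypothetical stand-ins written for this example, and treating NaN as the missing-value marker is an assumption; the package's own functions may differ in naming and detail.

```python
import numpy as np


def squared_error_strict(actual, forecast):
    """Squared error that refuses to broadcast and rejects non-finite inputs."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    if actual.shape != forecast.shape:
        raise ValueError(f"shape mismatch: {actual.shape} vs {forecast.shape}")
    if not (np.isfinite(actual).all() and np.isfinite(forecast).all()):
        raise ValueError("inputs must contain only finite values")
    return (actual - forecast) ** 2


def drop_missing_pairs(loss_a, loss_b):
    """Paired deletion: keep only positions where both loss series are present."""
    a = np.asarray(loss_a, dtype=float)
    b = np.asarray(loss_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    keep = ~np.isnan(a) & ~np.isnan(b)  # assumption: NaN marks a missing loss
    return a[keep], b[keep]


print(squared_error_strict([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))

try:
    # NumPy would broadcast (2,) against (1,); the strict check raises instead.
    squared_error_strict([1.0, 2.0], [1.0])
except ValueError as exc:
    print("rejected:", exc)

a, b = drop_missing_pairs([1.0, np.nan, 3.0], [1.0, 2.0, np.nan])
print(a, b)
```

Deleting pairs rather than single values keeps the two loss series aligned, so the loss differential is still computed observation by observation.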
Reproducible Monte-Carlo size and power studies live in benchmarks/. They
verify that dm_test, stability_diagnostic, and model_confidence_set
behave the way the underlying papers predict. Committed result tables:
Reproduce locally with:

    python benchmarks/dm_size_power.py
    python benchmarks/fsd_size_power.py
    python benchmarks/mcs_coverage.py

v0.3.0 candidates:
- plotting helpers for comparison reports and stability diagnostics
- Bayesian forecast-comparison summaries
- Hansen SPA / related benchmark procedures
If you use this package in academic work, cite the underlying methods:
- Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica, 61(4), 821-856.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300.
- Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253-263.
- Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545-1578.
- Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The model confidence set. Econometrica, 79(2), 453-497.
- Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281-291.
- Newey, W. K., & West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703-708.
- Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303-1313.
- Politis, D. N., & White, H. (2004). Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23(1), 53-70.
MIT. Copyright 2026 forecast-compare contributors.