forecast-compare

Academic forecast-comparison tools for paired loss series. The v0.2.0 API is centered on compare_models(), which runs pairwise tests, computes bootstrap intervals, and applies family-level procedures from a single dictionary of per-observation losses.

The only runtime dependencies are numpy and scipy.

What It Does

| Tool | Purpose |
| --- | --- |
| `compare_models` | One-call pairwise comparison workflow with IID, cluster, or stationary-bootstrap dependence handling |
| `dm_test` | Diebold-Mariano equal predictive accuracy test with Harvey-Leybourne-Newbold small-sample correction |
| `gw_test` | Giacomini-White conditional predictive ability test |
| `stability_diagnostic` | Andrews-style sup-F and CUSUM diagnostic for changes in the loss differential |
| `stationary_bootstrap_ci` | Politis-Romano stationary block-bootstrap CI and centered p-value |
| `model_confidence_set` | Hansen-Lunde-Nason Model Confidence Set |
| `bh_fdr` | Benjamini-Hochberg FDR correction |
| `loss` | Squared, absolute, quantile, log, and 0/1 losses with strict shape and finite-value validation |
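
To make the `dm_test` row concrete, here is a minimal NumPy/SciPy sketch of a Diebold-Mariano statistic with the Harvey-Leybourne-Newbold correction. It is not the package's implementation: the helper name `dm_hln_sketch`, its signature, and the rectangular-kernel long-run variance are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def dm_hln_sketch(loss_a, loss_b, h=1):
    """Illustrative DM test with HLN correction (hypothetical helper, not the package API)."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    n = d.size
    d_bar = d.mean()
    dc = d - d_bar
    # Long-run variance from autocovariances up to lag h-1 (rectangular kernel).
    # For h > 1 this estimate can be non-positive in small samples.
    lrv = dc @ dc / n
    for k in range(1, h):
        lrv += 2.0 * (dc[k:] @ dc[:-k]) / n
    dm = d_bar / np.sqrt(lrv / n)
    # HLN small-sample factor; the corrected statistic is compared with Student-t(n-1).
    factor = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    stat = factor * dm
    p_value = 2.0 * stats.t.sf(abs(stat), df=n - 1)
    return stat, p_value


# Synthetic losses: model b is worse by about 1.0 on every observation.
rng = np.random.default_rng(0)
loss_a = rng.chisquare(1, size=200)
loss_b = loss_a + 1.0 + rng.normal(scale=0.1, size=200)
stat, p_value = dm_hln_sketch(loss_a, loss_b)
```

A negative statistic means the first series has the lower average loss, matching the loss_a - loss_b sign convention used elsewhere in this README.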

What This Package Does That No Other One Does

stability_diagnostic() treats forecast comparison as a time-indexed problem: it estimates whether the mean loss differential changed, where the largest break occurs, and whether CUSUM evidence agrees. This is meant for research workflows where the question is not only whether two models differ on average, but when that difference appears or disappears.

The diagnostic reports:

an Andrews-style trimmed sup_f_stat and break_index
a stationary-bootstrap p-value for the no-break null
a CUSUM statistic and bootstrap p-value
pre/post-break regime means for loss_a - loss_b
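
The first and last items above can be illustrated with a small NumPy sketch of a trimmed sup-F search for a single mean shift in the loss differential. This is only a sketch under assumptions: the helper name `sup_f_mean_break`, the 15% trimming default, and the plain homoskedastic F-statistic are hypothetical choices, and it omits the bootstrap p-values that the diagnostic reports.

```python
import numpy as np


def sup_f_mean_break(d, trim=0.15):
    """Return (sup_f, break_index, pre_mean, post_mean) for one mean shift (illustrative)."""
    d = np.asarray(d, dtype=float)
    n = d.size
    lo = max(int(np.floor(trim * n)), 2)  # smallest admissible regime length
    hi = n - lo
    ssr_null = np.sum((d - d.mean()) ** 2)
    best_f, best_k = -np.inf, None
    for k in range(lo, hi + 1):
        pre, post = d[:k], d[k:]  # k is the first index of the post-break regime
        ssr_alt = np.sum((pre - pre.mean()) ** 2) + np.sum((post - post.mean()) ** 2)
        f_stat = (ssr_null - ssr_alt) / (ssr_alt / (n - 2))
        if f_stat > best_f:
            best_f, best_k = f_stat, k
    return best_f, best_k, d[:best_k].mean(), d[best_k:].mean()


# Synthetic loss differential whose mean jumps from 0 to 2 at index 100.
rng = np.random.default_rng(1)
d = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
sup_f, break_index, pre_mean, post_mean = sup_f_mean_break(d)
```

Because the statistic is a maximum over candidate break points, its null distribution is not a standard F distribution, which is why a resampling-based p-value is the natural companion.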

Install

pip install forecast-compare
pip install "forecast-compare[examples]"

Requires Python 3.10+.

Headline Example: AirPassengers

examples/m4_or_air_passengers.py ships a small AirPassengers CSV and runs rolling one-step forecasts from three simple time-series models.

python examples/m4_or_air_passengers.py

Core usage:

from forecast_compare import compare_models, stability_diagnostic
from forecast_compare.loss import squared_error

# actual and each *_forecast array must share one shape and contain only finite values.
losses = {
    "seasonal_naive": squared_error(actual, seasonal_naive_forecast),
    "drift": squared_error(actual, drift_forecast),
    "exp_smooth": squared_error(actual, exp_smooth_forecast),
}

report = compare_models(
    losses,
    dependence="stationary",
    block_length="auto",
    family_method="maxT_stepdown",
    n_bootstrap=2_000,
    seed=7,
)
print(report.summary())

stability = stability_diagnostic(
    losses["seasonal_naive"],
    losses["exp_smooth"],
    n_bootstrap=1_000,
    seed=7,
)
print(stability.summary())

Example output:

model_a         model_b     mean_diff  dm_p       boot_p  adjusted_p  significant
--------------  ----------  ---------  ---------  ------  ----------  -----------
seasonal_naive  drift       -11.1937   0.9611     0.9595  0.9715      False
seasonal_naive  exp_smooth  -1250.4    0.001138   0.01    0.0095      True
drift           exp_smooth  -1239.21   0.0007634  0.0005  0.008       True

The secondary IID example is examples/cross_sectional_compare.py, using the scikit-learn diabetes dataset with a 40% test split.

Family-Level Procedures

compare_models(..., family_method=...) supports:

"none": report raw pairwise p-values
"bh_fdr": Benjamini-Hochberg adjusted p-values
"maxT_stepdown": Westfall-Young maxT stepdown using a joint bootstrap null
"mcs": Model Confidence Set returned in report.mcs_set
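
As a point of reference for the "bh_fdr" option, the following sketch computes Benjamini-Hochberg adjusted p-values with NumPy. The helper name `bh_adjust` is hypothetical and the code is not taken from the package; it only shows the standard step-up adjustment.

```python
import numpy as np


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (illustrative helper, not the package API)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = bh_adjust(raw)
```

Adjusted values are never smaller than the raw p-values, and a hypothesis is rejected at FDR level q when its adjusted value is at most q.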

Use dependence="stationary" for serially dependent time-series loss differentials, dependence="cluster" when resampling labeled groups, and dependence="iid" for cross-sectional paired observations.
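
To show what stationary resampling means in practice, here is a hedged sketch of Politis-Romano index generation followed by a percentile interval for the mean loss differential. The function name `stationary_bootstrap_indices`, the fixed mean block length of 10, and the percentile interval are assumptions for illustration and may differ from what the package does internally.

```python
import numpy as np


def stationary_bootstrap_indices(n, mean_block_length, rng):
    """Draw one stationary-bootstrap index path of length n (illustrative helper)."""
    p = 1.0 / mean_block_length  # probability of starting a new block at each step
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    new_block = rng.random(n) < p
    starts = rng.integers(n, size=n)
    for t in range(1, n):
        # Either jump to a fresh random start or continue the block, wrapping at the end.
        idx[t] = starts[t] if new_block[t] else (idx[t - 1] + 1) % n
    return idx


rng = np.random.default_rng(7)
d = rng.normal(0.5, 1.0, size=300)  # synthetic loss differential
boot_means = np.array(
    [d[stationary_bootstrap_indices(d.size, 10, rng)].mean() for _ in range(500)]
)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

With a mean block length of 1 every step starts a new block, so the scheme reduces to ordinary IID resampling of paired observations.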

Missing Values And Shapes

Loss functions require matching shapes and finite values; they do not use NumPy broadcasting. Statistical routines default to missing="raise". dm_test, paired_bootstrap_ci, and cluster_bootstrap_ci also support missing="drop" for paired deletion.
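
Paired deletion can be pictured as follows. This is a sketch of the idea, assuming 1-D inputs; the helper name `drop_incomplete_pairs` is hypothetical and the package's own handling of missing="drop" may differ in detail.

```python
import numpy as np


def drop_incomplete_pairs(loss_a, loss_b):
    """Keep only positions where both 1-D loss series are finite (illustrative helper)."""
    a = np.asarray(loss_a, dtype=float)
    b = np.asarray(loss_b, dtype=float)
    if a.shape != b.shape:
        # Mirror the strict-shape rule: no broadcasting between series.
        raise ValueError("loss series must have identical shapes")
    keep = np.isfinite(a) & np.isfinite(b)
    return a[keep], b[keep]


a_kept, b_kept = drop_incomplete_pairs(
    [1.0, np.nan, 3.0, 4.0],
    [2.0, 2.5, np.nan, 1.0],
)
# a_kept is [1.0, 4.0] and b_kept is [2.0, 1.0]: a pair is dropped if either side is missing.
```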

Benchmarks

Reproducible Monte-Carlo size and power studies live in benchmarks/. They check that dm_test, stability_diagnostic, and model_confidence_set behave as the underlying papers predict. Result tables are committed to the repository.

Reproduce locally with:

python benchmarks/dm_size_power.py
python benchmarks/fsd_size_power.py
python benchmarks/mcs_coverage.py
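
The benchmark scripts are not reproduced here. As a rough illustration of what a size study measures, the sketch below simulates loss differentials under a true null and counts how often a test rejects at the 5% level. It uses SciPy's one-sample t-test as a stand-in for the package's tests, and the function name `empirical_size` and all simulation settings are assumptions.

```python
import numpy as np
from scipy import stats


def empirical_size(n_obs=100, n_reps=2000, alpha=0.05, seed=0):
    """Share of null simulations in which the test rejects (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated loss differential with true mean zero.
    d = rng.normal(size=(n_reps, n_obs))
    p_values = stats.ttest_1samp(d, popmean=0.0, axis=1).pvalue
    return float(np.mean(p_values < alpha))


size = empirical_size()
```

A well-calibrated test gives a rejection rate close to the nominal level; a power study repeats the exercise with a nonzero mean and looks for a high rejection rate instead.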

Roadmap

v0.3.0 candidates:

plotting helpers for comparison reports and stability diagnostics
Bayesian forecast-comparison summaries
Hansen SPA / related benchmark procedures

Citations

If you use this package in academic work, cite the underlying methods:

Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica, 61(4), 821-856.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS-B, 57(1), 289-300.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253-263.
Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545-1578.
Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The model confidence set. Econometrica, 79(2), 453-497.
Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281-291.
Newey, W. K., & West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703-708.
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303-1313.
Politis, D. N., & White, H. (2004). Automatic block-length selection for the dependent bootstrap. Econometric Reviews, 23(1), 53-70.
