sofiane8910/Market_regime_prediction


Regime Prediction

A comprehensive machine learning system for detecting and predicting market regime changes using multi-asset financial data. This project employs a three-algorithm pipeline: PCA (dimensionality reduction), KMeans (unsupervised clustering), and XGBoost (supervised learning) to identify distinct market regimes and forecast regime transitions.

Overview

This repository contains a complete regime detection and prediction pipeline that analyzes 26+ years of daily financial market data, identifies four distinct market regimes, and predicts the next regime and regime transitions with 66.3% and 67.7% accuracy, respectively. Accuracy is reported for reference; the practical value comes mainly from transition detection, confidence scores, and regime stability characteristics rather than from point classification. The system integrates multiple data sources, including equity indices, volatility measures, sector ETFs, interest rates, and macroeconomic indicators, into a single regime classification framework.

The system employs three models using three core algorithms:

MODEL 1: Macro Regime Detection (Layer 1)

  • Uses PCA + KMeans to identify longer-term market regimes
  • Processes macro-level features (5d-30d returns, economic indicators)

MODEL 2: Daily Regime Detection (Layer 2, Step 1)

  • Uses PCA + KMeans to identify daily market regimes
  • Processes daily-level features (1d-5d returns, short-term indicators)

MODEL 3: XGBoost Prediction (Layer 2, Step 2)

  • Uses XGBoost to predict future regimes and transitions
  • Combines outputs from Model 1 and Model 2 with all features

System Architecture

The following diagram illustrates the three-model architecture and how they interact within the pipeline:

┌─────────────────────────────────────────────────────────────────────────┐
│                         RAW DATA SOURCES                                 │
│  (SPY, VIX, Sector ETFs, Economic Indicators, Interest Rates)          │
└──────────────────────────────┬──────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    FEATURE ENGINEERING                                   │
│  Returns | Volatility | VIX | MACD | ATR | Sector Correlation | Macro  │
└──────────┬──────────────────────────────────────────────────────────────┘
           │
           ├──────────────────────────────────┬──────────────────────────┐
           │                                  │                          │
           ▼                                  ▼                          │
┌──────────────────────────────┐  ┌──────────────────────────────────────┐
│  MODEL 1: Macro Regime       │  │  MODEL 2: Daily Regime Detection    │
│  Detection (Layer 1)        │  │  (Layer 2, Step 1)                   │
│                              │  │                                      │
│  Input Features:             │  │  Input Features:                     │
│  • 5d, 10d, 20d, 30d returns │  │  • 1d, 2d, 3d, 5d returns           │
│  • 20d volatility            │  │  • 2d, 3d, 5d volatility             │
│  • VIX level & momentum      │  │  • VIX level                          │
│  • VIX percentile            │  │  • VIX percentiles (5d, 20d)        │
│  • ATR & percentile          │  │  • MACD histogram                    │
│  • MACD & signal              │  │  • ATR change & percentile           │
│  • Fed rate, inflation       │  │  • Returns skewness (5d, 10d)        │
│  • Returns skewness/kurtosis │  │  • Sector correlation change         │
│  • Sector correlation        │  │                                      │
│                              │  │                                      │
│  Processing Steps:           │  │  Processing Steps:                   │
│  1. StandardScaler           │  │  1. StandardScaler                  │
│  2. PCA (12 components, 96%) │  │  2. PCA (12 components, 96%)        │
│  3. K-Means (K=4)            │  │  3. K-Means (K=4)                    │
│                              │  │                                      │
│  Output:                     │  │  Output:                             │
│  • macro_regime (0,1,2,3)    │  │  • daily_regime (0,1,2,3)            │
└──────────┬───────────────────┘  └──────────┬───────────────────────────┘
           │                                  │
           │                                  │
           └──────────────┬───────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  MODEL 3: XGBoost Prediction (Layer 2, Step 2)                         │
│                                                                          │
│  Inputs:                                                                 │
│  • All daily features (from Model 2)                                    │
│  • macro_regime_change (derived from Model 1 output)                  │
│  • Regime signals (volatility, momentum, VIX)                          │
│  • Stability scores                                                      │
│  • Transition probabilities                                             │
│                                                                          │
│  Target Variables:                                                       │
│  • next_regime (shifted daily_regime from Model 2)                     │
│  • regime_transition (binary indicator)                                 │
│                                                                          │
│  Outputs:                                                                │
│  • Predicted next regime (0, 1, 2, or 3)                                │
│  • Transition probability                                               │
│  • Confidence scores                                                     │
│  • Regime probability distribution                                      │
│                                                                          │
│  Performance:                                                            │
│  • Next Regime Prediction Accuracy: 66.3%                               │
│  • Regime Transition Prediction Accuracy: 67.7%                         │
└─────────────────────────────────────────────────────────────────────────┘
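The Model 3 outputs listed in the diagram can be read off a class-probability matrix. The sketch below is illustrative only: `summarize_predictions` and the example numbers are hypothetical, and the implied transition probability is an assumption made here for illustration (the README describes a separate binary transition classifier).

```python
import numpy as np

def summarize_predictions(proba: np.ndarray, current_regime: np.ndarray):
    """Turn an (n_days, 4) regime probability matrix into summary outputs."""
    predicted = proba.argmax(axis=1)        # predicted next regime (0-3)
    confidence = proba.max(axis=1)          # probability of the predicted class
    # Assumption: probability of leaving today's regime, implied by the
    # multi-class distribution (not the separate binary transition model).
    stay = proba[np.arange(len(proba)), current_regime]
    implied_transition = 1.0 - stay
    return predicted, confidence, implied_transition

# Hypothetical probabilities for two days
proba = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
current = np.array([0, 2])
predicted, confidence, implied_transition = summarize_predictions(proba, current)
```

On the second day the model favors regime 1 while the current regime is 2, so the implied transition probability is high even though the confidence in any single class is moderate.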

Features

  • Multi-Asset Data Integration: Combines SPY (S&P 500), VIX (volatility), sector ETFs, economic indicators, and interest rates
  • Three-Algorithm Pipeline: PCA (dimensionality reduction) → KMeans (regime identification) → XGBoost (regime prediction)
  • Unsupervised Regime Identification: Uses PCA for dimensionality reduction followed by KMeans clustering to identify 4 distinct market regimes
  • Supervised Regime Prediction: Two XGBoost classifiers for next-regime and regime-transition prediction
  • Comprehensive Feature Engineering: 20+ engineered features including multi-timeframe returns, volatility measures, MACD signals, and sector correlations
  • Time-Series Validation: Proper train-test split to prevent look-ahead bias
  • Visualization: Comprehensive plots for data exploration, regime analysis, and model performance

Dataset

Time Period: February 2, 1999 to October 16, 2025
Frequency: Daily (weekdays only)
Sample Size: 6,719 trading days

Data Sources

  • SPY (S&P 500 ETF): Primary market indicator with OHLCV data
  • VIX (CBOE Volatility Index): Volatility and fear gauge
  • Sector ETFs: XLK, XLF, XLE, XLU, XLY, XLP, XLV, XLI - Sector rotation signals
  • Economic Indicators: CPI (inflation measure) from FRED
  • Interest Rates: Fed Funds Rate from FRED

Data Files

All data files are located in the data/ directory:

  • sp500_consolidated.csv - S&P 500 ETF data
  • vix_data.csv - VIX volatility index
  • sector_etfs_consolidated.csv - Sector ETF prices
  • economic_indicators_consolidated.csv - Economic indicators (CPI, etc.)
  • interest_rates_consolidated.csv - Interest rate data (Fed Funds Rate)

Methodology

1. Data Preparation

  • Loads and aligns multiple CSV files to a common daily index
  • Calculates engineered features:
    • Multi-timeframe returns (1d, 2d, 3d, 5d, 10d, 20d, 30d)
    • Volatility measures (20-day rolling standard deviation)
    • VIX features (level, momentum, percentiles)
    • Technical indicators (MACD, ATR, ATR percentiles)
    • Sector correlation spikes
    • Economic indicators (Fed Funds Rate, CPI)
    • Statistical measures (skewness, kurtosis)
  • Handles different data frequencies (daily vs. monthly) using forward fill
  • Creates lagged features to prevent data leakage
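A minimal sketch of these preparation steps, using synthetic data in place of the CSV files. The column names (`ret_5d`, `vol_20d`, `cpi`) and the one-day lag are assumptions for illustration; the notebook may name and lag its features differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for daily SPY closes and a monthly macro series
days = pd.bdate_range("2024-01-01", periods=60)
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(days)))), index=days)
cpi = pd.Series([300.0, 301.0, 302.0],
                index=pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]))

features = pd.DataFrame(index=days)
for n in (1, 2, 3, 5, 10, 20, 30):
    features[f"ret_{n}d"] = close.pct_change(n)             # n-day return
features["vol_20d"] = close.pct_change().rolling(20).std()   # 20-day rolling std
features["cpi"] = cpi.reindex(days, method="ffill")          # monthly -> daily

# Lag by one day so each row only uses information known the day before,
# then drop the warm-up rows that lack a full lookback window.
features = features.shift(1).dropna()
```

Forward fill carries the last published monthly value until the next release, which keeps the daily index complete without looking ahead.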

2. Regime Identification (Unsupervised)

This stage uses two algorithms working together:

  • Feature Engineering: Creates 20+ features from raw data
  • Standardization: StandardScaler for feature normalization
  • Algorithm 1 - PCA (Principal Component Analysis): Dimensionality reduction with 12 components (96.1% variance explained) to reduce feature space while preserving information
  • Algorithm 2 - KMeans Clustering: Identifies 4 distinct regime clusters from the PCA-reduced feature space
  • Regime Interpretation: Analyzes cluster characteristics to assign meaningful regime labels
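The standardize, reduce, cluster sequence can be sketched as a single scikit-learn pipeline. The random matrix `X` is a placeholder for the engineered features, and wrapping the steps in `make_pipeline` is a choice made here, not necessarily how the notebook is organized.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))   # placeholder for the engineered feature matrix

regime_pipeline = make_pipeline(
    StandardScaler(),                                   # zero mean, unit variance
    PCA(n_components=12),                               # 12 principal components
    KMeans(n_clusters=4, random_state=42, n_init=10),   # 4 regime clusters
)
regimes = regime_pipeline.fit_predict(X)   # one cluster label (0-3) per day
```

Cluster indices from KMeans are arbitrary, so the mapping from index to a named regime has to be assigned afterwards by inspecting each cluster.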

3. Regime Prediction (Supervised)

This stage uses Algorithm 3 - XGBoost with two specialized models:

  • Next Regime Prediction Model: XGBoost classifier to predict the next day's regime (4-class classification)
  • Transition Prediction Model: XGBoost classifier to predict regime transitions (binary classification)
  • Time-Series Split: 80/20 train-test split maintaining temporal order
  • Feature Importance: Analyzes which features drive regime predictions
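The two target variables and the temporal split can be sketched as follows. The short regime sequence is made up for illustration; `next_regime` and `regime_transition` follow the names in the architecture diagram.

```python
import pandas as pd

# Hypothetical sequence of daily regime labels from the clustering step
daily_regime = pd.Series([1, 1, 2, 2, 2, 3, 1, 1, 0, 0], name="daily_regime")

targets = pd.DataFrame({
    "daily_regime": daily_regime,
    "next_regime": daily_regime.shift(-1),   # tomorrow's regime
})
targets = targets.dropna().astype(int)       # the last day has no next label
targets["regime_transition"] = (
    targets["next_regime"] != targets["daily_regime"]
).astype(int)

# 80/20 split in time order: no shuffling, the test set is the most recent data
split = int(len(targets) * 0.8)
train, test = targets.iloc[:split], targets.iloc[split:]
```

Splitting by position rather than at random keeps every training row earlier than every test row.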

4. Validation & Analysis

  • Accuracy metrics for both classification tasks
  • Transition probability analysis
  • Regime stability analysis
  • Feature importance rankings
  • Temporal regime distribution analysis
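One way to carry out the transition and stability analysis is an empirical transition matrix. This is a sketch on a made-up label sequence; the variable names are hypothetical and the notebook's own analysis may differ.

```python
import pandas as pd

regimes = pd.Series([1, 1, 2, 2, 2, 3, 1, 1, 0, 0])   # hypothetical labels
cur = regimes.iloc[:-1]
nxt = regimes.shift(-1).dropna().astype(int)

# Row i, column j: share of days in regime i that were followed by regime j
matrix = pd.crosstab(cur, nxt, normalize="index")
matrix = matrix.reindex(index=range(4), columns=range(4), fill_value=0.0)

# Stability: probability of leaving each regime on the next day
exit_prob = pd.Series(1.0 - matrix.to_numpy().diagonal(), index=matrix.index)

# Overall share of days on which the regime changes
transition_rate = float((cur.to_numpy() != nxt.to_numpy()).mean())
```

A low exit probability marks a stable regime; a high one marks a regime that rarely persists.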

Identified Regimes

The system identifies four distinct market regimes:

  1. High Vol Bull (7.6% of days)

    • Elevated volatility (VIX >30) with positive returns
    • Crisis recovery periods
    • Transition probability: ~40%
  2. Strong Bull (46.0% of days)

    • Low volatility (VIX <20) with strong positive momentum
    • Typical bull market phases
    • Most stable regime (25.7% transition probability)
  3. Normal Market (35.7% of days)

    • Moderate volatility with mixed returns
    • Typical market conditions
    • Transition probability: ~35%
  4. High Vol Bear (10.7% of days)

    • High volatility (VIX >30) with negative returns
    • Crisis/panic periods (2008, 2020)
    • Most volatile regime (46.4% transition probability)
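Because KMeans returns arbitrary cluster indices, names like those above have to be assigned from cluster statistics. The sketch below uses the VIX thresholds quoted in the list (above 30, below 20); the `label` rule, the column names, and the sample values are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical per-day data with cluster assignments
df = pd.DataFrame({
    "regime": [0, 0, 1, 1, 2, 2, 3, 3],
    "vix": [32, 35, 14, 16, 22, 24, 38, 42],
    "ret_20d": [0.03, 0.05, 0.04, 0.02, 0.00, 0.01, -0.08, -0.06],
})

# Average characteristics of each cluster
profile = df.groupby("regime")[["vix", "ret_20d"]].mean()

def label(row: pd.Series) -> str:
    """Illustrative rule mapping a cluster profile to a regime name."""
    if row["vix"] > 30:
        return "High Vol Bull" if row["ret_20d"] > 0 else "High Vol Bear"
    if row["vix"] < 20 and row["ret_20d"] > 0:
        return "Strong Bull"
    return "Normal Market"

labels = profile.apply(label, axis=1)
```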

Results

Model Performance

  • Next Regime Prediction Accuracy: 66.3% (vs. 25% random baseline for 4-class problem)
  • Regime Transition Prediction Accuracy: 67.7%
  • Daily Transition Rate: 34.1% of days show regime changes
  • PCA Variance Explained: 96.1% with 12 components
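Besides the 25% random baseline, two naive reference points help put the accuracy figures in context: predicting that tomorrow's regime equals today's (persistence) and always predicting the most frequent regime (majority). The sketch below computes both on a made-up sequence; `naive_baselines` is a hypothetical helper. If the 34.1% daily transition rate also held on the test period (an assumption), the persistence rule would score about 1 - 0.341 = 65.9%.

```python
import numpy as np

def naive_baselines(regimes: np.ndarray) -> tuple[float, float]:
    """Accuracy of two naive next-regime predictors on a label sequence."""
    cur, nxt = regimes[:-1], regimes[1:]
    persistence = float((cur == nxt).mean())               # "same as today"
    majority = float(np.bincount(nxt).max() / len(nxt))    # most frequent class
    return persistence, majority

regimes = np.array([1, 1, 2, 2, 2, 3, 1, 1, 0, 0])   # hypothetical labels
persistence, majority = naive_baselines(regimes)
```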

Key Insights

  1. Regime Stability: Strong Bull is the most stable regime (25.7% transition probability), while High Vol Bear is the least stable (46.4%)
  2. Top Predictors: By feature importance, transition probability (48.2%), VIX signals (18.8%), and returns (7.5%) contribute most to the predictions
  3. Market Dynamics: 34.1% daily transition rate demonstrates high market dynamism
  4. Temporal Alignment: Regime distributions align with historical market events

Usage

Prerequisites

pip install pandas numpy matplotlib seaborn scikit-learn xgboost

Running the Analysis

  1. Ensure all data files are in the data/ directory
  2. Open Regime_detection.ipynb in Jupyter Notebook
  3. Run all cells sequentially

The notebook is organized into sections:

  • Section 1: Data loading and preparation
  • Section 2: Regime detection system
  • Section 3: Model training and evaluation
  • Section 4: Results visualization and analysis

Project Structure

Regime_prediction/
├── README.md                          # This file
├── Regime_detection.ipynb            # Main analysis notebook
└── data/                              # Data directory
    ├── sp500_consolidated.csv
    ├── vix_data.csv
    ├── sector_etfs_consolidated.csv
    ├── economic_indicators_consolidated.csv
    └── interest_rates_consolidated.csv

Key Features & Techniques

  • Feature Engineering: Multi-timeframe analysis, rolling statistics, percentile rankings
  • Algorithm 1 - PCA: Dimensionality reduction for efficient clustering (reduces 20+ features to 12 principal components)
  • Algorithm 2 - KMeans: Unsupervised clustering for regime discovery (identifies 4 distinct market regimes)
  • Algorithm 3 - XGBoost: Supervised learning for regime prediction (two models: next regime and transition prediction)
  • Time-Series Handling: Proper lagging, forward-filling, and temporal splits
  • Data Alignment: Handles mixed-frequency data (daily vs. monthly)

Applications

This regime prediction system enables:

  • Risk Management: Adjust position sizing based on current regime
  • Early Warning Signals: Detect regime transitions before they fully materialize
  • Sector Rotation: Align sector allocation with market regimes
  • Strategy Adaptation: Modify trading strategies based on regime characteristics
  • Portfolio Optimization: Regime-aware portfolio construction

Technical Details

Feature Set (20+ features)

  • Returns: 1-, 5-, 10-, 20-, and 30-day rolling returns
  • Volatility: 20-day rolling standard deviation
  • VIX: Level, momentum, percentile rank
  • Technical: MACD, MACD signal, MACD histogram, ATR, ATR percentile
  • Macro: Fed Funds Rate, CPI inflation
  • Statistical: Returns skewness, returns kurtosis
  • Sector: Sector correlation spike
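The technical features can be sketched with pandas. The README does not state the indicator parameters, so the conventional MACD spans (12/26/9), a 14-day simple-average ATR, and the percentile window are assumptions; all function names are hypothetical.

```python
import numpy as np
import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """MACD line, signal line, and histogram from exponential moving averages."""
    line = close.ewm(span=fast, adjust=False).mean() - close.ewm(span=slow, adjust=False).mean()
    sig = line.ewm(span=signal, adjust=False).mean()
    return line, sig, line - sig

def atr(high: pd.Series, low: pd.Series, close: pd.Series, window: int = 14) -> pd.Series:
    """Average true range as a simple rolling mean of the true range."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)
    return true_range.rolling(window).mean()

def rolling_percentile(series: pd.Series, window: int = 252) -> pd.Series:
    """Share of the trailing window that is at or below the latest value."""
    return series.rolling(window).apply(lambda w: (w <= w[-1]).mean(), raw=True)

# Synthetic prices for demonstration
rng = np.random.default_rng(1)
close = pd.Series(100 + np.cumsum(rng.normal(0, 1, 300)))
high, low = close + 1.0, close - 1.0
macd_line, macd_signal, macd_hist = macd(close)
atr_14 = atr(high, low, close)
atr_pct = rolling_percentile(atr_14, window=60)
```

The rolling percentile only looks backward, so it can be used as a feature without introducing look-ahead.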

Model Configuration

Algorithm 1 - PCA:

  • 12 principal components
  • 96.1% variance explained
  • Applied after StandardScaler normalization
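A component count like this is typically chosen from the cumulative explained variance. The sketch below finds the smallest number of components that reaches a 96% target on synthetic data; the data and the threshold variable are placeholders, so the resulting count will not match the 12 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 20-feature matrix driven by 5 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(1000, 20))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

target = 0.96
# Index of the first cumulative value >= target, converted to a count
n_components = int(np.searchsorted(cumulative, target) + 1)
```

scikit-learn can also do this selection directly by passing a fraction, for example `PCA(n_components=0.96)`.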

Algorithm 2 - KMeans:

  • 4 clusters (regimes)
  • random_state=42, n_init=10
  • Applied to PCA-reduced feature space

Algorithm 3 - XGBoost:

  • Next Regime Model: n_estimators=200, max_depth=6, learning_rate=0.1
  • Transition Model: n_estimators=200, max_depth=5, learning_rate=0.1
  • Time-series cross-validation with 5 folds
  • Train-Test Split: 80/20 maintaining temporal order
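The split and cross-validation settings above can be sketched with scikit-learn's `TimeSeriesSplit`. The index array `X` is a placeholder, and applying the five folds inside the 80% training portion is an assumption about how the two settings combine. The parameter dictionaries mirror the values listed above and could be passed to `xgboost.XGBClassifier(**params)`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hyperparameters as listed in the configuration above
NEXT_REGIME_PARAMS = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}
TRANSITION_PARAMS = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.1}

X = np.arange(100).reshape(-1, 1)    # placeholder: one row per trading day

# 80/20 hold-out in time order
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]

# 5 expanding-window folds inside the training portion:
# each validation block lies strictly after its training block
folds = list(TimeSeriesSplit(n_splits=5).split(X_train))
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in folds]
```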

Limitations & Future Work

  • Data Frequency: Monthly economic indicators limit daily granularity
  • Regime Count: Fixed at 4 regimes; could explore dynamic regime count
  • Feature Selection: Could benefit from automated feature selection
  • Model Ensemble: Could combine multiple models for improved accuracy
  • Real-Time Application: Would need streaming data pipeline for live predictions

Citation

If you use this work, please cite:

Regime Prediction System
Multi-Asset Market Regime Detection and Prediction
Dataset: 1999-2025, 6,719 trading days
Method: PCA (Dimensionality Reduction) + KMeans Clustering + XGBoost Classification

License

This project is provided as-is for educational purposes.

Author

sofianel8910


Note: The results suggest that historically consistent market regimes exhibit short-term predictability, which can serve as a foundation for regime-aware trading strategies and risk management frameworks.
