A comprehensive machine learning system for detecting and predicting market regime changes using multi-asset financial data. This project employs a three-algorithm pipeline: PCA (dimensionality reduction), KMeans (unsupervised clustering), and XGBoost (supervised learning) to identify distinct market regimes and forecast regime transitions.
This repository contains a complete regime detection and prediction pipeline that analyzes 26+ years of financial market data to identify four distinct market regimes and predict regime transitions with 67%+ accuracy. Accuracy is reported for reference; the practical value comes mainly from transition detection, confidence scores, and regime stability characteristics rather than from point classification. The system integrates multiple data sources, including equity indices, volatility measures, sector ETFs, and macroeconomic indicators, to create a robust regime classification framework.
The system employs three models using three core algorithms:
MODEL 1: Macro Regime Detection (Layer 1)
- Uses PCA + KMeans to identify longer-term market regimes
- Processes macro-level features (5d-30d returns, economic indicators)
MODEL 2: Daily Regime Detection (Layer 2, Step 1)
- Uses PCA + KMeans to identify daily market regimes
- Processes daily-level features (1d-5d returns, short-term indicators)
MODEL 3: XGBoost Prediction (Layer 2, Step 2)
- Uses XGBoost to predict future regimes and transitions
- Combines outputs from Model 1 and Model 2 with all features
The following diagram illustrates the three-model architecture and how they interact within the pipeline:
┌─────────────────────────────────────────────────────────────────────────┐
│ RAW DATA SOURCES │
│ (SPY, VIX, Sector ETFs, Economic Indicators, Interest Rates) │
└──────────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ FEATURE ENGINEERING │
│ Returns | Volatility | VIX | MACD | ATR | Sector Correlation | Macro │
└──────────┬──────────────────────────────────────────────────────────────┘
│
├──────────────────────────────────┬──────────────────────────┐
│ │ │
▼ ▼ │
┌──────────────────────────────┐ ┌──────────────────────────────────────┐
│ MODEL 1: Macro Regime │ │ MODEL 2: Daily Regime Detection │
│ Detection (Layer 1) │ │ (Layer 2, Step 1) │
│ │ │ │
│ Input Features: │ │ Input Features: │
│ • 5d, 10d, 20d, 30d returns │ │ • 1d, 2d, 3d, 5d returns │
│ • 20d volatility │ │ • 2d, 3d, 5d volatility │
│ • VIX level & momentum │ │ • VIX level │
│ • VIX percentile │ │ • VIX percentiles (5d, 20d) │
│ • ATR & percentile │ │ • MACD histogram │
│ • MACD & signal │ │ • ATR change & percentile │
│ • Fed rate, inflation │ │ • Returns skewness (5d, 10d) │
│ • Returns skewness/kurtosis │ │ • Sector correlation change │
│ • Sector correlation │ │ │
│ │ │ │
│ Processing Steps: │ │ Processing Steps: │
│ 1. StandardScaler │ │ 1. StandardScaler │
│ 2. PCA (12 components, 96%) │ │ 2. PCA (12 components, 96%) │
│ 3. K-Means (K=4) │ │ 3. K-Means (K=4) │
│ │ │ │
│ Output: │ │ Output: │
│ • macro_regime (0,1,2,3) │ │ • daily_regime (0,1,2,3) │
└──────────┬───────────────────┘ └──────────┬───────────────────────────┘
│ │
│ │
└──────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MODEL 3: XGBoost Prediction (Layer 2, Step 2) │
│ │
│ Inputs: │
│ • All daily features (from Model 2) │
│ • macro_regime_change (derived from Model 1 output) │
│ • Regime signals (volatility, momentum, VIX) │
│ • Stability scores │
│ • Transition probabilities │
│ │
│ Target Variables: │
│ • next_regime (shifted daily_regime from Model 2) │
│ • regime_transition (binary indicator) │
│ │
│ Outputs: │
│ • Predicted next regime (0, 1, 2, or 3) │
│ • Transition probability │
│ • Confidence scores │
│ • Regime probability distribution │
│ │
│ Performance: │
│ • Next Regime Prediction Accuracy: 66.3% │
│ • Regime Transition Prediction Accuracy: 67.7% │
└─────────────────────────────────────────────────────────────────────────┘
- Multi-Asset Data Integration: Combines SPY (S&P 500), VIX (volatility), sector ETFs, economic indicators, and interest rates
- Three-Algorithm Pipeline: PCA (dimensionality reduction) → KMeans (regime identification) → XGBoost (regime prediction)
- Unsupervised Regime Identification: Uses PCA for dimensionality reduction followed by KMeans clustering to identify 4 distinct market regimes
- Supervised Regime Prediction: Two XGBoost classifiers for next-regime and regime-transition prediction
- Comprehensive Feature Engineering: 20+ engineered features including multi-timeframe returns, volatility measures, MACD signals, and sector correlations
- Time-Series Validation: Proper train-test split to prevent look-ahead bias
- Visualization: Comprehensive plots for data exploration, regime analysis, and model performance
Time Period: February 2, 1999 to October 16, 2025
Frequency: Daily (weekdays only)
Sample Size: 6,719 trading days
- SPY (S&P 500 ETF): Primary market indicator with OHLCV data
- VIX (CBOE Volatility Index): Volatility and fear gauge
- Sector ETFs: XLK, XLF, XLE, XLU, XLY, XLP, XLV, XLI - Sector rotation signals
- Economic Indicators: CPI (inflation measure) from FRED
- Interest Rates: Fed Funds Rate from FRED
All data files are located in the data/ directory:
- sp500_consolidated.csv - S&P 500 ETF data
- vix_data.csv - VIX volatility index
- sector_etfs_consolidated.csv - Sector ETF prices
- economic_indicators_consolidated.csv - Economic indicators (CPI, etc.)
- interest_rates_consolidated.csv - Interest rate data (Fed Funds Rate)
- Loads and aligns multiple CSV files to a common daily index
- Calculates engineered features:
  - Multi-timeframe returns (1d, 2d, 3d, 5d, 10d, 20d, 30d)
  - Volatility measures (20-day rolling standard deviation)
  - VIX features (level, momentum, percentiles)
  - Technical indicators (MACD, ATR, ATR percentiles)
  - Sector correlation spikes
  - Economic indicators (Fed Funds Rate, CPI)
  - Statistical measures (skewness, kurtosis)
- Handles different data frequencies (daily vs. monthly) using forward fill
- Creates lagged features to prevent data leakage
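The feature steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's code: the function name `build_features`, the column names, and the choice to lag every feature by exactly one day are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def build_features(close: pd.Series, monthly_cpi: pd.Series) -> pd.DataFrame:
    """Build a small subset of the listed features from a daily close series."""
    feats = pd.DataFrame(index=close.index)
    # Multi-timeframe returns: percentage change over each horizon.
    for horizon in (1, 2, 3, 5, 10, 20, 30):
        feats[f"ret_{horizon}d"] = close.pct_change(horizon)
    # 20-day rolling standard deviation of daily returns.
    feats["vol_20d"] = feats["ret_1d"].rolling(20).std()
    # Monthly series aligned to the daily index by forward fill.
    feats["cpi"] = monthly_cpi.reindex(close.index, method="ffill")
    # Lag all features by one day (an assumption) so each row only uses
    # information available before that day, then drop warm-up rows.
    return feats.shift(1).dropna()

# Synthetic inputs standing in for the SPY close and the monthly CPI series.
rng = np.random.default_rng(0)
days = pd.bdate_range("2020-01-01", periods=120)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(days)))), index=days)
monthly_cpi = pd.Series(
    [100.0, 100.2, 100.1, 100.4, 100.6, 100.8],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)
features = build_features(close, monthly_cpi)
```

Forward fill only carries the last published monthly value forward, so it does not by itself account for publication delays in series such as CPI.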
This stage uses two algorithms working together:
- Feature Engineering: Creates 20+ features from raw data
- Standardization: StandardScaler for feature normalization
- Algorithm 1 - PCA (Principal Component Analysis): Dimensionality reduction with 12 components (96.1% variance explained) to reduce feature space while preserving information
- Algorithm 2 - KMeans Clustering: Identifies 4 distinct regime clusters from the PCA-reduced feature space
- Regime Interpretation: Analyzes cluster characteristics to assign meaningful regime labels
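A minimal sketch of the StandardScaler, PCA, and KMeans chain described above, using scikit-learn. The feature matrix here is random noise standing in for the engineered features, and the step names inside the `Pipeline` are illustrative; only the component count, cluster count, `random_state`, and `n_init` come from this README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a matrix of 20 engineered features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))

regime_pipeline = Pipeline([
    ("scale", StandardScaler()),                                   # zero mean, unit variance
    ("pca", PCA(n_components=12)),                                 # keep 12 components
    ("kmeans", KMeans(n_clusters=4, random_state=42, n_init=10)),  # 4 regime clusters
])

# One integer label (0-3) per row; cluster ids carry no meaning until
# the cluster characteristics are inspected and named.
regimes = regime_pipeline.fit_predict(X)
explained = regime_pipeline.named_steps["pca"].explained_variance_ratio_.sum()
```

On random data the explained variance will be well below the 96.1% reported for the real features. Fitting the scaler, PCA, and KMeans on the training window only, then calling `predict` on later data, is one way to keep the labels free of look-ahead.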
This stage uses Algorithm 3 - XGBoost with two specialized models:
- Next Regime Prediction Model: XGBoost classifier to predict the next day's regime (4-class classification)
- Transition Prediction Model: XGBoost classifier to predict regime transitions (binary classification)
- Time-Series Split: 80/20 train-test split maintaining temporal order
- Feature Importance: Analyzes which features drive regime predictions
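The two targets and the temporal split can be sketched as below. The helper names `make_targets` and `temporal_split` are hypothetical, and defining a transition as "next day's regime differs from today's" is an assumption consistent with the diagram's description of `next_regime` as a shifted `daily_regime`.

```python
import pandas as pd

def make_targets(daily_regime: pd.Series) -> pd.DataFrame:
    """Derive next_regime and regime_transition from a regime label series."""
    out = pd.DataFrame({"regime": daily_regime})
    out["next_regime"] = out["regime"].shift(-1)
    # 1 when tomorrow's regime differs from today's, else 0.
    out["regime_transition"] = (out["next_regime"] != out["regime"]).astype(int)
    # The last day has no next-day label, so it is dropped.
    out = out.iloc[:-1].copy()
    out["next_regime"] = out["next_regime"].astype(int)
    return out

def temporal_split(frame: pd.DataFrame, train_frac: float = 0.8):
    """Split by position so every training row precedes every test row."""
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy regime sequence for illustration only.
targets = make_targets(pd.Series([0, 0, 1, 1, 1, 2, 0, 0, 3, 3, 3]))
train, test = temporal_split(targets)
```

Splitting by position rather than shuffling keeps the test period strictly after the training period.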
- Accuracy metrics for both classification tasks
- Transition probability analysis
- Regime stability analysis
- Feature importance rankings
- Temporal regime distribution analysis
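When reading the accuracy metrics, it can help to compare them with a naive persistence baseline that predicts "tomorrow's regime equals today's". The sketch below is an illustration, not part of the notebook, and the function names are hypothetical. By construction this baseline scores one minus the transition rate on whatever sample it is evaluated on.

```python
import numpy as np

def accuracy(y_true, y_pred) -> float:
    """Fraction of positions where the prediction matches the label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def persistence_baseline_accuracy(regimes) -> float:
    """Accuracy of predicting each day's regime as the previous day's regime."""
    regimes = np.asarray(regimes)
    return accuracy(regimes[1:], regimes[:-1])

# Toy regime sequence for illustration only.
regimes = [0, 0, 1, 1, 1, 2, 0, 0, 3, 3, 3]
baseline = persistence_baseline_accuracy(regimes)
transition_rate = 1.0 - baseline
```

If the 34.1% daily transition rate reported in this README also held on the test window, the baseline would score about 65.9%; whether it does is not stated here, so the comparison is best recomputed on the actual test split.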
The system identifies four distinct market regimes:
- High Vol Bull (7.6% of days)
  - Elevated volatility (VIX >30) with positive returns
  - Crisis recovery periods
  - Transition probability: ~40%
- Strong Bull (46.0% of days)
  - Low volatility (VIX <20) with strong positive momentum
  - Typical bull market phases
  - Most stable regime (25.7% transition probability)
- Normal Market (35.7% of days)
  - Moderate volatility with mixed returns
  - Typical market conditions
  - Transition probability: ~35%
- High Vol Bear (10.7% of days)
  - High volatility (VIX >30) with negative returns
  - Crisis/panic periods (2008, 2020)
  - Most volatile regime (46.4% transition probability)
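Per-regime transition probabilities like those quoted above can be estimated from the sequence of daily regime labels. The sketch below assumes "transition probability" means the chance of being in a different regime on the next day; the function names and the toy sequence are illustrative.

```python
import numpy as np

def transition_matrix(regimes, n_regimes: int = 4) -> np.ndarray:
    """Row-normalized counts of moves from regime i (row) to regime j (column)."""
    regimes = np.asarray(regimes)
    counts = np.zeros((n_regimes, n_regimes))
    # Count each consecutive (today, tomorrow) pair.
    np.add.at(counts, (regimes[:-1], regimes[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observations stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def exit_probability(matrix: np.ndarray) -> np.ndarray:
    """Probability of leaving each regime on the next day; NaN if never visited."""
    visited = matrix.sum(axis=1) > 0
    return np.where(visited, 1.0 - np.diag(matrix), np.nan)

# Toy regime sequence for illustration only.
regimes = [0, 0, 1, 1, 1, 2, 0, 0, 3, 3, 3]
P = transition_matrix(regimes)
leave = exit_probability(P)
```

A low exit probability corresponds to a stable regime, which is the sense in which Strong Bull is described as the most stable.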
- Next Regime Prediction Accuracy: 66.3% (vs. 25% random baseline for 4-class problem)
- Regime Transition Prediction Accuracy: 67.7%
- Daily Transition Rate: 34.1% of days show regime changes
- PCA Variance Explained: 96.1% with 12 components
- Regime Stability: Strong Bull regime is most stable, while High Vol Bear is most volatile
- Top Predictors: By feature importance, transition probability (48.2%), VIX signals (18.8%), and returns (7.5%) are the key features
- Market Dynamics: 34.1% daily transition rate demonstrates high market dynamism
- Temporal Alignment: Regime distributions align with historical market events
- Install the dependencies: pip install pandas numpy matplotlib seaborn scikit-learn xgboost
- Ensure all data files are in the data/ directory
- Open Regime_detection.ipynb in Jupyter Notebook
- Run all cells sequentially
The notebook is organized into sections:
- Section 1: Data loading and preparation
- Section 2: Regime detection system
- Section 3: Model training and evaluation
- Section 4: Results visualization and analysis
Regime_prediction/
├── README.md                  # This file
├── Regime_detection.ipynb     # Main analysis notebook
└── data/                      # Data directory
    ├── sp500_consolidated.csv
    ├── vix_data.csv
    ├── sector_etfs_consolidated.csv
    ├── economic_indicators_consolidated.csv
    └── interest_rates_consolidated.csv
- Feature Engineering: Multi-timeframe analysis, rolling statistics, percentile rankings
- Algorithm 1 - PCA: Dimensionality reduction for efficient clustering (reduces 20+ features to 12 principal components)
- Algorithm 2 - KMeans: Unsupervised clustering for regime discovery (identifies 4 distinct market regimes)
- Algorithm 3 - XGBoost: Supervised learning for regime prediction (two models: next regime and transition prediction)
- Time-Series Handling: Proper lagging, forward-filling, and temporal splits
- Data Alignment: Handles mixed-frequency data (daily vs. monthly)
This regime prediction system enables:
- Risk Management: Adjust position sizing based on current regime
- Early Warning Signals: Detect regime transitions before they fully materialize
- Sector Rotation: Align sector allocation with market regimes
- Strategy Adaptation: Modify trading strategies based on regime characteristics
- Portfolio Optimization: Regime-aware portfolio construction
- Returns: 1-, 5-, 10-, 20-, and 30-day rolling returns
- Volatility: 20-day rolling standard deviation
- VIX: Level, momentum, percentile rank
- Technical: MACD, MACD signal, MACD histogram, ATR, ATR percentile
- Macro: Fed Funds Rate, CPI inflation
- Statistical: Returns skewness, returns kurtosis
- Sector: Sector correlation spike
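The technical features in this list can be computed with pandas alone. The sketch below is illustrative: the README does not state the indicator parameters, so the common MACD defaults (12, 26, 9), a 14-day simple-average ATR, and a 252-day percentile window are assumptions, as are the function names.

```python
import numpy as np
import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line, signal line, and histogram from exponential moving averages."""
    line = close.ewm(span=fast, adjust=False).mean() - close.ewm(span=slow, adjust=False).mean()
    signal_line = line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({"macd": line, "macd_signal": signal_line,
                         "macd_hist": line - signal_line})

def atr(high: pd.Series, low: pd.Series, close: pd.Series, window: int = 14) -> pd.Series:
    """Average true range as a simple rolling mean of the true range."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)
    return true_range.rolling(window).mean()

def rolling_percentile(series: pd.Series, window: int = 252) -> pd.Series:
    """Share of the trailing window that is at or below the latest value."""
    return series.rolling(window).apply(lambda w: (w <= w[-1]).mean(), raw=True)

# Synthetic OHLC-like data for illustration only.
rng = np.random.default_rng(1)
close = pd.Series(100 + np.cumsum(rng.normal(0, 1, 300)))
high = close + rng.uniform(0.1, 1.0, 300)
low = close - rng.uniform(0.1, 1.0, 300)

macd_frame = macd(close)
atr_series = atr(high, low, close)
atr_pct = rolling_percentile(atr_series, window=60)
```

The same `rolling_percentile` helper could be applied to the VIX level to obtain a percentile rank, under the same assumption about the window length.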
Algorithm 1 - PCA:
- 12 principal components
- 96.1% variance explained
- Applied after StandardScaler normalization
Algorithm 2 - KMeans:
- 4 clusters (regimes)
- random_state=42, n_init=10
- Applied to PCA-reduced feature space
Algorithm 3 - XGBoost:
- Next Regime Model: n_estimators=200, max_depth=6, learning_rate=0.1
- Transition Model: n_estimators=200, max_depth=5, learning_rate=0.1
- Time-series cross-validation with 5 folds
- Train-Test Split: 80/20 maintaining temporal order
- Data Frequency: Monthly economic indicators limit daily granularity
- Regime Count: Fixed at 4 regimes; could explore dynamic regime count
- Feature Selection: Could benefit from automated feature selection
- Model Ensemble: Could combine multiple models for improved accuracy
- Real-Time Application: Would need streaming data pipeline for live predictions
If you use this work, please cite:
Regime Prediction System
Multi-Asset Market Regime Detection and Prediction
Dataset: 1999-2025, 6,719 trading days
Method: PCA (Dimensionality Reduction) + KMeans Clustering + XGBoost Classification
This project is provided as-is for educational purposes.
sofianel8910
Note: The results suggest that historically consistent market regimes exhibit short-term predictability, providing a foundation for regime-aware trading strategies and risk management frameworks.