A machine learning project to predict stock prices using historical stock data with sliding window features and walk-forward validation.
This project processes historical stock market data and prepares it for ML model training:
- Raw Data: Stock prices, fundamentals, and securities metadata from Kaggle
- Processing: Creates sliding windows (2-5 days) of features for each stock
- Validation: Walk-forward cross-validation (5 folds) for time-series data
- Normalization: StandardScaler pipeline for feature normalization
- Goal: Train ML models to predict next-day closing price
data/
├── raw/ # Kaggle CSV files (uploaded to git)
│   ├── prices.csv
│   ├── prices-split-adjusted.csv
│   ├── fundamentals.csv
│   └── securities.csv
├── naive_processed/ # Simple processing (ChatGPT baseline)
│   └── window_*.csv
├── full_processed/ # Full walk-forward validation
│   ├── X_train_window_*_fold_*.csv
│   ├── X_test_window_*_fold_*.csv
│   ├── X_eval_window_*.csv
│   ├── y_train_window_*_fold_*.csv
│   ├── y_test_window_*_fold_*.csv
│   └── y_eval_window_*.csv
└── normalized/ # Normalized data (StandardScaler)
    └── window_*/
        ├── fold_*/
        │   ├── X_train.csv, X_test.csv, y_train.csv, y_test.csv
        │   └── scaler_pipeline.pkl
        ├── X_eval.csv
        └── y_eval.csv
git clone <repo-url>
cd cmpe_257_project
Install required packages via conda:
conda install numpy pandas python-dateutil pytz six tzdata scikit-learn matplotlib -y
Or use pip if you prefer:
pip install -r requirements.txt
Verify the installation:
python -c "import pandas, numpy, sklearn; print('All packages imported successfully!')"
Note: If you see ModuleNotFoundError, make sure the conda environment is activated:
unalias python # Remove any python aliases
conda activate cmpe257
You can either run each step manually (below) or use the automation helpers introduced in this update:
- run_all_linear.sh: loops through all windows 2-5 and folds 0-4 with scripts/train_linear.py.
- Makefile: provides shortcuts such as make normalize, make train_linear, make aggregate, make report, and make full (runs the complete chain).
Make helper scripts executable once:
chmod +x scripts/*.py run_all_linear.sh
python scripts/initial_data_exploration.py
This prints dataset info (columns, dtypes, missing values) for all CSV files in data/raw/.
python scripts/process_data_full.py
What it does:
- Loads raw prices data
- Splits into train/eval by date (80/20)
- Creates sliding windows (2-5 days) for each company
- Applies walk-forward validation (5 folds)
- Generates features: open, close, low, high, volume at each time step
- Saves to data/full_processed/
Duration: ~5-10 minutes depending on machine
Output: Separate train/test/eval sets for each window size and fold
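The windowing and walk-forward steps above can be sketched roughly as follows. This is a minimal illustration and not the code in scripts/process_data_full.py: the helper names make_windows and walk_forward_folds are hypothetical, and the expanding-window fold layout is an assumption about how the five folds might be cut.

```python
import numpy as np
import pandas as pd

FEATURES = ["open", "close", "low", "high", "volume"]


def make_windows(prices: pd.DataFrame, window: int):
    """Build (X, y) for one company: `window` days of features -> next-day close.

    Assumes `prices` is already sorted by date and contains the FEATURES columns.
    """
    values = prices[FEATURES].to_numpy(dtype=float)
    close = prices["close"].to_numpy(dtype=float)
    X, y = [], []
    for end in range(window, len(prices)):
        # Flatten the `window` most recent rows into a single feature vector
        X.append(values[end - window:end].ravel())
        # Target is the close on the day right after the window
        y.append(close[end])
    return np.array(X), np.array(y)


def walk_forward_folds(n_samples: int, n_folds: int = 5):
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    Assumed layout: samples are cut into n_folds + 1 equal blocks;
    fold k trains on blocks 0..k and tests on block k + 1.
    """
    block = n_samples // (n_folds + 1)
    for k in range(n_folds):
        train_end = block * (k + 1)
        # The last fold absorbs any leftover samples
        test_end = block * (k + 2) if k < n_folds - 1 else n_samples
        yield np.arange(0, train_end), np.arange(train_end, test_end)
```

With a window of 3 and five features per day, each row of X has 15 columns. Because every test block lies strictly after its training block, no future prices leak into training.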
python scripts/build_pipeline.py
What it does:
- Loads processed data from data/full_processed/
- Creates StandardScaler for each fold
- Fits scaler on training data
- Normalizes train/test/eval sets
- Saves normalized data to data/normalized/
- Saves fitted scalers (scaler_pipeline.pkl) for later use
Duration: ~2-3 minutes
Output: Normalized datasets ready for model training
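The normalization step described above might look roughly like this for a single fold. This is a sketch under assumptions: normalize_fold is a hypothetical helper, only the features are scaled here (whether the real script also scales targets is not stated), and the actual scripts/build_pipeline.py may differ.

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def normalize_fold(X_train, X_test, X_eval, out_dir):
    """Fit a scaler on the training split only, then apply it to every split."""
    pipeline = Pipeline([("scaler", StandardScaler())])
    # Mean and standard deviation are estimated from training data only
    X_train_n = pipeline.fit_transform(X_train)
    # Test and eval reuse the training statistics, which avoids leakage
    X_test_n = pipeline.transform(X_test)
    X_eval_n = pipeline.transform(X_eval)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Persist the fitted pipeline so the same scaling can be applied later
    with open(out_dir / "scaler_pipeline.pkl", "wb") as fh:
        pickle.dump(pipeline, fh)
    return X_train_n, X_test_n, X_eval_n
```

Fitting on the training split alone is the important design choice: statistics computed over test or eval rows would leak future information into the model inputs.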
python scripts/full_data_exploration.py
This prints statistics on the processed/normalized data and generates visualization plots.
./run_all_linear.sh
# or equivalently
make train_linear
This script calls scripts/train_linear.py for every window/fold combination and stores:
- Models in models/linear_regression_* (or ridge variants)
- Metrics JSON files in artifacts/metrics_window_*_fold_*.json
- Prediction plots in reports/figs/
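For orientation, training one window/fold split and writing its metrics file could be sketched as below. The function name train_and_score and the keys inside the JSON are assumptions; only the file name pattern metrics_window_*_fold_*.json comes from this README, and the real scripts/train_linear.py may record different fields.

```python
import json
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score


def train_and_score(X_train, y_train, X_test, y_test, window, fold,
                    alpha=1.0, out_dir="artifacts"):
    """Fit a Ridge model on one split and save its test metrics as JSON."""
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    pred = model.predict(X_test)
    metrics = {
        "window": window,
        "fold": fold,
        "model": "ridge",
        "alpha": alpha,
        # RMSE is the square root of the mean squared error
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
    }
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"metrics_window_{window}_fold_{fold}.json"
    path.write_text(json.dumps(metrics, indent=2))
    return model, metrics
```

Writing one small JSON file per split keeps each run independent, so a later aggregation step can collect whatever files exist.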
python scripts/train_tree_nn.py --window 3 --fold 0 --model rf --params '{"n_estimators":500}'
Use --model gbr or --model mlp with appropriate JSON parameters to train additional regressors on any window/fold.
python scripts/aggregate_metrics.py # writes reports/metrics_summary.csv
python scripts/report_plots.py # creates reports/model_report.md + RMSE plot
To tune hyperparameters on any split:
python scripts/grid_search.py --window 3 --fold 0 --model ridge --param-grid '{"alpha":[0.1,1,10]}'
python scripts/process_data_naive.py
This creates a simple processed dataset (without walk-forward validation) for quick experiments and comparison.
Use the orchestrator in src/main.py to train every model listed in a YAML config and save detailed results under results/.
# Baseline linear + polynomial regression
python -m src.main --config configs/baseline.yaml
# Advanced XGBoost + LSTM experiments and other models (saves fitted models)
python -m src.main --config configs/all_models.yaml.yaml
Config anatomy:
- config_name: label for the experiment folder inside results/.
- windows: sliding-window sizes to iterate over.
- models: collection of {name, params} entries. Available names now include linear_regression, polynomial_regression, random_forest_regressor, gradient_boosting_regressor, mlp_regressor, xgboost_regressor, and lstm_regressor.
- save_models: toggle persistence of trained estimators.
Add more configs (e.g., configs/<experiment>.yaml) to sweep different hyperparameters or estimators. The main function picks them up automatically once the model is registered in models/__init__.py.
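To make the config anatomy concrete, here is a hedged sketch of how such a config could expand into individual training runs. The dict stands in for what a YAML loader would return; the values (example_sweep, the n_estimators setting) and the helper expand_runs are hypothetical, and src/main.py may organize this differently.

```python
from itertools import product

# Hypothetical config, written as the dict a YAML loader would return
config = {
    "config_name": "example_sweep",
    "windows": [2, 3],
    "models": [
        {"name": "linear_regression", "params": {}},
        {"name": "random_forest_regressor", "params": {"n_estimators": 200}},
    ],
    "save_models": True,
}

# Assumption: these three keys are mandatory, save_models is optional
REQUIRED_KEYS = {"config_name", "windows", "models"}


def expand_runs(cfg):
    """Return one (window, model_name, params) tuple per training run."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return [
        (window, model["name"], model.get("params", {}))
        for window, model in product(cfg["windows"], cfg["models"])
    ]


for window, name, params in expand_runs(config):
    print(f"window={window} model={name} params={params}")
```

Each window size is paired with every model entry, so two windows and two models yield four runs.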
chmod +x scripts/*.py run_all_linear.sh
make full # runs normalize -> train_linear -> train_extra -> aggregate -> report
scripts/
├── initial_data_exploration.py # Explore raw CSV files
├── process_data_full.py # Full processing with walk-forward validation
├── process_data_naive.py # Simple baseline processing
├── full_data_exploration.py # Explore processed data
├── build_pipeline.py # Normalize data with StandardScaler
├── train_linear.py # Ridge / LinearRegression baseline trainer
├── train_tree_nn.py # RandomForest / GradientBoosting / MLP trainer
├── aggregate_metrics.py # Combine metrics JSON files into CSV summary
├── report_plots.py # Generate RMSE comparison plot + markdown
└── grid_search.py # Hyperparameter sweeps per window/fold
data/
├── raw/ # Raw Kaggle data
├── naive_processed/ # Simple processed data
├── full_processed/ # Walk-forward validation data
└── normalized/ # Normalized data for model training
experiments/ # Model training scripts (to be added)
reports/ # Results and analysis
Once normalized data is ready (data/normalized/), you can:
- Train baseline models (Linear Regression, Random Forest, etc.)
  - ./run_all_linear.sh or make train_linear
  - python scripts/train_tree_nn.py --model rf|gbr|mlp ...
- Use cross-validation / grid search to tune hyperparameters
  - python scripts/grid_search.py --model ridge --param-grid '{"alpha":[0.1,1,10]}'
- Aggregate and visualize metrics:
  - python scripts/aggregate_metrics.py
  - python scripts/report_plots.py
- Generate graphs showing:
  - Parameter tuning results
  - Model performance on eval set
  - Predictions vs actual prices
- Compare models and document results using reports/model_report.md
Use the normalized data structure:
import pandas as pd
from pathlib import Path
window_size = 3
fold = 0
# Fold-specific splits live in window_<n>/fold_<k>/; the eval split is shared per window
window_dir = Path("data/normalized") / f"window_{window_size}"
fold_dir = window_dir / f"fold_{fold}"
X_train = pd.read_csv(fold_dir / "X_train.csv")
y_train = pd.read_csv(fold_dir / "y_train.csv")
X_test = pd.read_csv(fold_dir / "X_test.csv")
y_test = pd.read_csv(fold_dir / "y_test.csv")
X_eval = pd.read_csv(window_dir / "X_eval.csv")
y_eval = pd.read_csv(window_dir / "y_eval.csv")
# Your model training here...
- numpy - Numerical computing
- pandas - Data manipulation
- scikit-learn - ML preprocessing and models
- matplotlib - Plotting and visualization
- xgboost - Gradient-boosted trees (macOS users: brew install libomp if you hit runtime loader errors)
- torch - Needed for lstm_regressor (install a wheel compatible with your Python version or build from source)
- python-dateutil, pytz, tzdata - Date/time handling
See requirements.txt for specific versions.
Issue: ModuleNotFoundError: No module named 'pandas'
- Solution: Ensure conda environment is activated and packages are installed
conda activate cmpe257
conda install pandas -y
Issue: Scripts take too long
- Solution: This is normal for large datasets. Process data once, then reuse normalized sets.
Issue: Out of memory errors
- Solution: The full dataset is large. If needed, modify scripts to process in batches or reduce data sample.
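One way to process in batches is to read the CSV in chunks and keep only the rows you need. The sketch below is illustrative only: load_prices_in_chunks is a hypothetical helper, and the column name symbol is an assumption about the layout of the prices file.

```python
import pandas as pd


def load_prices_in_chunks(csv_path, symbols=None, chunksize=100_000,
                          symbol_col="symbol"):
    """Read a large CSV chunk by chunk, keeping only the requested symbols.

    `symbol_col` is an assumed column name; adjust it to match the file.
    """
    kept = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        if symbols is not None:
            # Filter each chunk before storing it, so memory use stays bounded
            chunk = chunk[chunk[symbol_col].isin(symbols)]
        kept.append(chunk)
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)
```

For example, load_prices_in_chunks("data/raw/prices.csv", symbols=["AAPL"]) would hold only one company's rows in memory at the end, at the cost of a full pass over the file.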
- Cale Payson, Sonali Lonkar, Ramya Gopalaswamy, Pranav Sehgal
- Repository: CMPE 257 Course Project
- Data source: Kaggle
- Processing: Custom walk-forward validation pipeline
- Framework: scikit-learn for preprocessing and scaling