📈 Stock Market Prediction using Big Data & Sentiment Analysis

Real-time stock market prediction system combining Reddit sentiment analysis with financial data using modern big data technologies.



🎯 Overview

This project implements an end-to-end big data pipeline for stock price prediction by analyzing social media sentiment from Reddit's r/wallstreetbets community. The system processes historical data spanning September 29, 2020 to August 16, 2021, covering the GameStop short squeeze and the broader meme stock phenomenon, and uses machine learning models to predict stock movements.

Key Features

  • 🔄 Real-time data streaming using Kafka
  • ⚡ Distributed processing with Apache Spark
  • 🧠 Multiple ML models (Baseline, LSTM, Linear Regression, XGBoost)
  • 📊 Interactive dashboard with live predictions
  • 🎯 Sentiment analysis from Reddit posts
  • 📈 Technical indicators and feature engineering
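
To give a flavor of the sentiment-scoring idea, here is a minimal lexicon-based scorer. This is an illustrative sketch only: the word lists are invented for the example, and the actual pipeline may rely on a dedicated sentiment library instead.

```python
# Illustrative lexicon-based sentiment scorer (not the project's actual
# implementation; the word lists below are example assumptions).
POSITIVE = {"moon", "rocket", "buy", "bullish", "gain", "up", "calls"}
NEGATIVE = {"sell", "bearish", "loss", "down", "puts", "crash", "drop"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]: +1 all-positive, -1 all-negative, 0 neutral."""
    tokens = text.lower().split()
    pos = sum(t.strip(".,!?") in POSITIVE for t in tokens)
    neg = sum(t.strip(".,!?") in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

A post like "GME to the moon, buy calls!" scores positive, while "massive loss, sell everything" scores negative.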

Dataset Period

September 29, 2020 - August 16, 2021 (covering the GameStop squeeze and broader meme stock phenomenon)

Target Stocks

GME, AMC, TSLA, AAPL, BB, NOK, PLTR, SPCE


πŸ—οΈ Architecture

System Overview

```mermaid
graph TB
    subgraph "Data Layer"
        A1[Historical Reddit Data<br/>~1M posts]
        A2[Stock Prices<br/>Stooq API]
    end

    subgraph "Ingestion"
        B1[Data Splitter]
        B2[Training/Simulation Split]
    end

    subgraph "Streaming - Kafka"
        C1[Topic: reddit-data]
        C2[Topic: stock-data]
        C3[Zookeeper]
        C4[Data Replayer]
    end

    subgraph "Processing - Spark"
        D1[Spark Consumer]
        D2[Reddit Pipeline]
        D3[Stock Pipeline]
        D4[Data Fusion]
    end

    subgraph "Storage"
        E1[(MongoDB)]
        E2[Collections:<br/>reddit_raw<br/>stock_raw<br/>reddit_features_15m<br/>predictions]
    end

    subgraph "ML Layer"
        F1[Training Dataset Builder]
        F2[Baseline Model]
        F3[LSTM Model]
        F4[Linear Regression]
        F5[XGBoost Model]
        F6[MLflow Tracking]
    end

    subgraph "Service Layer"
        G1[Predictor Service]
        G2[Relayer Simulator]
        G3[Streamlit Dashboard]
    end

    A1 & A2 --> B1 --> B2
    B2 --> C4 --> C1 & C2
    C3 -.-> C1 & C2
    C1 --> D1 --> D2
    C2 --> D1 --> D3
    D2 & D3 --> D4 --> E1
    E1 --> F1
    F1 --> F2 & F3 & F4 & F5
    F2 & F3 & F4 & F5 -.-> F6
    E1 --> G1 --> G3
    G2 -.-> C4

    style C1 fill:#f3e5f5
    style E1 fill:#fff9c4
    style F5 fill:#c8e6c9
    style G3 fill:#bbdefb
```

Data Flow

```mermaid
flowchart LR
    A[Raw Data] --> B[Split Script]
    B --> C[Train Set<br/>Sep 2020-Mar 2021]
    B --> D[Simulate Set<br/>Apr-Aug 2021]

    C --> E[Training Pipeline]
    E --> F[Model Training]
    F --> G[Saved Models]

    D --> H[Kafka Producer]
    H --> I[Spark Consumer]
    I --> J[MongoDB]
    J --> K[Predictor Service]
    K --> L[Dashboard]

    G -.-> K

    style E fill:#e1f5dd
    style F fill:#fce4ec
    style K fill:#fff3e0
```

πŸ› οΈ Tech Stack

Big Data Technologies

  • Apache Kafka - Distributed streaming platform
  • Apache Spark - Distributed data processing
  • Zookeeper - Kafka coordination service
  • MongoDB - NoSQL database for time-series data

ML & Data Science

  • TensorFlow/Keras - Deep learning (LSTM models)
  • XGBoost - Gradient boosting for predictions
  • Scikit-learn - Classical ML algorithms
  • Pandas & NumPy - Data manipulation
  • MLflow - Model tracking and versioning

Orchestration & Deployment

  • Docker & Docker Compose - Containerization
  • Apache Airflow - Workflow orchestration
  • Streamlit - Interactive dashboard
  • Python 3.11 - Primary language

APIs & Data Sources

  • Stooq API - Historical stock prices
  • Reddit/Kaggle - r/wallstreetbets posts dataset

πŸ“ Project Structure

```text
Stockmarket-Bigdata-Project/
│
├── data/                           # Data directory (gitignored)
│   ├── raw/                        # Original datasets
│   │   ├── reddit_wsb.csv         # ~1M Reddit posts
│   │   └── stock_prices.csv       # Stock OHLCV data
│   ├── train/                      # Training data (Sep 2020-Mar 2021)
│   │   ├── reddit_train.csv
│   │   └── stock_train.csv
│   └── simulate/                   # Simulation data (Apr-Aug 2021)
│       ├── reddit_sim.csv
│       └── stock_sim.csv
│
├── data_collection/                # Data ingestion scripts
│   ├── download_finance_stooq.py  # Stock price downloader
│   ├── split_data.py              # Train/test splitter
│   ├── producer_training_data.py  # Kafka producer
│   ├── spark_consumer.py          # Spark streaming consumer
│   ├── read_kafka_messages.py     # Kafka debugging tool
│   ├── clean_kafka_topics.py      # Topic cleanup utility
│   ├── test_spark.py              # Spark connection test
│   └── verify_datasets.py         # Data validation
│
├── data_processing/                # ETL pipelines
│   ├── reddit_pipeline.py         # Reddit data cleaning
│   ├── stock_pipeline.py          # Stock data processing
│   └── build_training_dataset.py  # Feature engineering
│
├── ml_models/                      # Machine learning models
│   ├── 01_train_baseline_model.ipynb      # Baseline model
│   ├── 02_train_baseline_model.ipynb      # Improved baseline
│   ├── 03_train_LSTM_Model.ipynb          # LSTM deep learning
│   ├── 05_Linear_regression_model.ipynb   # Linear regression
│   ├── 06_XGBoost_model.ipynb             # XGBoost ensemble
│   ├── baseline_model.joblib              # Saved baseline
│   ├── price_predictor_v1.joblib          # Saved predictor v1
│   └── xgboost_reddit_stock_model.pkl     # Saved XGBoost
│
├── orchestration/                  # Application layer
│   ├── airflow_dags/              # Airflow DAG definitions
│   ├── app.py                     # Streamlit dashboard
│   ├── mongo.py                   # MongoDB utilities
│   ├── predictor_service.py       # ML inference service
│   ├── predit.py                  # Prediction helper
│   └── relayer_simulator.py       # Data replay service
│
├── volumes/                        # Docker persistent volumes
│   ├── airflow/
│   │   ├── logs/
│   │   └── plugins/
│   └── mlflow/
│
├── docker-compose.yml             # Service orchestration
├── init-kafka.sh                  # Kafka initialization
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```

🚀 Setup & Installation

Prerequisites

  • Docker & Docker Compose
  • Python 3.11+
  • 16GB RAM minimum (for Spark)
  • ~10GB disk space

1. Clone Repository

```bash
git clone <repository-url>
cd Stockmarket-Bigdata-Project
```

2. Install Python Dependencies

```bash
pip install -r requirements.txt
```

3. Download Datasets

Stock Data (automated):

```bash
python data_collection/download_finance_stooq.py
```

Reddit Data (manual):

  1. Download from Kaggle - Reddit WallStreetBets Posts
  2. Place reddit_wsb.csv in data/raw/

4. Verify Data

```bash
python data_collection/verify_datasets.py
```

5. Split Data (Train/Simulate)

```bash
python data_collection/split_data.py
```

6. Start Infrastructure

```bash
docker-compose up -d
```

7. Initialize Kafka Topics

```bash
chmod +x init-kafka.sh
./init-kafka.sh
```

8. Verify Services

```bash
# Check all containers are running
docker-compose ps

# Test Spark connection
python data_collection/test_spark.py
```

🌐 Service URLs

| Service   | URL                   | Credentials   | Purpose                   |
|-----------|-----------------------|---------------|---------------------------|
| Dashboard | http://localhost:8501 | -             | Interactive visualization |
| Airflow   | http://localhost:8081 | admin / admin | Workflow management       |
| MLflow    | http://localhost:5000 | -             | Model tracking            |
| Spark UI  | http://localhost:8080 | -             | Spark monitoring          |
| Kafka     | localhost:29092       | -             | Message broker            |
| MongoDB   | localhost:27017       | -             | Database                  |

📊 Data Pipeline

Step 1: Data Collection

```bash
# Download stock prices from Stooq
python data_collection/download_finance_stooq.py

# Verify dataset integrity
python data_collection/verify_datasets.py
```

Step 2: Data Splitting

```bash
# Split into training (Sep 2020-Mar 2021) and simulation (Apr-Aug 2021)
python data_collection/split_data.py
```
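
The core of this step is a date-based split. The sketch below shows the idea in pandas; the split date follows the periods documented above, but the column name is an assumption and `split_data.py` itself may differ.

```python
import pandas as pd

# Split a dataset chronologically into training and simulation sets.
# SPLIT_DATE follows the documented periods (train: Sep 2020-Mar 2021,
# simulate: Apr-Aug 2021); the "timestamp" column name is an assumption.
SPLIT_DATE = "2021-04-01"

def split_by_date(df: pd.DataFrame, date_col: str = "timestamp"):
    ts = pd.to_datetime(df[date_col])
    train = df[ts < SPLIT_DATE]
    simulate = df[ts >= SPLIT_DATE]
    return train, simulate
```

A chronological split (rather than a random one) is essential here: it prevents the models from training on information that postdates the simulated "live" period.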

Step 3: Start Streaming

```bash
# Start Kafka producer to replay historical data
python data_collection/producer_training_data.py

# Start Spark consumer in another terminal
python data_collection/spark_consumer.py
```
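
Conceptually, the producer replays rows from the simulation CSVs as JSON messages onto the Kafka topics. A minimal sketch (assuming the `kafka-python` client and the broker at `localhost:29092` from `docker-compose`; the field names are illustrative, not the project's actual schema):

```python
import json

def to_message(row: dict) -> bytes:
    """Serialize one CSV row (as a dict) into a JSON-encoded Kafka payload."""
    return json.dumps(row, default=str).encode("utf-8")

if __name__ == "__main__":
    # Requires kafka-python and a running broker (localhost:29092 per docker-compose).
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:29092")
    # Topic name matches the architecture diagram; the row is a made-up example.
    producer.send("stock-data", to_message({"ticker": "GME", "close": 180.5}))
    producer.flush()
```

The Spark consumer on the other side subscribes to the same topics, parses the JSON payloads, and writes the results into MongoDB.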

Step 4: Monitor Pipeline

```bash
# Read Kafka messages for debugging
python data_collection/read_kafka_messages.py

# Check MongoDB collections
python orchestration/mongo.py
```

🤖 ML Models

Trained Models

| Model             | File                             | Type          | Performance         |
|-------------------|----------------------------------|---------------|---------------------|
| Baseline          | `baseline_model.joblib`          | Random Forest | Baseline metrics    |
| Linear Regression | `price_predictor_v1.joblib`      | Linear        | Fast inference      |
| XGBoost           | `xgboost_reddit_stock_model.pkl` | Ensemble      | Best accuracy       |
| LSTM              | Notebook only                    | Deep Learning | Sequential patterns |

Training Pipeline

  1. Build the training dataset

     ```bash
     python data_processing/build_training_dataset.py
     ```

  2. Train the models (run the Jupyter notebooks)

     ```bash
     jupyter notebook ml_models/
     ```

     • `01_train_baseline_model.ipynb` - Random Forest baseline
     • `03_train_LSTM_Model.ipynb` - LSTM for time series
     • `05_Linear_regression_model.ipynb` - Linear regression
     • `06_XGBoost_model.ipynb` - XGBoost ensemble

  3. Track with MLflow - experiments are tracked automatically; view them at http://localhost:5000
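
As a sketch of what gets logged per model, the snippet below computes two typical evaluation metrics and sends them to the MLflow server from `docker-compose`. The metric choice here is illustrative, not necessarily the project's exact set.

```python
import math

def evaluate(y_true: list[float], y_pred: list[float]) -> dict:
    """RMSE plus directional accuracy (illustrative metric set)."""
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
    # Directional accuracy: fraction of steps where the predicted move
    # (relative to the previous true value) has the same sign as the actual move.
    hits = sum(
        (y_pred[i] - y_true[i - 1]) * (y_true[i] - y_true[i - 1]) > 0
        for i in range(1, len(y_true))
    )
    return {"rmse": rmse, "directional_accuracy": hits / (len(y_true) - 1)}

if __name__ == "__main__":
    import mlflow  # assumes the MLflow server from docker-compose at :5000
    mlflow.set_tracking_uri("http://localhost:5000")
    with mlflow.start_run(run_name="xgboost"):
        mlflow.log_metrics(evaluate([1.0, 1.1, 1.05], [1.0, 1.08, 1.09]))
```

Directional accuracy is worth tracking alongside RMSE because, for trading-style predictions, getting the direction of the move right often matters more than the exact price.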

Feature Engineering

  • Stock Features: OHLCV, returns, volatility, moving averages
  • Reddit Features: Post count, sentiment scores, engagement metrics
  • Time Features: Hour, day of week, market hours
  • Technical Indicators: RSI, MACD, Bollinger Bands
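
A few of the stock features above can be sketched in pandas as follows. The window sizes and column names are assumptions for the example; `build_training_dataset.py` may use different ones.

```python
import pandas as pd

def add_stock_features(df: pd.DataFrame, price_col: str = "close") -> pd.DataFrame:
    """Add returns, a moving average, rolling volatility, and RSI.

    Window sizes are illustrative assumptions, except RSI's conventional 14 periods.
    """
    out = df.copy()
    out["return_1d"] = out[price_col].pct_change()
    out["ma_5"] = out[price_col].rolling(5).mean()
    out["volatility_5"] = out["return_1d"].rolling(5).std()
    # RSI: relative strength of average gains vs average losses.
    delta = out[price_col].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + gain / loss)
    return out
```

The Reddit-side features (post counts, sentiment aggregates) are joined onto these by timestamp to form the final training rows.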

📈 Dashboard

Launch Dashboard

```bash
streamlit run orchestration/app.py
```

Features

  • 📊 Real-time Overview - Current prices, changes, Reddit activity
  • 📈 Stock Analysis - Candlestick charts, volume, returns
  • 💬 Reddit Activity - Post frequency, sentiment trends
  • 🔗 Correlation Analysis - Reddit sentiment vs stock movements
  • 🤖 Predictions - ML model forecasts with accuracy metrics
  • 🔄 Auto-refresh - Updates every 30 seconds

Dashboard Tabs

  1. Overview - Key metrics and combined visualization
  2. Stock Analysis - Price charts, volume, returns
  3. Reddit Activity - Post frequency and engagement
  4. Correlation - Heatmaps showing relationships
  5. Predictions - Model predictions and accuracy

🔧 Utilities & Scripts

Kafka Management

```bash
# Clean all Kafka topics
python data_collection/clean_kafka_topics.py

# Read messages from topics
python data_collection/read_kafka_messages.py
```

Data Validation

```bash
# Verify dataset completeness
python data_collection/verify_datasets.py

# Test Spark connectivity
python data_collection/test_spark.py
```

MongoDB Operations

```bash
# MongoDB utilities and queries
python orchestration/mongo.py
```

📊 Dataset Details

Time Period

  • Full Dataset: September 29, 2020 - August 16, 2021 (10.5 months)
  • Training: September 2020 - March 2021 (6 months)
  • Simulation/Testing: April 2021 - August 2021 (4.5 months)
  • Historical Context: GameStop short squeeze (Jan 2021) and meme stock era

Data Volume

  • Reddit Posts: ~1,000,000 posts from r/wallstreetbets
  • Stock Records: ~1,000 daily OHLCV records per ticker
  • Total Tickers: 8 stocks (GME, AMC, TSLA, AAPL, BB, NOK, PLTR, SPCE)

MongoDB Collections

  • reddit_raw - Raw Reddit posts
  • stock_raw - Raw stock prices
  • reddit_features_15m - Aggregated 15-minute Reddit features
  • predictions - Model predictions and evaluation metrics
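
The `reddit_features_15m` collection suggests raw posts are aggregated into 15-minute windows before modeling. A pandas sketch of that aggregation (column names are assumptions about the raw schema):

```python
import pandas as pd

def aggregate_15m(posts: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw Reddit posts into 15-minute feature windows.

    Assumes columns: created_utc (timestamp), title, sentiment, score.
    """
    posts = posts.set_index(pd.to_datetime(posts["created_utc"]))
    agg = posts.resample("15min").agg(
        {"title": "count", "sentiment": "mean", "score": "sum"}
    )
    return agg.rename(columns={
        "title": "post_count",
        "sentiment": "avg_sentiment",
        "score": "total_score",
    })
```

In the real pipeline this aggregation runs in Spark over the streamed posts, with the resulting windows written to the `reddit_features_15m` collection.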

🎓 Academic Context

Course: Big Data & Applications 2025-2026
Professor: Yasser El Madani El Alami
Institution: ISMAGI


👥 Team Members

| Member   | Role                      | Responsibilities                          |
|----------|---------------------------|-------------------------------------------|
| Member 1 | Data Collection Engineer  | Data acquisition, Kafka setup, producers  |
| Member 2 | Data Processing Engineer  | Spark pipelines, ETL, data cleaning       |
| Member 3 | ML Engineer               | Model training, feature engineering, MLflow |
| Member 4 | Integration Engineer      | Dashboard, orchestration, deployment      |

🚦 Project Status

βœ… Completed

  • Docker infrastructure (Kafka, Spark, MongoDB, Airflow, MLflow)
  • Data collection scripts (Stock + Reddit)
  • Historical data download (Sep 2020-Aug 2021)
  • Data splitting (train/simulate)
  • Kafka topics and producers
  • Spark streaming consumer
  • MongoDB storage schema
  • Reddit & Stock processing pipelines
  • Feature engineering pipeline
  • Multiple ML models (Baseline, Linear, XGBoost, LSTM)
  • Streamlit dashboard with 5 tabs
  • Predictor service
  • Auto-refresh functionality

🔄 In Progress

  • Airflow DAG automation
  • Model performance optimization
  • Real-time prediction inference
  • Advanced sentiment analysis

📋 Future Enhancements

  • Deploy to cloud (AWS/GCP)
  • Add more tickers
  • Implement reinforcement learning
  • Real-time Twitter sentiment
  • Backtesting framework

πŸ› Troubleshooting

Common Issues

Kafka not starting

```bash
docker-compose down -v
docker-compose up -d
./init-kafka.sh
```

Spark consumer errors

```bash
# Check Spark is running
python data_collection/test_spark.py

# Check Kafka has messages
python data_collection/read_kafka_messages.py
```

MongoDB connection issues

```bash
# Restart MongoDB
docker-compose restart mongo

# Verify connection
python orchestration/mongo.py
```

Dashboard not showing data

```bash
# Verify MongoDB has data
python orchestration/mongo.py

# Clear cache and refresh: click "Refresh Data" in the sidebar
```


📄 License

This project is developed for academic purposes as part of the Big Data & Applications course.


πŸ™ Acknowledgments

  • Professor Yasser El Madani El Alami for guidance
  • r/wallstreetbets community for the interesting case study
  • Kaggle for providing the Reddit dataset
  • Stooq for stock market data API

Last Updated: January 2026
Version: 1.0.0


For questions or issues, please contact the team members or create an issue in the repository.
