# Stockmarket-Bigdata-Project

Real-time stock market prediction system combining Reddit sentiment analysis with financial data, built on modern big data technologies.
## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Tech Stack](#tech-stack)
- [Project Structure](#project-structure)
- [Setup & Installation](#setup--installation)
- [Data Pipeline](#data-pipeline)
- [ML Models](#ml-models)
- [Dashboard](#dashboard)
- [Team](#team)
## Overview

This project implements an end-to-end big data pipeline for stock price prediction by analyzing social media sentiment from Reddit's r/wallstreetbets community. The system processes historical data spanning September 29, 2020 to August 16, 2021, covering the GameStop short squeeze and the broader meme stock phenomenon, and uses machine learning models to predict stock movements.
### Key Features

- Real-time data streaming using Kafka
- Distributed processing with Apache Spark
- Multiple ML models (Baseline, LSTM, Linear Regression, XGBoost)
- Interactive dashboard with live predictions
- Sentiment analysis of Reddit posts
- Technical indicators and feature engineering
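The sentiment-analysis feature above is not detailed in this README. As a purely illustrative sketch, assuming a simple keyword lexicon (the actual pipeline may well use a proper sentiment library; `score_post` and the word lists are inventions of this example):

```python
# Illustrative only: a minimal keyword-lexicon sentiment scorer.
# The project's real scoring method is not shown in this README;
# `score_post` and the word lists below are assumptions for this sketch.

BULLISH = {"moon", "rocket", "buy", "calls", "hold", "diamond"}
BEARISH = {"puts", "sell", "crash", "dump", "short", "drill"}

def score_post(text: str) -> float:
    """Return a sentiment score in [-1, 1] from keyword counts."""
    words = text.lower().split()
    pos = sum(w.strip(".,!?") in BULLISH for w in words)
    neg = sum(w.strip(".,!?") in BEARISH for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(score_post("GME to the moon, buy and hold!"))  # → 1.0
```

Per-post scores like this could then be averaged per time window to yield the sentiment features used downstream.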
### Data Period

September 29, 2020 - August 16, 2021 (covering the GameStop short squeeze and the broader meme stock phenomenon)

### Tracked Tickers

GME, AMC, TSLA, AAPL, BB, NOK, PLTR, SPCE
## Architecture

### System Architecture

```mermaid
graph TB
    subgraph "Data Layer"
        A1[Historical Reddit Data<br/>~1M posts]
        A2[Stock Prices<br/>Stooq API]
    end

    subgraph "Ingestion"
        B1[Data Splitter]
        B2[Training/Simulation Split]
    end

    subgraph "Streaming - Kafka"
        C1[Topic: reddit-data]
        C2[Topic: stock-data]
        C3[Zookeeper]
        C4[Data Replayer]
    end

    subgraph "Processing - Spark"
        D1[Spark Consumer]
        D2[Reddit Pipeline]
        D3[Stock Pipeline]
        D4[Data Fusion]
    end

    subgraph "Storage"
        E1[(MongoDB)]
        E2[Collections:<br/>reddit_raw<br/>stock_raw<br/>reddit_features_15m<br/>predictions]
    end

    subgraph "ML Layer"
        F1[Training Dataset Builder]
        F2[Baseline Model]
        F3[LSTM Model]
        F4[Linear Regression]
        F5[XGBoost Model]
        F6[MLflow Tracking]
    end

    subgraph "Service Layer"
        G1[Predictor Service]
        G2[Relayer Simulator]
        G3[Streamlit Dashboard]
    end

    A1 & A2 --> B1 --> B2
    B2 --> C4 --> C1 & C2
    C3 -.-> C1 & C2
    C1 --> D1 --> D2
    C2 --> D1 --> D3
    D2 & D3 --> D4 --> E1
    E1 --> F1
    F1 --> F2 & F3 & F4 & F5
    F2 & F3 & F4 & F5 -.-> F6
    E1 --> G1 --> G3
    G2 -.-> C4

    style C1 fill:#f3e5f5
    style E1 fill:#fff9c4
    style F5 fill:#c8e6c9
    style G3 fill:#bbdefb
```
### Data Flow

```mermaid
flowchart LR
    A[Raw Data] --> B[Split Script]
    B --> C[Train Set<br/>Sep 2020-Mar 2021]
    B --> D[Simulate Set<br/>Apr-Aug 2021]
    C --> E[Training Pipeline]
    E --> F[Model Training]
    F --> G[Saved Models]
    D --> H[Kafka Producer]
    H --> I[Spark Consumer]
    I --> J[MongoDB]
    J --> K[Predictor Service]
    K --> L[Dashboard]
    G -.-> K

    style E fill:#e1f5dd
    style F fill:#fce4ec
    style K fill:#fff3e0
```
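The Data Replayer in the diagrams above feeds historical rows into the `reddit-data` and `stock-data` Kafka topics. As a hedged sketch of what serializing one stock row might look like (the field names are assumptions, and the real `producer_training_data.py` may differ):

```python
import json
from datetime import datetime, timezone

# Hypothetical sketch of how the replayer might serialize one CSV row
# into a Kafka message. Field names are assumptions; the project's
# actual producer_training_data.py may differ.

def to_kafka_message(row: dict) -> bytes:
    """Serialize a stock row as the JSON payload for the stock-data topic."""
    payload = {
        "ticker": row["ticker"],
        "ts": row["timestamp"],  # original event time, replayed as-is
        "open": float(row["open"]),
        "close": float(row["close"]),
        "volume": int(row["volume"]),
        "replayed_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(payload).encode("utf-8")

# With kafka-python, sending would look roughly like:
# producer = KafkaProducer(bootstrap_servers="localhost:29092")
# producer.send("stock-data", to_kafka_message(row))
```

Keeping the original event timestamp alongside a replay timestamp lets the Spark consumer window on event time while the replayer controls playback speed.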
## Tech Stack

- **Apache Kafka** - Distributed streaming platform
- **Apache Spark** - Distributed data processing
- **Zookeeper** - Kafka coordination service
- **MongoDB** - NoSQL database for time-series data
- **TensorFlow/Keras** - Deep learning (LSTM models)
- **XGBoost** - Gradient boosting for predictions
- **Scikit-learn** - Classical ML algorithms
- **Pandas & NumPy** - Data manipulation
- **MLflow** - Model tracking and versioning
- **Docker & Docker Compose** - Containerization
- **Apache Airflow** - Workflow orchestration
- **Streamlit** - Interactive dashboard
- **Python 3.11** - Primary language
- **Stooq API** - Historical stock prices
- **Reddit/Kaggle** - r/wallstreetbets posts dataset
## Project Structure

```
Stockmarket-Bigdata-Project/
│
├── data/                          # Data directory (gitignored)
│   ├── raw/                       # Original datasets
│   │   ├── reddit_wsb.csv         # ~1M Reddit posts
│   │   └── stock_prices.csv       # Stock OHLCV data
│   ├── train/                     # Training data (Sep 2020-Mar 2021)
│   │   ├── reddit_train.csv
│   │   └── stock_train.csv
│   └── simulate/                  # Simulation data (Apr-Aug 2021)
│       ├── reddit_sim.csv
│       └── stock_sim.csv
│
├── data_collection/               # Data ingestion scripts
│   ├── download_finance_stooq.py  # Stock price downloader
│   ├── split_data.py              # Train/test splitter
│   ├── producer_training_data.py  # Kafka producer
│   ├── spark_consumer.py          # Spark streaming consumer
│   ├── read_kafka_messages.py     # Kafka debugging tool
│   ├── clean_kafka_topics.py      # Topic cleanup utility
│   ├── test_spark.py              # Spark connection test
│   └── verify_datasets.py         # Data validation
│
├── data_processing/               # ETL pipelines
│   ├── reddit_pipeline.py         # Reddit data cleaning
│   ├── stock_pipeline.py          # Stock data processing
│   └── build_training_dataset.py  # Feature engineering
│
├── ml_models/                     # Machine learning models
│   ├── 01_train_baseline_model.ipynb     # Baseline model
│   ├── 02_train_baseline_model.ipynb     # Improved baseline
│   ├── 03_train_LSTM_Model.ipynb         # LSTM deep learning
│   ├── 05_Linear_regression_model.ipynb  # Linear regression
│   ├── 06_XGBoost_model.ipynb            # XGBoost ensemble
│   ├── baseline_model.joblib             # Saved baseline
│   ├── price_predictor_v1.joblib         # Saved predictor v1
│   └── xgboost_reddit_stock_model.pkl    # Saved XGBoost
│
├── orchestration/                 # Application layer
│   ├── airflow_dags/              # Airflow DAG definitions
│   ├── app.py                     # Streamlit dashboard
│   ├── mongo.py                   # MongoDB utilities
│   ├── predictor_service.py       # ML inference service
│   ├── predit.py                  # Prediction helper
│   └── relayer_simulator.py       # Data replay service
│
├── volumes/                       # Docker persistent volumes
│   ├── airflow/
│   │   ├── logs/
│   │   └── plugins/
│   └── mlflow/
│
├── docker-compose.yml             # Service orchestration
├── init-kafka.sh                  # Kafka initialization
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
## Setup & Installation

### Prerequisites

- Docker & Docker Compose
- Python 3.11+
- 16GB RAM minimum (for Spark)
- ~10GB disk space
### 1. Clone the Repository

```bash
git clone <repository-url>
cd Stockmarket-Bigdata-Project
```

### 2. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Download the Datasets

**Stock Data (automated):**

```bash
python data_collection/download_finance_stooq.py
```

**Reddit Data (manual):**

- Download the Reddit WallStreetBets Posts dataset from Kaggle
- Place `reddit_wsb.csv` in `data/raw/`

Then verify and split the data:

```bash
python data_collection/verify_datasets.py
python data_collection/split_data.py
```

### 4. Start the Infrastructure

```bash
docker-compose up -d
chmod +x init-kafka.sh
./init-kafka.sh
```

### 5. Verify the Setup

```bash
# Check all containers are running
docker-compose ps

# Test Spark connection
python data_collection/test_spark.py
```

### Service Endpoints

| Service | URL | Credentials | Purpose |
|---|---|---|---|
| Dashboard | http://localhost:8501 | - | Interactive visualization |
| Airflow | http://localhost:8081 | admin / admin | Workflow management |
| MLflow | http://localhost:5000 | - | Model tracking |
| Spark UI | http://localhost:8080 | - | Spark monitoring |
| Kafka | localhost:29092 | - | Message broker |
| MongoDB | localhost:27017 | - | Database |
## Data Pipeline

### Step 1: Collect Data

```bash
# Download stock prices from Stooq
python data_collection/download_finance_stooq.py

# Verify dataset integrity
python data_collection/verify_datasets.py
```

### Step 2: Split Data

```bash
# Split into training (Sep 2020-Mar 2021) and simulation (Apr-Aug 2021)
python data_collection/split_data.py
```

### Step 3: Stream Data

```bash
# Start Kafka producer to replay historical data
python data_collection/producer_training_data.py

# Start Spark consumer in another terminal
python data_collection/spark_consumer.py
```

### Step 4: Inspect the Pipeline

```bash
# Read Kafka messages for debugging
python data_collection/read_kafka_messages.py

# Check MongoDB collections
python orchestration/mongo.py
```

## ML Models

| Model | File | Type | Performance |
|---|---|---|---|
| Baseline | `baseline_model.joblib` | Random Forest | Baseline metrics |
| Linear Regression | `price_predictor_v1.joblib` | Linear | Fast inference |
| XGBoost | `xgboost_reddit_stock_model.pkl` | Ensemble | Best accuracy |
| LSTM | Notebook only | Deep Learning | Sequential patterns |
### Training Workflow

1. **Build the training dataset**

   ```bash
   python data_processing/build_training_dataset.py
   ```

2. **Train the models** (run the Jupyter notebooks)

   ```bash
   jupyter notebook ml_models/
   ```

   - `01_train_baseline_model.ipynb` - Random Forest baseline
   - `03_train_LSTM_Model.ipynb` - LSTM for time series
   - `05_Linear_regression_model.ipynb` - Linear regression
   - `06_XGBoost_model.ipynb` - XGBoost ensemble

3. **Track with MLflow**

   MLflow automatically tracks experiments; view them at http://localhost:5000.
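The baseline training step could look roughly like the sketch below. This is not the project's actual notebook code: the feature columns and label are synthetic stand-ins, and only the overall shape (Random Forest on fused Reddit/stock features, holdout evaluation) mirrors the workflow above.

```python
# Illustrative sketch of the baseline training step, not the project's
# actual notebook code; features and labels below are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the fused Reddit/stock feature table
n = 500
X = np.column_stack([
    rng.normal(0, 1, n),   # e.g. 15-min return
    rng.normal(0, 1, n),   # e.g. mean sentiment score
    rng.poisson(20, n),    # e.g. post count
])
# Toy label: next-interval direction, loosely driven by sentiment
y = (X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(f"holdout accuracy: {acc:.2f}")
```

Artifacts such as `baseline_model.joblib` would then come from something like `joblib.dump(model, "ml_models/baseline_model.joblib")`.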
### Feature Engineering

- **Stock Features**: OHLCV, returns, volatility, moving averages
- **Reddit Features**: Post count, sentiment scores, engagement metrics
- **Time Features**: Hour, day of week, market hours
- **Technical Indicators**: RSI, MACD, Bollinger Bands
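A few of the stock features listed above can be sketched in pandas. This is a hedged illustration; the real `build_training_dataset.py` may compute them differently (in particular, this RSI uses a simple rolling mean rather than Wilder's smoothing):

```python
import pandas as pd

# Hedged sketch of a few stock features; the project's real
# build_training_dataset.py may compute these differently.

def add_stock_features(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    out = df.copy()
    out["return"] = out["close"].pct_change()
    out["sma"] = out["close"].rolling(window).mean()
    out["volatility"] = out["return"].rolling(window).std()

    # Simple RSI: ratio of average gains to average losses over the window
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    return out

prices = pd.DataFrame({"close": [10, 11, 12, 11, 13, 14, 13, 15,
                                 16, 15, 17, 18, 17, 19, 20, 21]})
features = add_stock_features(prices)
```

An uptrending series like the toy one here yields an RSI well above 50, as expected.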
## Dashboard

Launch the dashboard with:

```bash
streamlit run orchestration/app.py
```

### Features

- **Real-time Overview** - Current prices, changes, Reddit activity
- **Stock Analysis** - Candlestick charts, volume, returns
- **Reddit Activity** - Post frequency, sentiment trends
- **Correlation Analysis** - Reddit sentiment vs. stock movements
- **Predictions** - ML model forecasts with accuracy metrics
- **Auto-refresh** - Updates every 30 seconds

### Tabs

1. **Overview** - Key metrics and combined visualization
2. **Stock Analysis** - Price charts, volume, returns
3. **Reddit Activity** - Post frequency and engagement
4. **Correlation** - Heatmaps showing relationships
5. **Predictions** - Model predictions and accuracy
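As one small illustration of the kind of helper the overview tab might use, the price-change badge could be formatted like this (a hypothetical function, not taken from `app.py`):

```python
# Hypothetical helper of the kind the dashboard's overview tab might
# use to format a price change; not taken from the real app.py.
def format_change(prev: float, curr: float) -> str:
    """Render a percent change with an up/down arrow, e.g. '▲ +4.20%'."""
    pct = (curr - prev) / prev * 100
    arrow = "▲" if pct >= 0 else "▼"
    return f"{arrow} {pct:+.2f}%"

print(format_change(100.0, 104.2))  # → ▲ +4.20%
```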
### Utilities

```bash
# Clean all Kafka topics
python data_collection/clean_kafka_topics.py

# Read messages from topics
python data_collection/read_kafka_messages.py

# Verify dataset completeness
python data_collection/verify_datasets.py

# Test Spark connectivity
python data_collection/test_spark.py

# MongoDB utilities and queries
python orchestration/mongo.py
```

## Dataset

### Time Coverage

- **Full Dataset**: September 29, 2020 - August 16, 2021 (~10.5 months)
- **Training**: September 2020 - March 2021 (6 months)
- **Simulation/Testing**: April 2021 - August 2021 (4.5 months)
- **Historical Context**: GameStop short squeeze (Jan 2021) and the meme stock era
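The date-based split described above can be sketched with pandas. This is a hedged illustration; the real `split_data.py` may use different column names and cutoff handling:

```python
import pandas as pd

# Hedged sketch of the train/simulate split; the project's real
# split_data.py may use different column names and cutoff handling.
CUTOFF = pd.Timestamp("2021-04-01")

def split_by_date(df: pd.DataFrame, ts_col: str = "timestamp"):
    """Rows before CUTOFF go to training, the rest to simulation."""
    df = df.assign(**{ts_col: pd.to_datetime(df[ts_col])})
    train = df[df[ts_col] < CUTOFF]
    simulate = df[df[ts_col] >= CUTOFF]
    return train, simulate

rows = pd.DataFrame({
    "timestamp": ["2020-10-01", "2021-01-28", "2021-03-31",
                  "2021-04-02", "2021-08-16"],
    "close": [10.0, 347.5, 12.0, 14.0, 16.0],
})
train, simulate = split_by_date(rows)
```

Splitting strictly by time (rather than randomly) keeps the simulation set free of look-ahead leakage, which matters for replaying it through Kafka as if it were live.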
### Volume

- **Reddit Posts**: ~1,000,000 posts from r/wallstreetbets
- **Stock Records**: ~1,000 daily OHLCV records per ticker
- **Total Tickers**: 8 stocks (GME, AMC, TSLA, AAPL, BB, NOK, PLTR, SPCE)

### MongoDB Collections

- `reddit_raw` - Raw Reddit posts
- `stock_raw` - Raw stock prices
- `reddit_features_15m` - Aggregated 15-minute Reddit features
- `predictions` - Model predictions and evaluation metrics
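The `reddit_features_15m` collection holds posts aggregated into 15-minute buckets. A hedged sketch of that aggregation in pandas (column names are assumptions, not the project's actual schema):

```python
import pandas as pd

# Hypothetical sketch of building 15-minute Reddit features like those
# stored in `reddit_features_15m`; column names are assumptions.
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2021-01-27 09:31", "2021-01-27 09:40", "2021-01-27 09:44",
        "2021-01-27 09:50", "2021-01-27 10:02",
    ]),
    "score": [120, 45, 300, 10, 80],          # upvotes as engagement proxy
    "sentiment": [0.8, 0.1, 0.9, -0.4, 0.3],  # per-post sentiment scores
})

features_15m = posts.groupby(pd.Grouper(key="created", freq="15min")).agg(
    post_count=("score", "size"),
    mean_sentiment=("sentiment", "mean"),
    total_score=("score", "sum"),
)
```

Each resulting row is one 15-minute bucket, which is the shape a Spark job could also produce before writing to MongoDB.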
## Team

- **Course**: Big Data & Applications 2025-2026
- **Professor**: Yasser El Madani El Alami
- **Institution**: ISMAGI

| Member | Role | Responsibilities |
|---|---|---|
| Member 1 | Data Collection Engineer | Data acquisition, Kafka setup, producers |
| Member 2 | Data Processing Engineer | Spark pipelines, ETL, data cleaning |
| Member 3 | ML Engineer | Model training, feature engineering, MLflow |
| Member 4 | Integration Engineer | Dashboard, orchestration, deployment |
### Completed

- Docker infrastructure (Kafka, Spark, MongoDB, Airflow, MLflow)
- Data collection scripts (stock + Reddit)
- Historical data download (Sep 2020-Aug 2021)
- Data splitting (train/simulate)
- Kafka topics and producers
- Spark streaming consumer
- MongoDB storage schema
- Reddit & stock processing pipelines
- Feature engineering pipeline
- Multiple ML models (Baseline, Linear, XGBoost, LSTM)
- Streamlit dashboard with 5 tabs
- Predictor service
- Auto-refresh functionality
### Future Improvements

- Airflow DAG automation
- Model performance optimization
- Real-time prediction inference
- Advanced sentiment analysis
- Cloud deployment (AWS/GCP)
- Additional tickers
- Reinforcement learning
- Real-time Twitter sentiment
- Backtesting framework
## Troubleshooting

**Kafka not starting**

```bash
docker-compose down -v
docker-compose up -d
./init-kafka.sh
```

**Spark consumer errors**

```bash
# Check Spark is running
python data_collection/test_spark.py

# Check Kafka has messages
python data_collection/read_kafka_messages.py
```

**MongoDB connection issues**

```bash
# Restart MongoDB
docker-compose restart mongo

# Verify connection
python orchestration/mongo.py
```

**Dashboard not showing data**

```bash
# Verify MongoDB has data
python orchestration/mongo.py
```

Then clear the cache and refresh by clicking "Refresh Data" in the sidebar.

## Resources

- Apache Kafka Documentation
- Apache Spark Streaming Guide
- MongoDB Time Series Collections
- Streamlit Documentation
- XGBoost Documentation
## License

This project is developed for academic purposes as part of the Big Data & Applications course.
## Acknowledgments

- Professor Yasser El Madani El Alami for guidance
- The r/wallstreetbets community for an interesting case study
- Kaggle for providing the Reddit dataset
- Stooq for the stock market data API
---

**Last Updated**: January 2026 · **Version**: 1.0.0

For questions or issues, please contact the team members or open an issue in the repository.