A distributed NLP pipeline for classifying Twitter sentiment (positive/negative) using Apache Spark, scikit-learn, and real-time tweet streaming. Built as an academic project (CS644 — Big Data) combining batch ML classification with live streaming ingestion.
| Area | Tools |
|---|---|
| Distributed Computing | Apache Spark (PySpark), SparkSQL |
| NLP / ML | scikit-learn, NLTK, TF-IDF, SVM |
| Real-Time Streaming | PySpark Structured Streaming |
| Data | Sentiment140 dataset (1.6M tweets) |
| Language | Python |

| File | Description |
|---|---|
| `main.py` | PySpark batch pipeline: loads train/test CSV with SparkSQL, renames and cleans columns, prepares data for distributed processing |
| `ml.py` | scikit-learn ML pipeline: TF-IDF vectorization with NLTK tokenizer + SVM classifier with 5-fold cross-validation |
| `streaming.py` | PySpark Structured Streaming pipeline: ingests live tweet stream and applies sentiment classification in real time |
| `streaming2.py` | Alternative streaming pipeline with a different ingestion configuration |
| `trainingandtestdata/` | Sentiment140 train and test CSV files |
| `PROJECT REPORT CS644.pdf` | Full academic project report |

- Loads 10,000 labeled tweets from the Sentiment140 dataset
- Tokenizes text using NLTK word tokenizer
- Applies CountVectorizer → TF-IDF transformation
- Trains a linear SVM classifier
- Evaluates with 5-fold cross-validation accuracy
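
The five steps above (which appear to describe `ml.py`) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `nltk_tokens`, `build_pipeline`, and `mean_cv_accuracy` are hypothetical, and `LinearSVC` is assumed as the linear SVM implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def nltk_tokens(text):
    """Tokenize with NLTK's word tokenizer (needs the NLTK 'punkt' data installed)."""
    from nltk.tokenize import word_tokenize  # imported lazily so NLTK stays optional
    return word_tokenize(text)


def build_pipeline(tokenizer=nltk_tokens):
    """CountVectorizer -> TF-IDF -> linear SVM, mirroring the listed steps."""
    return Pipeline([
        # token_pattern=None silences the warning raised when a custom tokenizer is set
        ("counts", CountVectorizer(tokenizer=tokenizer, token_pattern=None)),
        ("tfidf", TfidfTransformer()),
        ("svm", LinearSVC()),  # assumption: LinearSVC as the linear SVM
    ])


def mean_cv_accuracy(texts, labels, folds=5, tokenizer=nltk_tokens):
    """Return the mean accuracy over `folds` stratified cross-validation splits."""
    scores = cross_val_score(
        build_pipeline(tokenizer), texts, labels, cv=folds, scoring="accuracy"
    )
    return float(scores.mean())
```

Calling `mean_cv_accuracy(texts, labels)` on the 10,000 loaded tweets and their polarity labels would give a single cross-validated accuracy figure. Passing a simpler tokenizer such as `str.split` is a quick way to try the pipeline without downloading NLTK data.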
- Initializes a local Spark cluster with SparkSession
- Loads train and test CSVs into Spark DataFrames
- Renames columns (Polarity, ID, Text), drops unused fields
- Handles nulls and prepares data for large-scale processing
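
A rough sketch of the batch-loading steps above (which appear to describe `main.py`) might look like this. The helper names `make_session` and `load_tweets` are hypothetical, and dropping rows with a null `Text` is only one assumed way to handle nulls; the real script may do something different.

```python
# Column order of the Sentiment140 CSV files (they ship without a header row).
ALL_COLUMNS = ["Polarity", "ID", "Date", "Query", "User", "Text"]
# Columns kept after the unused fields are dropped.
KEPT_COLUMNS = ["Polarity", "ID", "Text"]


def make_session(app_name="sentiment-batch"):
    """Start (or reuse) a local Spark session that uses all available cores."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark and a JVM
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def load_tweets(spark, path):
    """Read one CSV file, name its columns, keep three of them, drop null text."""
    raw = spark.read.csv(path, header=False, inferSchema=True)
    named = raw.toDF(*ALL_COLUMNS)  # replaces Spark's default _c0.._c5 names
    # Assumption: rows without tweet text are simply discarded.
    return named.select(*KEPT_COLUMNS).na.drop(subset=["Text"])
```

With a session from `make_session()`, the same `load_tweets` call can be used once for the train file and once for the test file under `trainingandtestdata/`.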
- Uses PySpark Structured Streaming to consume a live tweet feed
- Applies the trained sentiment model to incoming records in near real-time
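
One way the streaming steps above could be wired together is sketched below. Several things here are assumptions rather than facts about `streaming.py`: the tweet feed is taken to arrive as text lines on a local socket (hypothetical `host` and `port`), the trained model is taken to be a fitted scikit-learn pipeline, and predictions are simply printed to the console. The names `predict_batch` and `start_scoring_query` are illustrative.

```python
def predict_batch(predict_fn, texts):
    """Apply a model's predict function to a list of texts, tolerating empty batches."""
    if not texts:
        return []
    return [int(label) for label in predict_fn(texts)]


def start_scoring_query(spark, model, host="localhost", port=9999):
    """Start a streaming query that labels each incoming line with a polarity."""
    # Imported lazily so predict_batch can be used without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf

    @pandas_udf("long")
    def score(text: pd.Series) -> pd.Series:
        # Runs on the executors, one pandas batch at a time.
        return pd.Series(predict_batch(model.predict, text.tolist()), dtype="int64")

    # Assumption: a socket source; it exposes each line in a column named "value".
    tweets = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )
    scored = tweets.select(
        col("value").alias("Text"), score(col("value")).alias("Polarity")
    )
    # Assumption: results go to the console sink in append mode.
    return scored.writeStream.outputMode("append").format("console").start()
```

The returned query object keeps running in the background; calling `awaitTermination()` on it blocks until the stream is stopped. Spark's socket source is intended for testing, so a production feed would more likely use a source such as Kafka.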
Sentiment140 — 1.6 million tweets labeled as positive (4) or negative (0), collected via the Twitter API.
Columns: Polarity, ID, Date, Query, User, Text
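
To make the column layout and label encoding concrete, here is a small standard-library sketch that parses rows in this format and maps the numeric polarity to a readable label. The sample row is invented for illustration and is not taken from the dataset; `parse_rows` and `LABELS` are hypothetical names.

```python
import csv
import io

COLUMNS = ["Polarity", "ID", "Date", "Query", "User", "Text"]
LABELS = {0: "negative", 4: "positive"}  # encoding described above


def parse_rows(handle):
    """Yield the polarity label, ID, and text for each headerless CSV row."""
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        yield {
            "Polarity": LABELS[int(row["Polarity"])],
            "ID": int(row["ID"]),
            "Text": row["Text"],
        }


# Invented sample row, shaped like the six columns listed above.
sample = io.StringIO(
    '"4","1","Mon Jan 01 00:00:00 UTC 2009","NO_QUERY","example_user","what a great day"\n'
)
rows = list(parse_rows(sample))
```

When reading the real files from disk, the file handle may need an explicit encoding such as `latin-1`, since the tweets are not guaranteed to be valid UTF-8.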
Gkeri Pepelasi