Twitter Sentiment Analysis — PySpark + NLP + Real-Time Streaming

A distributed NLP pipeline for classifying Twitter sentiment (positive/negative) using Apache Spark, scikit-learn, and real-time tweet streaming. Built as an academic project (CS644 — Big Data) combining batch ML classification with live streaming ingestion.

Tech Stack

| Area | Tools |
| --- | --- |
| Distributed Computing | Apache Spark (PySpark), SparkSQL |
| NLP / ML | scikit-learn, NLTK, TF-IDF, SVM |
| Real-Time Streaming | PySpark Structured Streaming |
| Data | Sentiment140 dataset (1.6M tweets) |
| Language | Python |

Files

| File | Description |
| --- | --- |
| `main.py` | PySpark batch pipeline: loads train/test CSV with SparkSQL, renames and cleans columns, prepares data for distributed processing |
| `ml.py` | scikit-learn ML pipeline: TF-IDF vectorization with NLTK tokenizer + SVM classifier with 5-fold cross-validation |
| `streaming.py` | PySpark Structured Streaming pipeline: ingests live tweet stream and applies sentiment classification in real time |
| `streaming2.py` | Alternative streaming pipeline with different ingestion configuration |
| `trainingandtestdata/` | Sentiment140 train and test CSV files |
| `PROJECT REPORT CS644.pdf` | Full academic project report |

How It Works

Batch Classification (`ml.py`)

Loads 10,000 labeled tweets from the Sentiment140 dataset
Tokenizes text using NLTK word tokenizer
Applies CountVectorizer → TF-IDF transformation
Trains a linear SVM classifier
Evaluates with 5-fold cross-validation accuracy
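
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration rather than the contents of `ml.py`: the ten toy tweets stand in for the 10,000 Sentiment140 rows, `LinearSVC` is assumed as the linear SVM, and NLTK's `TreebankWordTokenizer` is swapped in for `word_tokenize` so the example runs without downloading the `punkt` data.

```python
from nltk.tokenize import TreebankWordTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up sample data; the project loads 10,000 labeled Sentiment140 tweets.
# Labels follow the dataset convention: 4 = positive, 0 = negative.
texts = [
    "I love this sunny day", "what a great concert", "best coffee ever",
    "so happy with my new phone", "this movie was wonderful",
    "I hate waiting in traffic", "worst service ever", "my flight got cancelled",
    "so sad about the news", "this headache is awful",
]
labels = [4, 4, 4, 4, 4, 0, 0, 0, 0, 0]

_treebank = TreebankWordTokenizer()


def tokenize(text):
    """Split a tweet into word tokens with NLTK (no corpus download needed)."""
    return _treebank.tokenize(text)


pipeline = Pipeline([
    # Raw token counts, using the NLTK tokenizer instead of the default regex.
    ("counts", CountVectorizer(tokenizer=tokenize, token_pattern=None)),
    # Re-weight the counts into TF-IDF features.
    ("tfidf", TfidfTransformer()),
    # Linear support vector classifier.
    ("svm", LinearSVC()),
])

# 5-fold cross-validation; one accuracy value per fold.
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

With only ten toy tweets the accuracies are not meaningful; the point is the shape of the pipeline and the evaluation call.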

Distributed Batch (`main.py`)

Starts Spark in local mode through a SparkSession
Loads train and test CSVs into Spark DataFrames
Renames columns (Polarity, ID, Text), drops unused fields
Handles nulls and prepares data for large-scale processing

Real-Time Streaming (`streaming.py`)

Uses PySpark Structured Streaming to consume a live tweet feed
Applies the trained sentiment model to incoming records in near real-time

Dataset

Sentiment140 — 1.6 million tweets labeled as positive (4) or negative (0), collected via the Twitter API.
Columns: Polarity, ID, Date, Query, User, Text
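
As a small illustration of this layout, the snippet below parses two rows with the standard `csv` module and maps the polarity codes to label names. The rows are made up, and the fully quoted, headerless format is assumed to match the files in `trainingandtestdata/`.

```python
import csv
import io

COLUMNS = ["Polarity", "ID", "Date", "Query", "User", "Text"]
LABELS = {"0": "negative", "4": "positive"}

# Made-up rows; the real files have no header line, so names are supplied here.
sample = io.StringIO(
    '"4","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","alice","loving the sunshine today"\n'
    '"0","2","Mon Apr 06 22:20:10 PDT 2009","NO_QUERY","bob","missed my bus, again"\n'
)

reader = csv.DictReader(sample, fieldnames=COLUMNS)
records = [{"label": LABELS[row["Polarity"]], "text": row["Text"]} for row in reader]

for record in records:
    print(record["label"], "|", record["text"])
```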

Author

Gkeri Pepelasi
