Twitter Big Data Analysis Project
Overview
This project demonstrates the analysis of a large-scale Twitter dataset using Apache Spark. The main goal is to process, clean, and analyze the data to extract meaningful insights such as tweet trends, sentiment distribution, and word frequencies. The project also includes a machine learning pipeline to perform sentiment analysis using logistic regression.
Features
• Big Data Handling: Efficiently processes large datasets using PySpark.
• Data Cleaning and Preparation: Cleans and preprocesses raw Twitter data.
• Time-Series Analysis: Visualizes tweet volume trends over time.
• Sentiment Analysis: Analyzes and visualizes sentiment distribution across tweets.
• Natural Language Processing (NLP): Tokenizes tweets, removes stopwords, and identifies word frequencies.
• Machine Learning: Builds a sentiment prediction model using logistic regression.
Technologies Used
• Apache Spark: For distributed data processing.
• PySpark MLlib: For machine learning and feature engineering.
• Matplotlib: For visualizing trends and insights.
• Python: Primary programming language for implementation.
Project Workflow
- Set Up Spark Session
  o Configures the session with a memory allocation suited to large datasets and sets the legacy time parser policy so that timestamps parse with Spark's pre-3.0 behavior.
- Load and Preprocess Data
  o Loads a CSV dataset containing tweets.
  o Renames columns and selects relevant fields for analysis.
  o Converts timestamps to a consistent format and handles missing values.
- Data Analysis
  o Analyzes tweet volume over time and visualizes it.
  o Groups and visualizes sentiment distribution (positive vs. negative).
- NLP and Word Frequency Analysis
  o Tokenizes tweet text into individual words.
  o Removes stopwords to focus on meaningful terms.
  o Counts and visualizes the most frequent words in tweets.
- Machine Learning Pipeline
  o Builds a pipeline with tokenization, stopword removal, feature vectorization, and logistic regression.
  o Trains and evaluates a sentiment analysis model.
Installation
- Install Python (version 3.8 or higher recommended).
- Install the required libraries, PySpark and Matplotlib (for example, with pip install pyspark matplotlib).
Usage
- Run the notebook file (Project.ipynb) using Jupyter Notebook or Google Colab.
- The notebook processes the dataset and generates visualizations for trends, sentiment, and word frequencies.
- View model predictions for sentiment classification.
Example Visualizations
- Tweet Volume Over Time: Shows trends in tweet activity across timestamps.
- Sentiment Distribution: Bar chart displaying positive and negative sentiment counts.
- Top Words: Highlights the most frequently used words in tweets.
Dataset
• The dataset contains:
  o Sentiment labels (0 for Negative, 4 for Positive).
  o User metadata and original tweet content.
• File: Twitter_Dataset.csv.
Contribution
Feel free to fork this repository and contribute. Submit pull requests for any improvements or new features.
License
This project is licensed under the MIT License. See the LICENSE file for details.