GitHub - SmitPatel06/BigData

Twitter Big Data Analysis Project

Overview

This project demonstrates the analysis of a large-scale Twitter dataset using Apache Spark. The main goal is to process, clean, and analyze the data to extract meaningful insights such as tweet trends, sentiment distribution, and word frequencies. The project also includes a machine learning pipeline to perform sentiment analysis using logistic regression.

Features

• Big Data Handling: Efficiently processes large datasets using PySpark.

• Data Cleaning and Preparation: Cleans and preprocesses raw Twitter data.

• Time-Series Analysis: Visualizes tweet volume trends over time.

• Sentiment Analysis: Analyzes and visualizes sentiment distribution across tweets.

• Natural Language Processing (NLP): Tokenizes tweets, removes stopwords, and identifies word frequencies.

• Machine Learning: Builds a sentiment prediction model using logistic regression.

Technologies Used

• Apache Spark: For distributed data processing.

• PySpark MLlib: For machine learning and feature engineering.

• Matplotlib: For visualizing trends and insights.

• Python: Primary programming language for implementation.

Project Workflow

Set Up Spark Session
  o Configures memory allocation for the large dataset and sets the legacy time parser policy so that pre-Spark 3.0 timestamp patterns still parse.

Load and Preprocess Data
  o Loads a CSV dataset containing tweets.
  o Renames columns and selects relevant fields for analysis.
  o Converts timestamps to a consistent format and handles missing values.

Data Analysis
  o Analyzes tweet volume over time and visualizes it.
  o Groups and visualizes sentiment distribution (positive vs. negative).

NLP and Word Frequency Analysis
  o Tokenizes tweet text into individual words.
  o Removes stopwords to focus on meaningful terms.
  o Counts and visualizes the most frequent words in tweets.

Machine Learning Pipeline
  o Builds a pipeline with tokenization, stopword removal, feature vectorization, and logistic regression.
  o Trains and evaluates a sentiment analysis model.
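
The session setup and pipeline steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function names, the app name, the "4g" memory value, the column names ("text", "label"), and the choice of CountVectorizer for the "feature vectorization" step are all assumptions made for illustration. It also assumes the label column has already been converted to 0/1.

```python
def spark_settings(driver_memory="4g"):
    """Return Spark config entries as a plain dict.

    The memory value is an assumed placeholder; the README only says that
    memory allocation and the legacy time parser policy are configured.
    """
    return {
        "spark.driver.memory": driver_memory,
        "spark.sql.legacy.timeParserPolicy": "LEGACY",
    }


def create_session(app_name="TwitterBigData"):
    """Create (or reuse) a SparkSession with the settings above."""
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_settings().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def build_sentiment_pipeline(text_col="text", label_col="label"):
    """Assemble tokenization, stopword removal, vectorization, and a classifier."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import CountVectorizer, StopWordsRemover, Tokenizer

    tokenizer = Tokenizer(inputCol=text_col, outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    # CountVectorizer is an assumed choice; the README does not name the vectorizer.
    vectorizer = CountVectorizer(inputCol="filtered", outputCol="features")
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[tokenizer, remover, vectorizer, classifier])


# Example usage (needs a working Spark installation and a DataFrame `df`
# with a string "text" column and a numeric 0/1 "label" column):
#   spark = create_session()
#   train, test = df.randomSplit([0.8, 0.2], seed=42)
#   model = build_sentiment_pipeline().fit(train)
#   model.transform(test).select("text", "prediction").show(5)
```

Keeping the settings in a plain dict makes them easy to inspect or override before the session is created.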

Installation

Install Python (version 3.8 or higher recommended).
Install the required libraries: PySpark and Matplotlib (see Technologies Used).

Usage

Run the notebook file (Project.ipynb) using Jupyter Notebook or Google Colab.
The notebook processes the dataset and generates visualizations for trends, sentiment, and word frequencies.
View model predictions for sentiment classification.

Example Visualizations

Tweet Volume Over Time: Shows trends in tweet activity across timestamps.
Sentiment Distribution: Bar chart displaying positive and negative sentiment counts.
Top Words: Highlights the most frequently used words in tweets.
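
To make the "Top Words" computation concrete, here is a small plain-Python sketch of the same logic on a handful of strings. The notebook does this at scale with PySpark; the stopword list, the regex tokenizer, and the sample tweets below are illustrative assumptions, not taken from the project.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the notebook's list is not shown).
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "i", "it", "my"}


def top_words(tweets, n=3):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", tweet.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


# Made-up sample tweets for demonstration only.
sample = ["I love the new phone", "The phone is great", "love it"]
print(top_words(sample, n=2))
```

The resulting (word, count) pairs are the kind of data a bar chart of top words would be drawn from.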

Dataset

• The dataset used contains:
  o Sentiment labels (0 for Negative, 4 for Positive).
  o User metadata and original tweet content.

• File: Twitter_Dataset.csv.
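
Because the labels are encoded as 0 and 4, a binary classifier such as logistic regression is usually easier to work with after remapping them to 0 and 1. The sketch below shows one way to do that and to count the sentiment distribution in plain Python. The helper names are hypothetical, and whether the notebook performs this remapping is an assumption.

```python
from collections import Counter

# Label encoding as described in the dataset section.
LABEL_NAMES = {0: "Negative", 4: "Positive"}


def to_binary(label):
    """Map the dataset's 0/4 labels to 0/1; reject anything else."""
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected sentiment label: {label!r}")
    return 1 if label == 4 else 0


def sentiment_counts(labels):
    """Count how many tweets fall under each sentiment name."""
    return Counter(LABEL_NAMES[label] for label in labels)


# Made-up labels for demonstration only.
labels = [0, 4, 4, 0, 4]
print([to_binary(label) for label in labels])
print(sentiment_counts(labels))
```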

Contribution

Feel free to fork this repository and contribute. Submit pull requests for any improvements or new features.

License

This project is licensed under the MIT License. See the LICENSE file for details.

