Hate Speech Detection System

An end-to-end hate speech detection system that combines classical machine learning and transformer-based NLP models, exposed through a real-time FastAPI inference API.

The project emphasizes production-oriented ML workflows, model benchmarking, and deployment readiness, with content moderation as the target use case.

Problem Statement

Online platforms struggle to automatically detect hate speech and offensive language due to:

Informal and noisy text (social media)
Context-dependent language
Class imbalance across hate, offensive, and neutral content

This project addresses the problem by:

Benchmarking multiple NLP models
Leveraging transformer-based contextual embeddings
Deploying a real-time inference API

Dataset

Source: Twitter hate speech dataset
Classes:
- Hate speech
- Offensive language
- No hate or offensive language
Preprocessing: Lowercasing, URL and punctuation removal, stopword removal, stemming

Dataset file: data/twitter.csv

Project Structure

hate-speech-detection/
│
├── data/            # Dataset files
├── notebooks/       # EDA and baseline experiments
├── src/             # Preprocessing, training, evaluation scripts
├── transformers/    # Transformer (DistilBERT) training notebook
├── api/             # FastAPI inference service
├── models/          # Saved ML models and vectorizers
├── README.md
└── requirements.txt

Model Pipeline

1. Text Preprocessing

Implemented a reusable preprocessing pipeline:

Lowercasing
URL and punctuation removal
Stopword removal
Stemming

Used consistently across:

Classical ML models
Transformer models
FastAPI inference API
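The four preprocessing steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in src/: the `STOPWORDS` set and the `naive_stem` function are hypothetical stand-ins for NLTK's English stopword list and a real stemmer, and the function name `preprocess` is assumed.

```python
import re

# Tiny illustrative stopword set (assumption); the project presumably
# uses NLTK's full English stopword list instead.
STOPWORDS = {"a", "an", "and", "are", "is", "the", "to", "of", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def naive_stem(token: str) -> str:
    """Crude suffix stripper, used only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str, stem=naive_stem) -> str:
    """Lowercase, drop URLs and punctuation, remove stopwords, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = NON_ALPHA_RE.sub(" ", text)  # keep only letters and whitespace
    tokens = [stem(t) for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

Passing the stemmer in as a parameter makes it straightforward to swap in a real stemmer while keeping one function shared by training and inference.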

2. Classical Machine Learning Models

Implemented and benchmarked multiple classical NLP models using TF-IDF features:

Logistic Regression (baseline)
Support Vector Machine (SVM)
Random Forest

Best classical performance:

Logistic Regression + TF-IDF
Accuracy: ~89.5%

These models provide:

Fast inference
Low memory usage
Suitability for real-time APIs
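A benchmark of the three classifier families on TF-IDF features might look like the sketch below. It assumes scikit-learn, default `TfidfVectorizer` settings, and `LinearSVC` as the SVM variant; the actual hyperparameters and SVM kernel used in the project are not stated here, and the helper names `build_candidates` and `benchmark` are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_candidates():
    """Return one TF-IDF pipeline per classifier family (settings are assumptions)."""
    return {
        "logistic_regression": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)
        ),
        "svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "random_forest": make_pipeline(
            TfidfVectorizer(),
            RandomForestClassifier(n_estimators=100, random_state=0),
        ),
    }


def benchmark(train_texts, train_labels, test_texts, test_labels):
    """Fit every candidate and report held-out accuracy per model name."""
    scores = {}
    for name, model in build_candidates().items():
        model.fit(train_texts, train_labels)
        scores[name] = model.score(test_texts, test_labels)
    return scores
```

Wrapping the vectorizer and classifier in one pipeline means the vectorizer is fitted only on training text, and the whole object can be saved and loaded as a single artifact.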

3. Transformer-Based Model (HuggingFace)

Implemented a transformer-based classifier using DistilBERT via HuggingFace.

Model: distilbert-base-uncased
Fine-tuned on the hate speech dataset
Used HuggingFace Trainer API

Performance:

Accuracy: 91.5%
Outperformed classical ML baselines
Improved contextual understanding of offensive language

Transformer training is documented in: transformers/bert_training.ipynb
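For orientation, a fine-tuning setup with the HuggingFace Trainer API could be wired up along these lines. This is a sketch under assumptions, not the contents of the notebook: the label order, epoch count, batch size, output directory, and the helper names `compute_metrics` and `build_trainer` are all illustrative, and the datasets passed in are assumed to be already tokenized.

```python
MODEL_NAME = "distilbert-base-uncased"
# Label order is an assumption; it must match the ids used in the dataset.
LABELS = ["Hate speech", "Offensive language", "No hate or offensive language"]


def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), the pair the Trainer passes at evaluation."""
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


def build_trainer(train_dataset, eval_dataset, output_dir="distilbert-out"):
    """Create a Trainer for 3-class fine-tuning (hyperparameters are assumptions)."""
    # Imported lazily so the metric helper can be used without transformers installed.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(LABELS)
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,
        per_device_train_batch_size=16,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
```

Calling `build_trainer(train_ds, eval_ds).train()` would then start fine-tuning, and `.evaluate()` would report the accuracy computed by `compute_metrics`.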

Performance Comparison

Model	Accuracy
TF-IDF + Logistic Regression	~89.5%
TF-IDF + SVM	~88%
TF-IDF + Random Forest	~84%
DistilBERT (Transformer)	91.5%

Real-Time Inference API (FastAPI)

A FastAPI service exposes the trained classical ML model for real-time moderation.

Endpoint

POST /predict

Example Request

{ "text": "Let's unite and kill all the people protesting" }

Example Response

{ "prediction": "Hate speech" }

Why Classical ML for the API?

Faster inference
Lower latency
Suitable for production moderation pipelines

Transformer models are retained for offline analysis and benchmarking.

Technologies Used

Languages: Python
ML & NLP: scikit-learn, NLTK, HuggingFace Transformers
Deep Learning: PyTorch
API: FastAPI, Uvicorn
Deployment: Docker (optional), Render/Railway
Version Control: Git, GitHub

How to Run Locally

Install dependencies

pip install -r requirements.txt

Run the API

uvicorn api.main:app --reload

Open: http://127.0.0.1:8000/docs
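Once the server is running, the /predict endpoint can also be called from Python. The sketch below uses only the standard library; the helper names `build_predict_request` and `predict_remote` are illustrative, and the base URL assumes Uvicorn's default host and port as shown above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # Uvicorn's default local address


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_remote(text: str, base_url: str = BASE_URL) -> str:
    """Send the text to a running API instance and return the predicted label."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as response:
        return json.loads(response.read())["prediction"]
```

With the API started locally, `predict_remote("some text")` returns the label string from the JSON response.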

Key Takeaways

Built a complete NLP pipeline from data to deployment
Benchmarked classical ML vs transformer-based models
Fine-tuned DistilBERT using HuggingFace
Deployed a real-time hate speech detection API using FastAPI

Future Improvements

Deploy transformer model for async batch inference
Add confidence scores and thresholds
Integrate model monitoring and logging
Extend to multilingual hate speech detection
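As one possible shape for the confidence-score item, the sketch below attaches the top class probability to the response and routes low-confidence predictions to a fallback label. Nothing here exists in the project yet: the `decide` function, the `needs_review` label, and the 0.6 threshold are all assumptions, and the probabilities are assumed to come from something like a classifier's `predict_proba`.

```python
from typing import Dict

REVIEW_LABEL = "needs_review"  # hypothetical fallback for uncertain predictions


def decide(probabilities: Dict[str, float], threshold: float = 0.6) -> dict:
    """Pick the most probable label, or defer when confidence is below threshold."""
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    if confidence < threshold:
        return {"prediction": REVIEW_LABEL, "confidence": confidence}
    return {"prediction": label, "confidence": confidence}
```

A threshold like this trades coverage for precision: raising it sends more borderline content to human review instead of labeling it automatically.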
