🕵 AI vs. Human: Abstract Classification System

Overview

This project is a machine-learning classification system that distinguishes human-written from AI-generated academic abstracts. It is tailored to Computer Science texts, in particular Deep Learning, Machine Learning, and Transformers.

With the rise of Large Language Models (LLMs) like Llama 3 and Mistral, distinguishing synthetic text from organic academic writing has become a critical challenge. This project uses classical machine learning algorithms on TF-IDF features to detect AI-generated content, reaching roughly 96% accuracy with the best model (see Performance & Results).

Key Features

Domain Specific: Specialized in technical and academic texts (CS/AI/Tech papers).
Multi-Model Approach: Trains and compares 8 different algorithms (Random Forest, SVM, MLP, etc.) to find the best performer.
End-to-End Pipeline: Automated data scraping, cleaning, preprocessing, training, and evaluation.
Full-Stack Application: Includes a Python-based Backend API and a responsive Frontend for real-time analysis.
Detailed Visualization: Features confusion matrices, word clouds, and feature importance charts.

Dataset & Methodology

1. Data Collection

A custom dataset was curated focusing on English academic abstracts:

Human Data: Scraped from Wikipedia (CS/Tech articles), CNN, and academic repositories using wikipedia.py.
AI Data: Generated using Llama 3 and Mistral-Nemo models via custom scripts (metallama-3.py, mistral-nemo.py).

2. Preprocessing (`data_cleaner.py`)

Removal of special characters, HTML tags, and stop words.
Text normalization.
Vectorization: Utilizing TF-IDF to convert text into numerical feature vectors.
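
The cleaning steps listed above could look roughly like the sketch below. It is not the contents of data_cleaner.py: the function name `clean_text`, the regular expressions, and the tiny stop-word list are assumptions made for illustration, and it relies only on the Python standard library.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to", "we"}

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <p> or <br/>
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything except lowercase letters and whitespace


def clean_text(text: str) -> str:
    """Strip HTML tags, lowercase, drop special characters and stop words."""
    text = TAG_RE.sub(" ", text)        # remove HTML tags first
    text = html.unescape(text)          # turn entities like &amp; into plain characters
    text = text.lower()                 # normalize case
    text = NON_ALPHA_RE.sub(" ", text)  # drop digits, punctuation, and symbols
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "<p>We propose a Transformer-based model &amp; evaluate it on 3 benchmarks.</p>"
    print(clean_text(sample))
    # propose transformer based model evaluate benchmarks
```

The cleaned strings are what a TF-IDF vectorizer would then turn into numerical feature vectors.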

3. Machine Learning Models

The following eight models are trained and serialized as .pkl files in the saved_models/ directory (AdaBoost and Gradient Boosting are counted separately):

✅ Random Forest Classifier
✅ Support Vector Machine (Linear SVM)
✅ Logistic Regression
✅ Neural Network (MLP)
✅ Decision Tree
✅ AdaBoost & Gradient Boosting
✅ Naive Bayes
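
As a rough illustration of how several of these classifiers can be trained on TF-IDF features, compared, and serialized with Joblib, here is a minimal scikit-learn sketch. The toy corpus, the `compare_models` helper, the hyperparameters, and the file name `best_model.pkl` are all hypothetical and are not taken from the project's training scripts. Only four of the eight models are shown to keep it short.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: label 0 = human-written, 1 = AI-generated.
HUMAN = [
    "we tried three optimizers and adam converged fastest on our data",
    "our experiments show modest gains over the baseline on two datasets",
    "the method fails on long sequences and we discuss why",
    "we release code and report results averaged over five seeds",
    "training took two days on a single gpu with mixed precision",
    "ablation results suggest the attention layer matters most",
]
AI = [
    "this paper presents a novel framework that leverages deep learning",
    "the proposed approach demonstrates significant improvements in performance",
    "this study explores the transformative potential of transformer models",
    "our comprehensive framework leverages cutting edge machine learning techniques",
    "the proposed novel method demonstrates robust and scalable performance",
    "this paper explores a comprehensive approach that leverages neural networks",
]
texts = HUMAN + AI
labels = [0] * len(HUMAN) + [1] * len(AI)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=42
)

# Candidate estimators; hyperparameters here are illustrative defaults.
CANDIDATES = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
}


def compare_models(X_tr, X_te, y_tr, y_te):
    """Fit one TF-IDF + classifier pipeline per candidate and score it."""
    results, fitted = {}, {}
    for name, estimator in CANDIDATES.items():
        # Bundling the vectorizer with the classifier keeps them in sync when saved.
        pipeline = make_pipeline(TfidfVectorizer(), clone(estimator))
        pipeline.fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, pipeline.predict(X_te))
        fitted[name] = pipeline
    return results, fitted


if __name__ == "__main__":
    results, fitted = compare_models(X_train, X_test, y_train, y_test)
    best_name = max(results, key=results.get)
    print(results, "best:", best_name)

    # Serialize the best pipeline and reload it, as one might do for saved_models/.
    with tempfile.TemporaryDirectory() as tmp_dir:
        model_path = Path(tmp_dir) / "best_model.pkl"
        joblib.dump(fitted[best_name], model_path)
        reloaded = joblib.load(model_path)
        print(reloaded.predict(["this paper presents a novel approach"]))
```

With a corpus this small the scores are meaningless; the point is the shape of the loop. Saving the whole pipeline, rather than the classifier alone, means the same fitted vocabulary is applied at inference time.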

Performance & Results

Best Model: Random Forest. The figures below are approximate and typical for this dataset structure.

Detailed metrics are available in machine-learning/train-test/visualization-results.

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Random Forest | ~96% | 0.95 | 0.97 | 0.96 |
| Linear SVM | ~94% | 0.93 | 0.94 | 0.93 |
| Logistic Regression | ~92% | 0.91 | 0.92 | 0.91 |
| Naive Bayes | ~88% | 0.88 | 0.90 | 0.89 |
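
For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, recall, and F1 follow from confusion-matrix counts. The counts are hypothetical, chosen only so that the result lines up with the Random Forest row; they are not the project's actual confusion matrix. It also assumes that "AI-generated" is treated as the positive class.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted AI, how many were AI
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual AI, how many were caught
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for 200 test abstracts (100 AI-generated, 100 human-written).
    metrics = classification_metrics(tp=97, fp=5, fn=3, tn=95)
    print({name: round(value, 2) for name, value in metrics.items()})
    # {'accuracy': 0.96, 'precision': 0.95, 'recall': 0.97, 'f1': 0.96}
```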

Tech Stack

Language: Python 3.12, JavaScript
ML Libraries: Scikit-learn, NumPy, Pandas, Joblib
Backend: Custom Python API Server
Frontend: HTML5, CSS3, Vanilla JS
Tools: Beautiful Soup (Scraping), Requests

Project Structure

AI-Human-Detector/
├── app/
│   ├── backend/             # API Server & Services
│   │   ├── APIServer.py
│   │   ├── services/        # Logic & Model Manager
│   │   └── schemas.py
│   └── frontend/            # Web UI
│       ├── index.html
│       ├── js/              # UI & API Logic
│       └── images/          # Charts & Results
├── machine-learning/
│   ├── data-collection-codes/ # Scrapers (Llama, Wiki, etc.)
│   ├── raw-datasets/          # Collected CSVs
│   └── train-test/
│       ├── preprocess/        # Cleaning & Vectorization
│       ├── train/             # Training Scripts
│       ├── saved_models/      # .pkl Models
│       └── visualization-results/ # Charts & Reports
└── docs/                      # Documentation

Installation & Usage

Prerequisites

Python 3.10+
pip

1. Clone the Repository

git clone https://github.com/mo-tunn/humanorai
cd humanorai

2. Install Dependencies

# Install required packages (Generate requirements.txt first if missing)
pip install -r requirements.txt

3. Run the Backend

cd app/backend
python APIServer.py

4. Launch the Frontend

Open app/frontend/index.html in your browser.

Visuals

Word Clouds (AI vs Human)

Model Comparison

Contributing

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request
