CSE847 Machine Learning Course Project, Fall 2024
This project focuses on classifying text as human-generated or machine-generated using the M4GT-Bench dataset, addressing two subtasks:
- Subtask A: Binary classification (human vs. machine-generated text, 2 classes).
- Subtask B: Multi-class classification (identifying specific machine models or human origin, 7 classes).
The project implements and compares three machine learning approaches—DistilBERT (transformer), LSTM (recurrent neural network), and SVM (support vector machine with TF-IDF features)—to evaluate their performance on large-scale text datasets. The goal is to develop robust NLP models for applications like automated documentation analysis in manufacturing.
- Source: M4GT-Bench (https://github.com/mbzuai-nlp/M4GT-Bench, https://arxiv.org/pdf/2402.11175)
- Subtask A: ~152,000 samples (text, label, model), balanced between human (0) and machine-generated (1) text.
- Subtask B: ~152,000 samples, labeled across 7 classes (human and 6 machine models, e.g., GPT-3, LLaMA).
- Preprocessing: Split into train (80%), validation (10%), and test (10%) sets, then subsampled for computational efficiency (10,000 train samples; 1,000 each for validation and test).
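The split-and-subsample step can be sketched as below; the `text` and `label` column names match the dataset, while the random seed and the use of `train_test_split` are assumptions about the implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_subsample(df, seed=42):
    """80/10/10 train/validation/test split, then subsample for speed."""
    train_df, rest = train_test_split(df, test_size=0.2, random_state=seed,
                                      stratify=df["label"])
    val_df, test_df = train_test_split(rest, test_size=0.5, random_state=seed,
                                       stratify=rest["label"])
    # Subsample to the sizes used in the project (10,000 / 1,000 / 1,000).
    train_df = train_df.sample(n=min(10_000, len(train_df)), random_state=seed)
    val_df = val_df.sample(n=min(1_000, len(val_df)), random_state=seed)
    test_df = test_df.sample(n=min(1_000, len(test_df)), random_state=seed)
    return train_df, val_df, test_df
```

Stratifying on `label` keeps the human/machine balance intact in every split before subsampling.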
The project implements three models per subtask, using Python, PyTorch, Keras, and Scikit-learn:
- DistilBERT (TaskA_DistilBERT.ipynb, TaskB_DistilBERT.ipynb):
  - Fine-tuned DistilBERT (`distilbert-base-uncased`) using HuggingFace Transformers.
  - Tokenized text with padding and truncation (max length 512).
  - Trained with the Trainer API for 3 epochs with cross-entropy loss and an accuracy metric.
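A minimal fine-tuning sketch using the HuggingFace Trainer API. The model name, max length, and epoch count come from the description above; the batch size, output directory, and the `train_ds`/`val_ds` names (tokenized `datasets.Dataset` objects with `text` and `label` columns) are assumptions.

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 512  # padding/truncation length used in the notebooks

def accuracy(eval_pred):
    # Simple accuracy metric for the Trainer's evaluation loop.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def build_trainer(train_ds, val_ds, num_labels=2):
    # num_labels=2 for Subtask A; 7 for Subtask B.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize(batch):
        # Pad/truncate every example to a fixed length of 512 tokens.
        return tokenizer(batch["text"], padding="max_length",
                         truncation=True, max_length=MAX_LEN)

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=16)  # batch size assumed
    return Trainer(model=model, args=args,
                   train_dataset=train_ds.map(tokenize, batched=True),
                   eval_dataset=val_ds.map(tokenize, batched=True),
                   compute_metrics=accuracy)
```

Calling `build_trainer(train_ds, val_ds).train()` then runs the 3-epoch fine-tuning; `AutoModelForSequenceClassification` supplies the cross-entropy loss internally.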
- LSTM (TaskA_LSTM.ipynb, TaskB_LSTM.ipynb):
  - Built a sequential Keras model with an Embedding layer (128 dimensions), an LSTM layer (100 units), and Dense layers.
  - Preprocessed text with TextVectorization (10,000-word vocabulary, 200-token max length).
  - Trained for 5 epochs (Subtask A) or 10 epochs (Subtask B) with batch size 64 and the Adam optimizer.
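The LSTM architecture above can be sketched as follows. The vocabulary size, sequence length, embedding dimension, unit count, and Adam optimizer come from the description; the output activations and losses (sigmoid for the binary task, softmax for 7 classes) are assumptions consistent with the class counts.

```python
import keras
from keras import layers

VOCAB_SIZE = 10_000  # TextVectorization vocabulary size
MAX_LEN = 200        # sequence length after vectorization

def build_vectorizer():
    # Maps raw strings to fixed-length integer token sequences.
    return layers.TextVectorization(max_tokens=VOCAB_SIZE,
                                    output_sequence_length=MAX_LEN)

def build_lstm(num_classes=2):
    # Subtask A: single sigmoid unit; Subtask B: softmax over 7 classes.
    if num_classes == 2:
        out = layers.Dense(1, activation="sigmoid")
        loss = "binary_crossentropy"
    else:
        out = layers.Dense(num_classes, activation="softmax")
        loss = "sparse_categorical_crossentropy"
    model = keras.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, 128),  # 128-dimensional embeddings
        layers.LSTM(100),                   # 100 LSTM units
        out,
    ])
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```

Training would then be `model.fit(x, y, epochs=5, batch_size=64)` for Subtask A (10 epochs for Subtask B).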
- SVM with TF-IDF (TaskA_SVM.ipynb, TaskB_SVM.ipynb):
  - Used Scikit-learn's SVM with a linear kernel on TF-IDF features.
  - Preprocessed text with CountVectorizer (3,000 max features, min_df=2, max_df=0.7) followed by TfidfTransformer.
  - Evaluated with 3-fold cross-validation for robust performance estimates.
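The SVM setup can be sketched as a Scikit-learn pipeline with the parameters listed above. The toy corpus is illustrative only (its vocabulary is chosen so terms survive `min_df=2`); real runs use the M4GT-Bench `text`/`label` columns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Pipeline mirroring the notebooks: raw counts -> TF-IDF -> linear-kernel SVM.
clf = Pipeline([
    ("counts", CountVectorizer(max_features=3000, min_df=2, max_df=0.7)),
    ("tfidf", TfidfTransformer()),
    ("svm", SVC(kernel="linear")),
])

# Tiny illustrative corpus: 0 = human, 1 = machine-generated.
texts = ["people write short notes", "people write casual notes",
         "people write messy notes", "humans write quick notes",
         "humans write casual memos", "humans write short memos",
         "model generates fluent output", "model generates verbose output",
         "model generates fluent text", "system generates verbose text",
         "system generates fluent prose", "system generates verbose prose"]
labels = [0] * 6 + [1] * 6

scores = cross_val_score(clf, texts, labels, cv=3)  # 3-fold CV as in the project
```

`cross_val_score` handles the stratified 3-fold splitting and refits the whole pipeline per fold, so the TF-IDF statistics never leak from test folds into training.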
- Subtask A (Binary Classification):
  - DistilBERT: 94.1% test accuracy, eval loss 0.341.
  - LSTM: 88.61% test accuracy, loss 0.145.
  - SVM (TF-IDF): 88.78% accuracy, 90.24% F1 score (3-fold CV).
  - DistilBERT outperformed the LSTM and SVM, as its transformer architecture captures more complex text patterns.
- Subtask B (Multi-Class Classification):
  - DistilBERT: 87.3% test accuracy, eval loss 0.576.
  - LSTM: 80.39% test accuracy, loss 0.041.
  - SVM (TF-IDF): 79.99% accuracy (3-fold CV).
  - DistilBERT again led, though the added multi-class complexity lowered performance relative to Subtask A.
- Python 3.8+
- Libraries: `transformers`, `datasets`, `evaluate`, `pandas`, `numpy`, `keras`, `nltk`, `gensim`, `scikit-learn`, `tqdm`
- Install: `pip install -r requirements.txt` (create a `requirements.txt` with the listed dependencies)
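A `requirements.txt` matching the listed libraries might look like the following (unpinned; pin versions as needed for reproducibility):

```text
transformers
datasets
evaluate
pandas
numpy
keras
nltk
gensim
scikit-learn
tqdm
```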
- Download the dataset:
  - Obtain `SubtaskA.jsonl` and `SubtaskB.jsonl` from https://github.com/mbzuai-nlp/M4GT-Bench.
  - Place them in the project directory or update `DATA_PATH` in the scripts.
- Run the notebooks:
  - `TaskA_DistilBERT.ipynb`: Fine-tune DistilBERT for binary classification.
  - `TaskA_LSTM.ipynb`: Train the LSTM for binary classification.
  - `TaskA_SVM.ipynb`: Run SVM with TF-IDF for binary classification.
  - `TaskB_DistilBERT.ipynb`: Fine-tune DistilBERT for multi-class classification.
  - `TaskB_LSTM.ipynb`: Train the LSTM for multi-class classification.
  - `TaskB_SVM.ipynb`: Run SVM with TF-IDF for multi-class classification.
- Results:
  - Outputs include accuracy, loss, and (for the SVM) F1 scores; check the notebook outputs for detailed metrics.