Javen-W/CSE847-MGT-Classifier

M4GT-Bench: Human vs. Machine-Generated Text Classification

CSE847 Machine Learning Course Project, Fall 2024

Overview

This project focuses on classifying text as human-generated or machine-generated using the M4GT-Bench dataset, addressing two subtasks:

  • Subtask A: Binary classification (human vs. machine-generated text, 2 classes).
  • Subtask B: Multi-class classification (identifying specific machine models or human origin, 7 classes).

The project implements and compares three machine learning approaches—DistilBERT (transformer), LSTM (recurrent neural network), and SVM (support vector machine with TF-IDF features)—to evaluate their performance on large-scale text datasets. The goal is to develop robust NLP models for applications like automated documentation analysis in manufacturing.

Dataset

  • Source: M4GT-Bench (https://github.com/mbzuai-nlp/M4GT-Bench, https://arxiv.org/pdf/2402.11175)
  • Subtask A: ~152,000 samples (text, label, model), balanced between human (0) and machine-generated (1) text.
  • Subtask B: ~152,000 samples, labeled across 7 classes (human and 6 machine models, e.g., GPT-3, LLaMA).
  • Preprocessing: Split into train (80%), validation (10%), and test (10%) sets, subsampled to 10,000 train and 1,000 validation/test examples for computational efficiency.
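The split and subsampling described above can be sketched as follows (the `split_and_subsample` helper, column names, and seed are illustrative assumptions, not the notebooks' exact code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_subsample(df, seed=42):
    """80/10/10 train/validation/test split, then subsample for speed."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    # Subsample to the sizes used in the notebooks (10k train, 1k val/test).
    train = train.sample(n=min(10_000, len(train)), random_state=seed)
    val = val.sample(n=min(1_000, len(val)), random_state=seed)
    test = test.sample(n=min(1_000, len(test)), random_state=seed)
    return train, val, test

# Toy frame with 100 rows in place of the ~152k M4GT-Bench samples:
df = pd.DataFrame({"text": [f"sample {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})
train, val, test = split_and_subsample(df)
print(len(train), len(val), len(test))  # 80 10 10
```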

Methodology

The project implements three models per subtask, using Python, PyTorch, Keras, and Scikit-learn:

  1. DistilBERT (TaskA_DistilBERT.ipynb, TaskB_DistilBERT.ipynb):

    • Fine-tuned DistilBERT (distilbert-base-uncased) using HuggingFace Transformers.
    • Tokenized text with padding and truncation (max length 512).
    • Trained with Trainer API, 3 epochs, cross-entropy loss, and accuracy metric.
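A minimal sketch of that setup, assuming a HuggingFace `Dataset` with `text`/`label` columns (the numpy-based accuracy metric and the `output_dir` name are illustrative, not the notebooks' exact code):

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"
MAX_LENGTH = 512  # pad/truncate every example, as in the notebooks

def tokenize_batch(batch, tokenizer):
    # Applied with Dataset.map(..., batched=True) before training.
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=MAX_LENGTH)

def compute_metrics(eval_pred):
    # Accuracy over argmax predictions, matching the reported metric.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Fine-tuning (downloads the pretrained checkpoint; num_labels=7 for Subtask B):
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
# args = TrainingArguments(output_dir="distilbert-mgt", num_train_epochs=3)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=val_ds, compute_metrics=compute_metrics)
# trainer.train()
```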
  2. LSTM (TaskA_LSTM.ipynb, TaskB_LSTM.ipynb):

    • Built a sequential Keras model with Embedding (128D), LSTM (100 units), and Dense layers.
    • Preprocessed text using TextVectorization (10,000 vocab size, 200 max length).
    • Trained for 5 epochs (Subtask A) or 10 epochs (Subtask B), batch size 64, Adam optimizer.
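The Keras side can be sketched as below with a binary Subtask A head; the vocabulary size, sequence length, and layer widths follow the bullet points, while the toy corpus and single training epoch are stand-ins:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN = 10_000, 200

# Text -> padded integer sequences, as described above.
vectorizer = layers.TextVectorization(max_tokens=VOCAB_SIZE,
                                      output_sequence_length=SEQ_LEN)
texts = ["a human wrote this sentence", "a model generated this sentence"]
vectorizer.adapt(texts)

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),      # 128-dimensional embeddings
    layers.LSTM(100),                       # 100 recurrent units
    layers.Dense(1, activation="sigmoid"),  # binary head (Subtask A)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

x = vectorizer(tf.constant(texts))
y = np.array([0, 1])
model.fit(x, y, epochs=1, batch_size=64, verbose=0)  # notebooks: 5-10 epochs
preds = model.predict(x, verbose=0)
```

For Subtask B, the final layer would become `Dense(7, activation="softmax")` with a sparse categorical cross-entropy loss.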
  3. SVM with TF-IDF (TaskA_SVM.ipynb, TaskB_SVM.ipynb):

    • Used Scikit-learn’s SVM with linear kernel and TF-IDF features.
    • Preprocessed text with CountVectorizer (3,000 max features, min_df=2, max_df=0.7) and TfidfTransformer.
    • Evaluated with 3-fold cross-validation for robust performance metrics.
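A sketch of the TF-IDF + linear SVM pipeline on toy data; the vectorizer settings match the bullet points, while the toy corpus is invented and `LinearSVC` stands in for whichever linear-kernel SVM class the notebooks use:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ("counts", CountVectorizer(max_features=3000, min_df=2, max_df=0.7)),
    ("tfidf", TfidfTransformer()),
    ("svm", LinearSVC()),
])

# Toy corpus standing in for M4GT-Bench text (0 = human, 1 = machine).
texts = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird sat on the fence",
    "as an ai language model i cannot say",
    "as a language model i generate text",
    "i am an ai model trained on text",
]
labels = np.array([0, 0, 0, 1, 1, 1])

scores = cross_val_score(pipe, texts, labels, cv=3)  # 3-fold CV, as above
pipe.fit(texts, labels)
```

`cross_val_score` returns one accuracy per fold; passing `scoring="f1"` would yield the F1 scores reported in Results.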

Results

  • Subtask A (Binary Classification):

    • DistilBERT: 94.1% test accuracy, eval loss 0.341.
    • LSTM: 88.61% test accuracy, loss 0.145.
    • SVM (TF-IDF): 88.78% accuracy, 90.24% F1 score (3-fold CV).
    • DistilBERT outperformed the LSTM and SVM, likely because its pretrained transformer representations capture contextual patterns that TF-IDF features and a small LSTM miss.
  • Subtask B (Multi-Class Classification):

    • DistilBERT: 87.3% test accuracy, eval loss 0.576.
    • LSTM: 80.39% test accuracy, loss 0.041.
    • SVM (TF-IDF): 79.99% accuracy (3-fold CV).
    • DistilBERT again led, but multi-class complexity reduced performance compared to Subtask A.

Dependencies

  • Python 3.8+
  • Libraries: transformers, datasets, evaluate, pandas, numpy, keras, nltk, gensim, scikit-learn, tqdm
  • Install: pip install -r requirements.txt (create requirements.txt from the libraries listed above)
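For reference, a requirements.txt matching the listed libraries might look like the following (unpinned; pin versions as needed, and note that PyTorch and a Keras backend such as TensorFlow must also be installed per the Methodology section):

```
transformers
datasets
evaluate
pandas
numpy
keras
nltk
gensim
scikit-learn
tqdm
```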

Usage

  1. Download Dataset:

    • Download the M4GT-Bench dataset from https://github.com/mbzuai-nlp/M4GT-Bench.
  2. Run Notebooks:

    • TaskA_DistilBERT.ipynb: Fine-tune DistilBERT for binary classification.
    • TaskA_LSTM.ipynb: Train LSTM for binary classification.
    • TaskA_SVM.ipynb: Run SVM with TF-IDF for binary classification.
    • TaskB_DistilBERT.ipynb: Fine-tune DistilBERT for multi-class classification.
    • TaskB_LSTM.ipynb: Train LSTM for multi-class classification.
    • TaskB_SVM.ipynb: Run SVM with TF-IDF for multi-class classification.
  3. Results:

    • Outputs include accuracy, loss, and (for the SVM) F1 scores; see each notebook's output cells for detailed metrics.
