CSE847 Machine Learning Course Project, Fall 2024
This project focuses on classifying text as human-generated or machine-generated using the M4GT-Bench dataset, addressing two subtasks:
- Subtask A: Binary classification (human vs. machine-generated text, 2 classes).
- Subtask B: Multi-class classification (identifying specific machine models or human origin, 7 classes).
The project implements and compares three machine learning approaches—DistilBERT (transformer), LSTM (recurrent neural network), and SVM (support vector machine with TF-IDF features)—to evaluate their performance on large-scale text datasets. The goal is to develop robust NLP models for applications like automated documentation analysis in manufacturing.
- Source: M4GT-Bench (https://github.com/mbzuai-nlp/M4GT-Bench, https://arxiv.org/pdf/2402.11175)
- Subtask A: ~152,000 samples (text, label, model), balanced between human (0) and machine-generated (1) text.
- Subtask B: ~152,000 samples, labeled across 7 classes (human and 6 machine models, e.g., GPT-3, LLaMA).
- Preprocessing: Split into train (80%), validation (10%), and test (10%) sets, then subsampled for computational efficiency (10,000 train samples; 1,000 each for validation and test).
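The split-and-subsample step can be sketched as below; the `text` and `label` column names match the dataset, while the random seed and the use of `train_test_split` are assumptions about the implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_subsample(df, seed=42):
    """80/10/10 train/validation/test split, then subsample for speed."""
    train_df, rest = train_test_split(df, test_size=0.2, random_state=seed,
                                      stratify=df["label"])
    val_df, test_df = train_test_split(rest, test_size=0.5, random_state=seed,
                                       stratify=rest["label"])
    # Subsample to the sizes used in the project (10,000 / 1,000 / 1,000).
    train_df = train_df.sample(n=min(10_000, len(train_df)), random_state=seed)
    val_df = val_df.sample(n=min(1_000, len(val_df)), random_state=seed)
    test_df = test_df.sample(n=min(1_000, len(test_df)), random_state=seed)
    return train_df, val_df, test_df
```

Stratifying on `label` keeps the human/machine balance intact in every split before subsampling.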
The project implements three models per subtask, using Python, PyTorch, Keras, and Scikit-learn:
- DistilBERT (TaskA_DistilBERT.ipynb, TaskB_DistilBERT.ipynb):
  - Fine-tuned DistilBERT (`distilbert-base-uncased`) using HuggingFace Transformers.
  - Tokenized text with padding and truncation (max length 512).
  - Trained with the Trainer API for 3 epochs with cross-entropy loss and an accuracy metric.
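A minimal fine-tuning sketch using the HuggingFace Trainer API. The model name, max length, and epoch count come from the description above; the batch size, output directory, and the `train_ds`/`val_ds` names (tokenized `datasets.Dataset` objects with `text` and `label` columns) are assumptions.

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"
MAX_LEN = 512  # padding/truncation length used in the notebooks

def accuracy(eval_pred):
    # Simple accuracy metric for the Trainer's evaluation loop.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def build_trainer(train_ds, val_ds, num_labels=2):
    # num_labels=2 for Subtask A; 7 for Subtask B.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize(batch):
        # Pad/truncate every example to a fixed length of 512 tokens.
        return tokenizer(batch["text"], padding="max_length",
                         truncation=True, max_length=MAX_LEN)

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=16)  # batch size assumed
    return Trainer(model=model, args=args,
                   train_dataset=train_ds.map(tokenize, batched=True),
                   eval_dataset=val_ds.map(tokenize, batched=True),
                   compute_metrics=accuracy)
```

Calling `build_trainer(train_ds, val_ds).train()` then runs the 3-epoch fine-tuning; `AutoModelForSequenceClassification` supplies the cross-entropy loss internally.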
- LSTM (TaskA_LSTM.ipynb, TaskB_LSTM.ipynb):
  - Built a sequential Keras model with an Embedding layer (128 dimensions), an LSTM layer (100 units), and Dense layers.
  - Preprocessed text with TextVectorization (10,000-word vocabulary, 200-token max length).
  - Trained for 5 epochs (Subtask A) or 10 epochs (Subtask B) with batch size 64 and the Adam optimizer.
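The LSTM architecture above can be sketched as follows. The vocabulary size, sequence length, embedding dimension, unit count, and Adam optimizer come from the description; the output activations and losses (sigmoid for the binary task, softmax for 7 classes) are assumptions consistent with the class counts.

```python
import keras
from keras import layers

VOCAB_SIZE = 10_000  # TextVectorization vocabulary size
MAX_LEN = 200        # sequence length after vectorization

def build_vectorizer():
    # Maps raw strings to fixed-length integer token sequences.
    return layers.TextVectorization(max_tokens=VOCAB_SIZE,
                                    output_sequence_length=MAX_LEN)

def build_lstm(num_classes=2):
    # Subtask A: single sigmoid unit; Subtask B: softmax over 7 classes.
    if num_classes == 2:
        out = layers.Dense(1, activation="sigmoid")
        loss = "binary_crossentropy"
    else:
        out = layers.Dense(num_classes, activation="softmax")
        loss = "sparse_categorical_crossentropy"
    model = keras.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, 128),  # 128-dimensional embeddings
        layers.LSTM(100),                   # 100 LSTM units
        out,
    ])
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```

Training would then be `model.fit(x, y, epochs=5, batch_size=64)` for Subtask A (10 epochs for Subtask B).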
- SVM with TF-IDF (TaskA_SVM.ipynb, TaskB_SVM.ipynb):
  - Used Scikit-learn's SVM with a linear kernel on TF-IDF features.
  - Preprocessed text with CountVectorizer (3,000 max features, min_df=2, max_df=0.7) followed by TfidfTransformer.
  - Evaluated with 3-fold cross-validation for robust performance estimates.
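The SVM setup can be sketched as a Scikit-learn pipeline with the parameters listed above. The toy corpus is illustrative only (its vocabulary is chosen so terms survive `min_df=2`); real runs use the M4GT-Bench `text`/`label` columns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Pipeline mirroring the notebooks: raw counts -> TF-IDF -> linear-kernel SVM.
clf = Pipeline([
    ("counts", CountVectorizer(max_features=3000, min_df=2, max_df=0.7)),
    ("tfidf", TfidfTransformer()),
    ("svm", SVC(kernel="linear")),
])

# Tiny illustrative corpus: 0 = human, 1 = machine-generated.
texts = ["people write short notes", "people write casual notes",
         "people write messy notes", "humans write quick notes",
         "humans write casual memos", "humans write short memos",
         "model generates fluent output", "model generates verbose output",
         "model generates fluent text", "system generates verbose text",
         "system generates fluent prose", "system generates verbose prose"]
labels = [0] * 6 + [1] * 6

scores = cross_val_score(clf, texts, labels, cv=3)  # 3-fold CV as in the project
```

`cross_val_score` handles the stratified 3-fold splitting and refits the whole pipeline per fold, so the TF-IDF statistics never leak from test folds into training.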
- Subtask A (Binary Classification):
  - DistilBERT: 94.1% test accuracy, eval loss 0.341.
  - LSTM: 88.61% test accuracy, loss 0.145.
  - SVM (TF-IDF): 88.78% accuracy, 90.24% F1 score (3-fold CV).
  - DistilBERT outperformed the LSTM and SVM, as its transformer architecture captures more complex text patterns.
- Subtask B (Multi-Class Classification):
  - DistilBERT: 87.3% test accuracy, eval loss 0.576.
  - LSTM: 80.39% test accuracy, loss 0.041.
  - SVM (TF-IDF): 79.99% accuracy (3-fold CV).
  - DistilBERT again led, though the added multi-class complexity lowered performance relative to Subtask A.
- Python 3.8+
- Libraries: `transformers`, `datasets`, `evaluate`, `pandas`, `numpy`, `keras`, `nltk`, `gensim`, `scikit-learn`, `tqdm`
- Install: `pip install -r requirements.txt` (create a `requirements.txt` with the listed dependencies)
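A `requirements.txt` matching the listed libraries might look like the following (unpinned; pin versions as needed for reproducibility):

```text
transformers
datasets
evaluate
pandas
numpy
keras
nltk
gensim
scikit-learn
tqdm
```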
- Download the dataset:
  - Obtain `SubtaskA.jsonl` and `SubtaskB.jsonl` from https://github.com/mbzuai-nlp/M4GT-Bench.
  - Place them in the project directory or update `DATA_PATH` in the scripts.
- Run the notebooks:
  - `TaskA_DistilBERT.ipynb`: Fine-tune DistilBERT for binary classification.
  - `TaskA_LSTM.ipynb`: Train the LSTM for binary classification.
  - `TaskA_SVM.ipynb`: Run SVM with TF-IDF for binary classification.
  - `TaskB_DistilBERT.ipynb`: Fine-tune DistilBERT for multi-class classification.
  - `TaskB_LSTM.ipynb`: Train the LSTM for multi-class classification.
  - `TaskB_SVM.ipynb`: Run SVM with TF-IDF for multi-class classification.
- Results:
  - Outputs include accuracy, loss, and (for the SVM) F1 scores; check the notebook outputs for detailed metrics.