Phonepyaeaung/msc-data-science-projects

True Colors in the Comments: Multi-Platform NLP Analysis

Project Overview

A multi-task learning approach to detecting toxicity, empathy, and cognitive maturity in social media comments from YouTube, Twitter, and Reddit using DistilBERT.

Research Questions

  1. Can a compact transformer model (DistilBERT) predict toxicity, empathy, and cognitive maturity simultaneously?
  2. Does multi-task learning outperform a TF-IDF + Logistic Regression baseline?
  3. How do the model's predictions compare with GPT-3.5?

Dataset

  • Total: 450 manually annotated comments
  • Sources: YouTube (150), Reddit (150), Twitter (150)
  • Labels:
    • Toxicity: 3-class (non-toxic, neutral, toxic)
    • Empathy: 1-5 Likert scale
    • Maturity: 1-5 Likert scale
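
The label scheme above can be written down as a small validated record. This is a minimal sketch only: the class name, field names, and lowercase platform identifiers are hypothetical and are not taken from truecolortest.ipynb.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the three sources and the three toxicity classes.
PLATFORMS = ("youtube", "reddit", "twitter")
TOXICITY_LABELS = ("non-toxic", "neutral", "toxic")


@dataclass(frozen=True)
class AnnotatedComment:
    """One manually annotated comment (field names are assumptions)."""

    text: str
    platform: str
    toxicity: str  # one of TOXICITY_LABELS
    empathy: int   # 1-5 Likert rating
    maturity: int  # 1-5 Likert rating

    def __post_init__(self):
        if self.platform not in PLATFORMS:
            raise ValueError(f"unknown platform: {self.platform!r}")
        if self.toxicity not in TOXICITY_LABELS:
            raise ValueError(f"unknown toxicity label: {self.toxicity!r}")
        for name in ("empathy", "maturity"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    @property
    def toxicity_id(self) -> int:
        """Integer class index suitable as a classification target."""
        return TOXICITY_LABELS.index(self.toxicity)
```

Validating at construction time is one way to catch annotation typos (for example an empathy score of 6) before they reach training.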

Key Findings

  • DistilBERT outperformed the TF-IDF + Logistic Regression baseline in toxicity classification
  • Non-toxic comments had higher empathy and maturity scores
  • 50% agreement with GPT-3.5 on toxicity labels
  • Platform differences: YouTube comments skewed non-toxic, Twitter comments skewed toxic, and Reddit was roughly balanced
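
For context on the baseline comparison, a TF-IDF + Logistic Regression classifier can be assembled with scikit-learn roughly as follows. This is a sketch under assumptions: the n-gram range, `max_iter`, and `class_weight="balanced"` are illustrative choices, not the settings used in the notebook, and the toy comments are invented for the example and do not come from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline():
    """TF-IDF features followed by multinomial logistic regression."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),  # assumed settings
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )


# Invented toy comments, only to show the fit/predict flow.
toy_texts = [
    "thank you for sharing this, really helpful",
    "great video, I learned a lot",
    "the release is scheduled for next week",
    "this thread is about the new update",
    "you are an idiot and nobody likes you",
    "shut up, what a stupid take",
]
toy_labels = ["non-toxic", "non-toxic", "neutral", "neutral", "toxic", "toxic"]

baseline = build_baseline().fit(toy_texts, toy_labels)
predicted = baseline.predict(["thanks, this was helpful"])
```

Setting `class_weight="balanced"` is one common way to compensate for uneven class frequencies in a small dataset; whether the original baseline did this is not stated here.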

Methodology

  • Model: Fine-tuned DistilBERT with multi-task heads
  • Architecture: Shared encoder + 3 task-specific outputs
  • Loss Function: Weighted CrossEntropy (toxicity) + MSE (empathy, maturity)
  • Training: AdamW optimizer, 13 epochs, batch size 16
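
The shared-encoder design and combined loss described above can be sketched in PyTorch as follows. To keep the example self-contained, the DistilBERT encoder is replaced by a random stand-in tensor with DistilBERT's hidden size (768); in the project, that vector would presumably come from the fine-tuned encoder. The dropout rate, class weights, learning rate, and equal weighting of the loss terms are assumptions, as is the reading of "weighted" as per-class weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_SIZE = 768  # hidden size of DistilBERT-base


class MultiTaskHeads(nn.Module):
    """Three task-specific output layers on a shared sentence embedding."""

    def __init__(self, hidden_size=HIDDEN_SIZE, num_toxicity_classes=3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)  # assumed rate
        self.toxicity = nn.Linear(hidden_size, num_toxicity_classes)
        self.empathy = nn.Linear(hidden_size, 1)
        self.maturity = nn.Linear(hidden_size, 1)

    def forward(self, pooled):
        x = self.dropout(pooled)
        return {
            "toxicity": self.toxicity(x),             # class logits
            "empathy": self.empathy(x).squeeze(-1),   # regression score
            "maturity": self.maturity(x).squeeze(-1), # regression score
        }


def multitask_loss(outputs, targets, class_weights, regression_weight=1.0):
    """Class-weighted cross-entropy for toxicity plus MSE for both scores."""
    ce = F.cross_entropy(outputs["toxicity"], targets["toxicity"],
                         weight=class_weights)
    mse_empathy = F.mse_loss(outputs["empathy"], targets["empathy"])
    mse_maturity = F.mse_loss(outputs["maturity"], targets["maturity"])
    return ce + regression_weight * (mse_empathy + mse_maturity)


# One illustrative optimisation step with batch size 16.
torch.manual_seed(0)
heads = MultiTaskHeads()
optimizer = torch.optim.AdamW(heads.parameters(), lr=2e-5)  # assumed lr

pooled = torch.randn(16, HIDDEN_SIZE)  # stand-in for the encoder output
targets = {
    "toxicity": torch.randint(0, 3, (16,)),
    "empathy": torch.randint(1, 6, (16,)).float(),
    "maturity": torch.randint(1, 6, (16,)).float(),
}
class_weights = torch.tensor([1.0, 1.0, 2.0])  # illustrative values

optimizer.zero_grad()
loss = multitask_loss(heads(pooled), targets, class_weights)
loss.backward()
optimizer.step()
```

Because the three heads share one representation, gradients from all three losses update the same encoder, which is the usual motivation for multi-task learning on a small dataset.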

Model Performance

Metric                    Value
Toxicity Accuracy         68.8%
Toxicity F1 (macro)       0.56
Toxic Class AUC           0.68
Agreement with GPT-3.5    50%
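
For readers unfamiliar with these metrics, the sketch below shows how accuracy, macro-averaged F1, and simple percent agreement are computed from label lists. The function names and the tiny example labels are illustrative and are not the project's data or code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences match."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Percent agreement between two annotators (or two models) is the same
# computation as accuracy, with neither side treated as ground truth.
agreement = accuracy

# Illustrative labels: 0 = non-toxic, 1 = neutral, 2 = toxic.
example_true = [0, 0, 1, 2]
example_pred = [0, 1, 1, 2]
example_accuracy = accuracy(example_true, example_pred)
example_macro_f1 = macro_f1(example_true, example_pred, labels=[0, 1, 2])
```

Macro F1 weights every class equally, so weak performance on a rare class lowers it more than it lowers accuracy. That is one plausible reason a macro F1 can sit well below the accuracy figure on an imbalanced dataset.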

Files

  • truecolortest.ipynb - Full implementation and evaluation
  • Research_Project.pdf - Full academic report
  • results/ - Confusion matrices, ROC curves, and visualizations

Technologies

  • Python 3.10
  • PyTorch
  • Hugging Face Transformers (DistilBERT)
  • Scikit-learn
  • Pandas, NumPy, Matplotlib

How to Run

  1. Install the dependencies listed under Technologies
  2. Open truecolortest.ipynb in Jupyter and run all cells
  3. View the output in the results/ folder

Limitations

  • Small dataset (450 samples) limits generalization
  • Class imbalance across platforms
  • No multilingual support
  • Manual annotation introduces subjectivity

Future Work

  • Expand dataset with balanced platform representation
  • Multilingual and cross-cultural analysis
  • Human-in-the-loop refinement
  • Real-time deployment for moderation systems

Author

Phone Pyae Aung - MSc Data Science, University of Exeter

References

See Research_Project.pdf for full references
