A multi-task learning approach to detecting toxicity, empathy, and cognitive maturity in social media comments from YouTube, Twitter, and Reddit using DistilBERT.
- Can a compact transformer model (DistilBERT) predict toxicity, empathy, and cognitive maturity simultaneously?
- Does multi-task learning outperform a TF-IDF + Logistic Regression baseline?
- How do the model's predictions compare with GPT-3.5?
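The TF-IDF + Logistic Regression baseline referenced above can be sketched as a scikit-learn pipeline. The comments and labels below are invented stand-ins for illustration, not the project's annotated data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative stand-in comments; the real dataset has 450 annotated comments.
comments = [
    "I really appreciate your perspective, thanks for sharing.",
    "This is the dumbest take I have ever read.",
    "Interesting point, though I see it differently.",
    "Nobody asked for your opinion, go away.",
]
labels = ["non-toxic", "toxic", "neutral", "toxic"]  # 3-class toxicity

# TF-IDF features feeding a logistic-regression classifier.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(comments, labels)
preds = baseline.predict(comments)
```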
- Total: 450 manually annotated comments
- Sources: YouTube (150), Reddit (150), Twitter (150)
- Labels:
- Toxicity: 3-class (non-toxic, neutral, toxic)
- Empathy: 1-5 Likert scale
- Maturity: 1-5 Likert scale
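One way to picture the annotation schema is a table with one row per comment. The rows below are hypothetical examples, not actual annotations from the dataset:

```python
import pandas as pd

# Hypothetical rows illustrating the label schema (not real annotations).
df = pd.DataFrame(
    {
        "platform": ["YouTube", "Twitter", "Reddit"],
        "comment": [
            "Great video, this really helped me understand the topic.",
            "You clearly have no idea what you're talking about.",
            "I disagree, but I see where you're coming from.",
        ],
        "toxicity": ["non-toxic", "toxic", "neutral"],  # 3-class label
        "empathy": [4, 1, 3],    # 1-5 Likert scale
        "maturity": [5, 2, 4],   # 1-5 Likert scale
    }
)
```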
- DistilBERT outperformed TF-IDF baseline in toxicity classification
- Non-toxic comments had higher empathy and maturity scores
- 50% agreement with GPT-3.5 on toxicity labels
- Platform differences: YouTube comments skewed non-toxic, Twitter skewed toxic, Reddit was roughly balanced
- Model: Fine-tuned DistilBERT with multi-task heads
- Architecture: Shared encoder + 3 task-specific outputs
- Loss Function: Weighted CrossEntropy (toxicity) + MSE (empathy, maturity)
- Training: AdamW optimizer, 13 epochs, batch size 16
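The shared-encoder architecture and combined loss described above can be sketched in PyTorch. The real model uses DistilBERT as the encoder; here a tiny stand-in encoder is substituted so the sketch runs without downloading weights, and the class weights are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder + three task-specific heads.

    `encoder` is any module mapping token ids to a (batch, hidden) vector;
    in the project this is DistilBERT, but a tiny stand-in is used below.
    """

    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.toxicity_head = nn.Linear(hidden_size, 3)  # 3-class logits
        self.empathy_head = nn.Linear(hidden_size, 1)   # 1-5 regression
        self.maturity_head = nn.Linear(hidden_size, 1)  # 1-5 regression

    def forward(self, input_ids):
        h = self.encoder(input_ids)
        return (self.toxicity_head(h),
                self.empathy_head(h).squeeze(-1),
                self.maturity_head(h).squeeze(-1))

# Stand-in encoder: embedding + mean pooling (DistilBERT in the real model).
class TinyEncoder(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
    def forward(self, ids):
        return self.emb(ids).mean(dim=1)

model = MultiTaskModel(TinyEncoder(), hidden_size=32)
ids = torch.randint(0, 100, (4, 12))           # batch of 4 token sequences
tox_logits, emp_pred, mat_pred = model(ids)

# Combined loss: weighted cross-entropy for toxicity + MSE for the two scales.
tox_y = torch.tensor([0, 2, 1, 2])
emp_y = torch.tensor([4.0, 1.0, 3.0, 2.0])
mat_y = torch.tensor([5.0, 2.0, 4.0, 2.0])
class_weights = torch.tensor([1.0, 1.5, 2.0])  # illustrative, not the paper's
loss = (nn.CrossEntropyLoss(weight=class_weights)(tox_logits, tox_y)
        + nn.MSELoss()(emp_pred, emp_y)
        + nn.MSELoss()(mat_pred, mat_y))
```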
| Metric | Value |
|---|---|
| Toxicity Accuracy | 68.8% |
| Toxicity F1 (macro) | 0.56 |
| Toxic Class AUC | 0.68 |
| Agreement with GPT-3.5 | 50% |
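Metrics like those in the table can be computed with scikit-learn. The label and probability arrays below are made-up stand-ins for the real test split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Invented predictions standing in for the real test split.
y_true = np.array([0, 0, 1, 2, 2, 1, 0, 2])  # 0=non-toxic, 1=neutral, 2=toxic
y_pred = np.array([0, 1, 1, 2, 0, 1, 0, 2])
# Predicted probability of the toxic class, used for the one-vs-rest AUC.
p_toxic = np.array([0.1, 0.4, 0.3, 0.9, 0.2, 0.5, 0.1, 0.8])

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
toxic_auc = roc_auc_score((y_true == 2).astype(int), p_toxic)
```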
- `truecolortest.ipynb` - Full implementation and evaluation
- `Research_Project.pdf` - Full academic report
- `results/` - Confusion matrices, ROC curves, and visualizations
- Python 3.10
- PyTorch
- Hugging Face Transformers (DistilBERT)
- Scikit-learn
- Pandas, NumPy, Matplotlib
- Install dependencies
- Run `truecolortest.ipynb`
- View results in the `results/` folder
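A minimal setup along these lines, assuming the listed tech stack (exact package pins are not specified in the project):

```shell
# Assumed setup commands; adjust versions to your environment.
pip install torch transformers scikit-learn pandas numpy matplotlib
jupyter notebook truecolortest.ipynb
```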
- Small dataset (450 samples) limits generalization
- Class imbalance across platforms
- No multilingual support
- Manual annotation introduces subjectivity
- Expand dataset with balanced platform representation
- Multilingual and cross-cultural analysis
- Human-in-the-loop refinement
- Real-time deployment for moderation systems
Phone Pyae Aung - MSc Data Science, University of Exeter
See `Research_Project.pdf` for full references