- Project Overview
- Contributors
- Executive Summary
- Technical Setup
- Dataset Description
- Model Families Explored
- Evaluation Metrics
- Results and Performance Analysis
- Tools and Technologies
- Implementation Notes
- Limitations and Future Work
- Conclusion
- References
IdentiLLM is a machine learning classification system designed to identify and distinguish between different Large Language Model (LLM) families based on student usage patterns and feedback. This project was developed as part of CSC311: Introduction to Machine Learning at the University of Toronto.
The primary objective of this project is to develop and evaluate machine learning models capable of classifying user experiences with three major LLM platforms: ChatGPT, Claude, and Gemini. By analyzing student responses regarding their interactions with these models, we aim to identify distinguishing characteristics that can reliably predict which LLM was being evaluated.
---
- Zhe Wang
- Virat Talan
- Yicheng Wang
This project implements and compares three machine learning model families for multi-class classification:
- Random Forests: Ensemble-based decision tree classifier
- Softmax Regression: Discriminative linear classifier for multi-class problems
- Neural Networks: Multi-layer perceptron with non-linear activation functions
Selected Final Model: Neural Network (Test Accuracy: 67.4%)
The neural network was selected as the final model because it performed consistently across data splits and evaluation metrics. Softmax Regression reached a higher test accuracy (70.6%), but the neural network showed lower variance across random splits and more stable generalization to unseen data, making it the more reliable choice for deployment.
- Programming Language: Python 3.x
- Primary Libraries:
- scikit-learn (model implementation and evaluation)
- pandas (data manipulation and preprocessing)
- numpy (numerical computations)
- matplotlib/seaborn (visualization)
pip install scikit-learn pandas numpy matplotlib seaborn

CSC311_Project/
├── README.md
├── training_data_clean.csv
├── project_baseline.py
├── pred_example.py
├── CSC311_Project.ipynb
└── docs/
The dataset comprises student responses evaluating their experiences with three major Large Language Models (ChatGPT, Claude, and Gemini) in academic contexts. Each record represents a single student's evaluation of one specific LLM platform.
Feature Types:
- Quantitative Features: Likert-scale responses (1-5) measuring:
- Likelihood of model usage
- Perceived helpfulness
- Frequency of result verification
- Occurrence of suboptimal or incorrect responses
- Qualitative Features: Open-ended text responses describing:
- Model strengths (e.g., concept simplification, writing assistance)
- Model weaknesses (e.g., citation accuracy, factual reliability)
Class Distribution: The dataset exhibits relatively balanced class distribution across the three LLM categories, making accuracy a suitable primary evaluation metric.
Several data quality challenges were identified during initial exploration:
- Missing Data: Incomplete responses with uneven missingness patterns across labels
- Response Duplication: Students frequently provided nearly identical answers across multiple entries, introducing potential data leakage risks
- Text Inconsistencies: Irregular formatting, placeholder strings, and duplicated phrasing in open-ended responses
1. Data Cleaning
- Loaded raw data into pandas DataFrame
- Identified null and invalid entries through systematic inspection
- Applied median imputation for missing Likert-scale features
- Utilized "no response" indicators for empty text fields
2. Feature Engineering
- One-Hot Encoding: Transformed categorical Likert-scale responses into binary columns for model compatibility
- Text Vectorization: Extracted frequent, relevant keywords from open-ended responses
- Binary Feature Creation: Converted keyword presence into binary features to minimize noise while preserving signal
3. Data Splitting Strategy
- Training Set: 70%
- Validation Set: 15%
- Test Set: 15%
- Critical Constraint: All three responses from each student were kept in the same subset to prevent data leakage
4. Exploratory Data Analysis
- Generated box plots and count plots for feature distribution analysis
- Confirmed class balance across LLM categories
- Identified features with high predictive potential through visual inspection
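The student-level constraint in the splitting step can be sketched with scikit-learn's `GroupShuffleSplit`. This is a minimal illustration, not the project's actual code: the helper name `grouped_train_val_test_split`, the `student_ids` array, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def grouped_train_val_test_split(groups, val_frac=0.15, test_frac=0.15, seed=0):
    """Return (train_idx, val_idx, test_idx) so that no group spans two subsets."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))

    # First carve off the test groups.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    rest_pos, test_pos = next(outer.split(idx, groups=groups))

    # Then split the remainder into train and validation groups.
    # The validation share is rescaled because it is taken from the remainder.
    inner = GroupShuffleSplit(
        n_splits=1, test_size=val_frac / (1.0 - test_frac), random_state=seed
    )
    train_pos, val_pos = next(inner.split(rest_pos, groups=groups[rest_pos]))
    return rest_pos[train_pos], rest_pos[val_pos], test_pos


# Hypothetical data: 100 students, three responses each.
student_ids = np.repeat(np.arange(100), 3)
train_idx, val_idx, test_idx = grouped_train_val_test_split(student_ids)
print(len(train_idx), len(val_idx), len(test_idx))
```

Note that the fractions apply to the number of students (groups), so the row-level proportions are only approximately 70/15/15 when students contribute different numbers of responses.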
Rationale for Selection
Random Forests were chosen for their ability to handle both categorical and continuous features while capturing complex, non-linear relationships in the data. Averaging over many decorrelated trees reduces variance and improves stability, which suits a relatively small and noisy dataset.
Implementation Details
- Framework: scikit-learn's `RandomForestClassifier`
- Optimization Strategy: Ensemble-based learning without gradient descent
- Feature Importance: Provides interpretable feature importance rankings
Hyperparameter Search Space
| Hyperparameter | Values Explored | Rationale |
|---|---|---|
| `criterion` | {gini, entropy} | Compare impurity measures |
| `n_estimators` | {50, 100, 200, 300} | Balance ensemble size and computation |
| `max_depth` | {None, 5, 8, 10, 13, 15, 20, 25, 30} | Control tree complexity |
| `max_features` | {sqrt, log2, None} | Feature subsampling strategies |
| `min_samples_split` | {25, 30, 35, 40, 45} | Prevent overfitting |
| `min_samples_leaf` | {10, 15, 20, 25, 30, 35, 40} | Control leaf node size |
Validation Approach
Grid search over the hyperparameter combinations, with each configuration scored on the validation set, was used to identify the configuration that best balances performance and generalization.
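A validation-set grid search of this kind can be sketched with `ParameterGrid`. The snippet below is an illustration under stated assumptions: the data is synthetic (`make_classification`), the split is a plain stratified split rather than the student-level split used in the project, and only a small slice of the search space listed above is enumerated to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the real feature matrix (hypothetical data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A small slice of the search space, chosen only for illustration.
param_grid = ParameterGrid({
    "criterion": ["gini", "entropy"],
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [25],
    "min_samples_leaf": [10],
})

best_score, best_params = -1.0, None
for params in param_grid:
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation accuracy
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

Scoring on a fixed validation split, rather than through `GridSearchCV`'s internal folds, keeps the test set untouched until the final evaluation.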
Rationale for Selection
Softmax Regression provides a discriminative linear classifier that excels on high-dimensional sparse text features while maintaining interpretability through direct feature-weight-to-class mappings.
Implementation Details
- Framework: scikit-learn's `LogisticRegression` with multinomial option
- Solver: L-BFGS optimizer for efficient multinomial logistic loss minimization
- Regularization: L2 penalty controlled by the inverse regularization strength C (smaller C means stronger regularization)
- Convergence: max_iter = 5000; the solver stops earlier once its convergence tolerance is met
Hyperparameter Search Space
Text Feature Extraction (TF-IDF):
| Parameter | Values | Purpose |
|---|---|---|
| `max_features` | {250, 500, 1000, 5000} | Vocabulary size control |
| `min_df` | {1, 2, 3, 4, 5} | Minimum document frequency |
| `ngram_range` | {(1,1), (1,2)} | Unigrams vs. unigrams + bigrams |
Classifier:
| Parameter | Values | Purpose |
|---|---|---|
| `C` | {0.1, 0.5, 1.0, 5.0, 10.0} | Inverse regularization strength |
Validation Strategy
All hyperparameter tuning was performed exclusively on the validation set, and the test set was reserved for the final unbiased evaluation. Larger vocabularies showed higher training accuracy but reduced validation recall and F1, indicating overfitting.
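The TF-IDF plus Softmax Regression setup can be sketched as a scikit-learn pipeline. This is a minimal sketch, not the project's code: the toy responses and their labels are invented for illustration, and the parameter values are single points picked from the search space rather than the tuned configuration. With the `lbfgs` solver, recent scikit-learn versions fit a multinomial model for multi-class targets by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy responses; the real text columns are not reproduced here.
texts = [
    "great for writing essays and brainstorming",
    "helps me brainstorm and draft essays",
    "explains code clearly and carefully",
    "careful explanations when I debug code",
    "integrates with my documents and search",
    "good for search and summarizing documents",
]
labels = ["ChatGPT", "ChatGPT", "Claude", "Claude", "Gemini", "Gemini"]

model = make_pipeline(
    # Unigrams and bigrams, vocabulary capped at 500 terms (illustrative values).
    TfidfVectorizer(max_features=500, min_df=1, ngram_range=(1, 2)),
    # C is the inverse regularization strength of the L2 penalty.
    LogisticRegression(C=1.0, solver="lbfgs", max_iter=5000),
)
model.fit(texts, labels)

# Class probabilities for a new (also hypothetical) response.
probs = model.predict_proba(["useful for drafting essays"])
print(dict(zip(model[-1].classes_, probs[0].round(3))))
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned only from the training split when the pipeline is refit during tuning.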
Rationale for Selection
Neural networks serve as universal function approximators capable of modeling complex, non-linear relationships between features and labels. The architecture automatically learns hierarchical feature representations, reducing the need for extensive manual feature engineering in high-dimensional feature spaces.
Implementation Details
- Framework: scikit-learn's `MLPClassifier`
- Optimizer: Stochastic Gradient Descent (SGD)
- Activation Function: ReLU (Rectified Linear Unit)
- Regularization: L2 penalty to prevent overfitting
- Output Layer: Softmax activation for multi-class classification
- Mini-batch Size: 60 samples (approximately 1/10 of training data)
- Early Stopping: Threshold of 300 gradient descent iterations
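The settings listed above map onto `MLPClassifier` arguments roughly as follows. This is a sketch under assumptions: the data is synthetic, the hidden layer width, `alpha`, and learning rate are illustrative picks rather than the tuned values, and the 300-iteration threshold is assumed to correspond to `max_iter`.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real feature matrix (hypothetical data).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=10, n_classes=3, random_state=0
)

mlp = MLPClassifier(
    hidden_layer_sizes=(32,),   # one hidden layer; the width is illustrative
    activation="relu",          # ReLU in the hidden layers
    solver="sgd",               # stochastic gradient descent
    alpha=0.001,                # L2 penalty strength (illustrative)
    learning_rate_init=0.01,    # illustrative
    batch_size=60,              # mini-batch size from the list above
    max_iter=300,               # assumption: the 300-iteration threshold maps here
    random_state=0,
)
mlp.fit(X, y)

# scikit-learn picks the output activation from the target type.
print(mlp.out_activation_)
```

For multi-class targets `MLPClassifier` applies a softmax output layer automatically, so it does not need to be configured by hand.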
Hyperparameter Search Space
| Hyperparameter | Values Explored | Rationale |
|---|---|---|
| Hidden Layers | 1-3 layers | Balance capacity and overfitting risk |
| Neurons per Layer | {4, 8, 16, 32, 64, 128} | Progressive capacity scaling |
| L2 Regularization | {0.0001, 0.0005, 0.001, 0.005, 0.01} | Fine-grained regularization control |
| Learning Rate | {0.0001, 0.0005, 0.001, 0.005, 0.01} | Ensure stable convergence |
| Vocabulary Size | {5, 10, 25, 50, 100} | Limit text feature noise |
Feature Selection
Applied mutual information-based feature selection to identify and retain only the most informative features, improving model efficiency and reducing overfitting on noisy features.
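One way to sketch mutual information-based selection is `SelectKBest` with `mutual_info_classif`. The example below is an assumption-laden illustration: the binary features are synthetic, with one column constructed to depend on the label and nine columns of pure noise, and `k=3` is an arbitrary choice.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 3, size=n)  # three hypothetical classes

# Column 0 mostly indicates class 0 (10% of entries flipped); the rest are noise.
informative = (y == 0).astype(int)
flip = rng.random(n) < 0.1
informative = np.where(flip, 1 - informative, informative)
noise = rng.integers(0, 2, size=(n, 9))
X = np.column_stack([informative, noise])

# Treat every column as discrete, which fits binary keyword indicators.
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=score_func, k=3)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print(kept, selector.scores_.round(3))
```

The selector should be fit on the training split only and then applied to validation and test data, so that label information from held-out rows does not influence which features are kept.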
Validation Approach
Cross-validation with grid search over hyperparameter combinations. Validation accuracy guided hyperparameter selection. Student-level grouping maintained across all splits to prevent data leakage.
All models were evaluated using the following metrics:
- Accuracy: Primary metric due to balanced class distribution
- Precision: Measures classification exactness (minimize false positives)
- Recall: Measures classification completeness (minimize false negatives)
- F1 Score (Macro): Harmonic mean of precision and recall computed per class, then averaged with equal weight across the three classes
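These four metrics can be computed with `sklearn.metrics`. The labels and predictions below are a small hypothetical example, not project output.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true labels and predictions for six responses.
y_true = ["ChatGPT", "ChatGPT", "Claude", "Claude", "Gemini", "Gemini"]
y_pred = ["ChatGPT", "ChatGPT", "Claude", "Gemini", "Claude", "Gemini"]

accuracy = accuracy_score(y_true, y_pred)

# average="macro" computes each metric per class, then takes the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy case one Claude and one Gemini response are swapped, so all four values come out to 2/3.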
| Model | Test Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Random Forest | 65.8% | 63.9% | 64.2% | 62.6% |
| Softmax Regression | 70.6% | 71.5% | 70.7% | 70.5% |
| Neural Network | 67.4% | 67.3% | 67.5% | 66.8% |
Despite Softmax Regression achieving the highest test accuracy (70.6%), the Neural Network was selected as the final model for the following reasons:
- Performance Difference: The gap of roughly 3 percentage points (70.6% vs. 67.4%) falls within expected random variation given the relatively small and noisy test set
- Statistical Significance: Without formal statistical testing, the observed difference cannot be definitively attributed to superior model performance
- Consistency and Variance: Neural Network demonstrated more stable performance across different random data splits, indicating lower variance and better generalization
- Production Reliability: Lower variance makes the Neural Network more reliable for deployment on unseen data
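A rough back-of-the-envelope check illustrates the first two points. It assumes a test set of 126 responses (the sum of the entries in the Softmax Regression confusion matrix reported in this document), assumes both models were scored on the same test set, and uses the normal approximation to a binomial proportion. It is not a substitute for a paired test such as McNemar's.

```python
import math

n_test = 126          # assumed test set size (sum of confusion matrix entries)
acc_softmax = 0.706   # reported test accuracy, Softmax Regression
acc_nn = 0.674        # reported test accuracy, Neural Network


def standard_error(p, n):
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(p * (1 - p) / n)


se = standard_error(acc_softmax, n_test)
half_width_95 = 1.96 * se          # half-width of an approximate 95% interval
gap = acc_softmax - acc_nn

print(f"standard error: {se:.3f}")
print(f"95% half-width: {half_width_95:.3f}")
print(f"observed gap:   {gap:.3f}")
```

Under these assumptions the approximate 95% interval around a single accuracy estimate is about ±8 percentage points, noticeably wider than the observed gap of about 3 points.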
Expected Test Accuracy: 67.4% (empirically validated across multiple random splits)
Common Error Patterns:
The confusion matrix analysis reveals systematic misclassification patterns across all three models:
Softmax Regression Confusion Matrix:
| Actual \ Predicted | ChatGPT | Claude | Gemini |
|---|---|---|---|
| ChatGPT | 38 | 2 | 2 |
| Claude | 3 | 28 | 11 |
| Gemini | 6 | 13 | 23 |
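The accuracy and per-class recall implied by this matrix can be recomputed directly with NumPy. Rows are actual classes and columns are predicted classes, as in the table; the variable names are chosen for the example.

```python
import numpy as np

labels = ["ChatGPT", "Claude", "Gemini"]
# Rows: actual class, columns: predicted class (values from the table above).
cm = np.array([
    [38, 2, 2],
    [3, 28, 11],
    [6, 13, 23],
])

# Accuracy: correct predictions (diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal entry over the row total for that class.
recall = np.diag(cm) / cm.sum(axis=1)

# Confusions between Claude and Gemini, in both directions.
claude_gemini_errors = cm[1, 2] + cm[2, 1]

print(f"accuracy: {accuracy:.3f}")
for name, r in zip(labels, recall):
    print(f"recall[{name}]: {r:.3f}")
print("Claude/Gemini confusions:", claude_gemini_errors)
```

The diagonal sums to 89 of 126 responses, which matches the reported 70.6% test accuracy, and 24 of the 37 errors are Claude/Gemini confusions.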
Key Observations:
- Claude vs. Gemini Confusion:
  - 13 Gemini responses misclassified as Claude
  - 11 Claude responses misclassified as Gemini
  - Significantly higher confusion rates compared to ChatGPT misclassifications
- ChatGPT Distinctiveness:
  - ChatGPT shows clearer separation from other models
  - Lower false positive and false negative rates
- Root Cause Analysis:
  - Generic responses (e.g., "I would use it to write code") provide insufficient model-specific signal
  - Short or ambiguous wording in open-ended responses
  - Similar usage patterns between Claude and Gemini in academic contexts
- Python 3.x: Primary programming language
- scikit-learn: Machine learning model implementation, training, and evaluation
- pandas: Data manipulation, cleaning, and preprocessing
- numpy: Numerical computation and array operations
- matplotlib/seaborn: Data visualization and exploratory analysis
- Jupyter Notebook: Interactive development and experimentation
- Git: Version control (if applicable)
- `RandomForestClassifier`: Random Forest implementation
- `LogisticRegression`: Softmax Regression with L-BFGS solver
- `MLPClassifier`: Neural Network with SGD optimizer
- `TfidfVectorizer`: Text feature extraction
- `train_test_split`: Data partitioning utilities
- `GridSearchCV` (custom implementation): Hyperparameter optimization
Custom preprocessing code was developed to:
- Clean text fields and remove irrelevant artifacts (e.g., "THIS MODEL" placeholders)
- Handle missing entries with appropriate imputation strategies
- Normalize merged text columns for consistent feature extraction
- Remove stop words and perform basic text normalization
- Implemented bag-of-words representation for text features
- Applied mutual information-based feature selection for neural networks
- Created one-hot encoded representations of categorical Likert-scale features
- Dataset Size: Relatively small dataset limits model capacity and generalization
- Feature Sparsity: Text features remain sparse despite preprocessing efforts
- Claude-Gemini Separation: Models struggle to distinguish between Claude and Gemini responses
- Statistical Validation: Lack of formal significance testing for model comparison
- Data Collection: Collect additional labeled data to improve model robustness
- Advanced Text Features: Use pretrained embeddings, either static (e.g., word2vec) or contextualized (e.g., BERT)
- Ensemble Methods: Combine predictions from multiple models for improved accuracy
- Class-Specific Features: Engineer features specifically targeting Claude vs. Gemini distinction
- Cross-Validation: Implement k-fold cross-validation for more robust performance estimates
- Hyperparameter Optimization: Explore more sophisticated optimization techniques (e.g., Bayesian optimization)
This project successfully implemented and evaluated three distinct machine learning model families for LLM classification based on user feedback. The Neural Network model, achieving 67.4% test accuracy with strong consistency across metrics, provides a reliable foundation for predicting LLM identity from student responses.
The analysis revealed that while ChatGPT responses exhibit distinctive characteristics, distinguishing between Claude and Gemini remains challenging because students use the two models in similar ways and often describe them in generic terms. Future work should focus on collecting more diverse data and implementing advanced feature engineering techniques to improve inter-model discrimination.
- scikit-learn Documentation: https://scikit-learn.org/
- CSC311 Course Materials, University of Toronto
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning
This project was developed for academic purposes as part of CSC311 at the University of Toronto.
Last Updated: Fall 2025