CS5787 Deep Learning - Final Project
Cornell University, Fall 2025
- Sosai Sho (ss4525@cornell.edu)
- Franklin Dickinson (fd274@cornell.edu)
This project investigates the effectiveness of cross-domain transfer learning for fraud detection, specifically addressing:
Primary Research Question: Does transfer learning from IEEE fraud data improve fraud detection on European credit card transactions, and at what data threshold does it provide value?
Transfer learning is DATA-DEPENDENT:
- Low data (10% of training data): +14.7% PR-AUC improvement over vanilla BERT (0.7680)
- Full data (100%): -9.3% degradation relative to vanilla BERT (negative transfer)
- Average: +5.1% improvement across all five splits
Cross-schema transfer WORKS:
- Successfully bridged IEEE (434 features) → European (30 PCA features)
- Text-based representation enables schema-agnostic learning
XGBoost remains SUPERIOR:
- Best: 0.8734 PR-AUC (50% data)
- Robust to extreme class imbalance (581:1)
- 11.5% better than best BERT variant
| Model | Best PR-AUC | Best Data Split | vs Vanilla BERT | vs XGBoost |
|---|---|---|---|---|
| XGBoost (tuned) | 0.8734 | 50% | N/A | Baseline |
| Fraud-BERT | 0.7680 | 10% | +14.7% | -8.1% |
| FT-Transformer | 0.6925 | 20:1 undersample | +4.5% | -20.7% |
| Vanilla BERT | 0.6940 | 25% | Baseline | -19.6% |
All metrics are PR-AUC (primary metric for imbalanced fraud detection)
- First quantification of transfer learning effectiveness across 5 data regimes (5% → 100%)
- Inverse data efficiency curve: transfer benefit peaks at 10% data and declines steadily through the 100% regime
- Interaction effect: Imbalance × Domain Mismatch × Data Scale
- First fraud detection study to transfer between different feature representations
- Text conversion bridges 434-feature IEEE → 30-feature European datasets
- Validates text-based approach for heterogeneous fraud data
- Documented negative transfer in tabular-to-text fraud domain
- Domain mismatch amplified by extreme class imbalance (581:1) at scale
- Provides cautionary evidence: more pretraining ≠ always better
```python
if fraud_samples < 50:
    use_fraud_bert()      # +15% improvement
elif fraud_samples < 200:
    use_vanilla_bert()    # moderate performance
else:
    use_xgboost()         # best performance (0.87 PR-AUC)
```

```
fraud-detection-transfer-learning/
├── scripts/                                  # Training and analysis scripts
│   ├── train_xgboost.py                      # XGBoost training
│   ├── tune_xgboost.py                       # Hyperparameter tuning
│   ├── text_conversion.py                    # Tabular → text conversion
│   ├── pretrain_mlm.py                       # BERT MLM pretraining
│   ├── train_ft_transformer_v2.py            # FT-Transformer training
│   └── validate_conversions.py               # Text validation
│
├── configs/                                  # Model configurations
│   ├── mlm_config.yaml                       # MLM pretraining config
│   ├── fraud_bert_config.yaml                # Fraud-BERT config
│   └── ft_transformer_config.yaml            # FT-Transformer configs
│
├── notebooks/                                # Jupyter notebooks
│   ├── 02_xgboost_baseline.ipynb             # XGBoost training
│   ├── 03_text_conversion_analysis.ipynb     # Text conversion analysis
│   ├── 04a_vanilla_bert_SIMPLE.ipynb         # Vanilla BERT
│   └── european_finetune_on_fraudbert.ipynb  # Transfer learning (Colab)
│
├── results/                                  # Sample results structure
│   └── README.md                             # Results will be generated here
│
├── requirements.txt                          # Python dependencies
├── LICENSE                                   # MIT License
└── README.md                                 # This file
```
Note: Datasets and trained models are not included due to size constraints. Download instructions below.
```bash
pip install -r requirements.txt
```

- European Credit Card Dataset (Target)
  - Download: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
  - Place in: `DATA/european_2013/creditcard.csv`
- IEEE-CIS Fraud Dataset (Source)
  - Download: https://www.kaggle.com/c/ieee-fraud-detection/data
  - Place in: `DATA/ieee_2019/`
1. XGBoost Baseline (2 minutes)
   ```bash
   python scripts/train_xgboost.py --split 050
   ```
2. XGBoost Hyperparameter Tuning (30 minutes)
   ```bash
   python scripts/tune_xgboost.py --split 050
   ```
3. Convert to Text (5 minutes; the idea is sketched after this list)
   ```bash
   python scripts/text_conversion.py --dataset european --template natural
   ```
4. BERT MLM Pretraining (8 hours on GPU)
   ```bash
   python scripts/pretrain_mlm.py --config configs/mlm_config.yaml
   ```
5. Transfer Learning (Google Colab recommended)
   - Open: `notebooks/european_finetune_on_fraudbert.ipynb`
   - Run all cells (2-3 hours per split on T4 GPU)
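The tabular-to-text conversion in step 3 is what lets one model consume both schemas. A minimal sketch of the idea, assuming a "natural"-style template; the helper, column names, and phrasing below are illustrative, not the exact code in `scripts/text_conversion.py`:

```python
import pandas as pd

def row_to_text(row: pd.Series) -> str:
    """Render one transaction as a natural-language sentence.

    Illustrative only: the real 'natural' template may phrase
    fields differently.
    """
    parts = [f"A transaction of amount {row['Amount']:.2f}"]
    # Any column set works: the text form is schema-agnostic, which is
    # what lets IEEE (434 features) and European (30 PCA features)
    # share one BERT vocabulary.
    for col in row.index:
        if col not in ("Amount", "Class"):
            parts.append(f"{col} is {row[col]:.3f}")
    return ", ".join(parts) + "."

# Example with two European-style PCA features (hypothetical values):
tx = pd.Series({"Amount": 149.62, "V1": -1.360, "V2": -0.073, "Class": 0})
print(row_to_text(tx))
# A transaction of amount 149.62, V1 is -1.360, V2 is -0.073.
```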
- European Credit Card: 284,807 transactions, 0.172% fraud (581:1 imbalance)
- IEEE-CIS Fraud: 590,540 transactions, different feature schema (434 features)
- 5 training regimes: 5%, 10%, 25%, 50%, 100%
- Fixed test set (20%) across all experiments
- Stratified sampling to maintain fraud ratio
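A sketch of how these regimes can be constructed with scikit-learn (the helper and seed are ours for illustration): carve out the fixed 20% test set once, then take a stratified subsample of the remaining pool for each regime so the 581:1 fraud ratio is preserved.

```python
from sklearn.model_selection import train_test_split

def make_split(X, y, train_frac, seed=42):
    """Fixed stratified 20% test set, then a stratified subsample
    of the remaining pool for one training regime."""
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    if train_frac < 1.0:
        # Keep train_frac of the pool, preserving the fraud ratio
        X_train, _, y_train, _ = train_test_split(
            X_pool, y_pool, train_size=train_frac,
            stratify=y_pool, random_state=seed)
    else:
        X_train, y_train = X_pool, y_pool
    return X_train, y_train, X_test, y_test

# e.g. the 10% regime:
# X_tr, y_tr, X_te, y_te = make_split(X, y, train_frac=0.10)
```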
- XGBoost: Gradient boosting with `scale_pos_weight` for imbalance
- Vanilla BERT: `bert-base-uncased` with no fraud-specific pretraining
- Fraud-BERT: `bert-base-uncased` + MLM pretraining on IEEE fraud text
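As context for the XGBoost line above, `scale_pos_weight` is conventionally set to the negative-to-positive ratio. A minimal sketch, assuming a training set `(X_train, y_train)` (variable names are ours):

```python
from xgboost import XGBClassifier

# Weight the rare positive class by the imbalance ratio
# (~581 on the full European training set at 0.172% fraud).
n_neg, n_pos = (y_train == 0).sum(), (y_train == 1).sum()
model = XGBClassifier(
    scale_pos_weight=n_neg / n_pos,
    eval_metric="aucpr",  # optimize for PR-AUC, the primary metric
)
model.fit(X_train, y_train)
```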
- Primary Metric: PR-AUC (appropriate for 581:1 imbalance)
- Secondary Metrics: ROC-AUC, F1, Precision, Recall
- Early Stopping: Based on validation PR-AUC (patience=3)
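Concretely, PR-AUC can be computed as average precision over the model's predicted fraud probabilities; a short sketch with scikit-learn, reusing `model`, `X_test`, and `y_test` from the sketches above:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

scores = model.predict_proba(X_test)[:, 1]        # P(fraud) per transaction
pr_auc = average_precision_score(y_test, scores)  # primary metric
roc_auc = roc_auc_score(y_test, scores)           # secondary metric
print(f"PR-AUC={pr_auc:.4f}  ROC-AUC={roc_auc:.4f}")
```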
- Use Fraud-BERT when: `fraud_samples < 50` (+15% improvement)
- Consider when: `50 < fraud_samples < 200` (modest gains)
- Avoid when: `fraud_samples > 200` (may hurt performance)
- Recommended: XGBoost (0.87 PR-AUC, 2-min training, robust)
- Cold-Start: Fraud-BERT (0.77 PR-AUC with 10% data, enables rapid deployment)
- Trade-off: BERT 25× slower to train than XGBoost
The tuned XGBoost configuration below was validated across a 20× range of training-set sizes (5% → 100%):
```python
{
    'max_depth': 6,
    'learning_rate': 0.05,
    'n_estimators': 300,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'min_child_weight': 5,
    'gamma': 0.2,
    'reg_alpha': 0.5,
    'reg_lambda': 0.1,
}
```
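A sketch of plugging this configuration into `XGBClassifier`, combined with the imbalance weighting shown earlier (this mirrors, but is not copied from, `scripts/train_xgboost.py`; `n_neg`, `n_pos`, and the training arrays come from the earlier sketches):

```python
from xgboost import XGBClassifier

tuned_params = {
    'max_depth': 6, 'learning_rate': 0.05, 'n_estimators': 300,
    'subsample': 0.9, 'colsample_bytree': 0.7, 'min_child_weight': 5,
    'gamma': 0.2, 'reg_alpha': 0.5, 'reg_lambda': 0.1,
}
model = XGBClassifier(**tuned_params,
                      scale_pos_weight=n_neg / n_pos,  # ~581, see above
                      eval_metric="aucpr")
model.fit(X_train, y_train)
```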
Key references for this work:
- BERT: Devlin et al. (2019) - "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- XGBoost: Chen & Guestrin (2016) - "XGBoost: A Scalable Tree Boosting System"
- European Dataset: Dal Pozzolo et al. (2015) - "Calibrating Probability with Undersampling"
- IEEE Dataset: IEEE Computational Intelligence Society (2019) - Kaggle Competition
- PR-AUC: Saito & Rehmsmeier (2015) - "The Precision-Recall Plot Is More Informative"
- Domain Adaptation: Mitigate negative transfer at full data scale
  - Adversarial domain alignment
  - Multi-task learning
  - Progressive fine-tuning
- Real-Time Deployment:
  - XGBoost for production (2-min training, 0.87 PR-AUC)
  - Fraud-BERT for cold-start scenarios (<50 fraud samples)
- Cross-Industry Validation:
  - Healthcare fraud
  - Insurance fraud
  - Payment fraud
- Explainability:
  - Attention visualization for BERT
  - SHAP values for XGBoost
This project is for academic purposes as part of CS5787 Deep Learning course at Cornell University.
- Course: CS5787 Deep Learning, Cornell University (Fall 2025)
- Instructor: [Course Instructor Name]
- Datasets: Kaggle European Credit Card & IEEE-CIS Fraud Detection datasets
- Compute: Google Colab (T4 GPU), RunPod (A100 GPU)
Last Updated: November 29, 2025
Status: Transfer Learning Pipeline Complete