Transfer Learning for Fraud Detection: A Data-Dependent Analysis

CS5787 Deep Learning - Final Project
Cornell University, Fall 2025


Team


Project Overview

This project investigates the effectiveness of cross-domain transfer learning for fraud detection, specifically addressing:

Primary Research Question: Does transfer learning from IEEE fraud data improve fraud detection on European credit card transactions, and at what data threshold does it provide value?

Key Findings

Transfer learning is DATA-DEPENDENT:

  • Low data (10%): +14.7% improvement (0.7680 PR-AUC)
  • Full data (100%): -9.3% degradation (negative transfer)
  • Average: +5.1% improvement across all splits

Cross-schema transfer WORKS:

  • Successfully bridged IEEE (434 features) → European (30 PCA features)
  • Text-based representation enables schema-agnostic learning

XGBoost remains SUPERIOR:

  • Best: 0.8734 PR-AUC (50% data)
  • Robust to extreme class imbalance (581:1)
  • 11.5% better than best BERT variant

Results Summary

| Model | Best PR-AUC | Best Data Split | vs Vanilla BERT | vs XGBoost |
|---|---|---|---|---|
| XGBoost (tuned) | 0.8734 | 50% | N/A | Baseline |
| Fraud-BERT | 0.7680 | 10% | +14.7% | -8.1% |
| FT-Transformer | 0.6925 | 20:1 undersample | +4.5% | -20.7% |
| Vanilla BERT | 0.6940 | 25% | Baseline | -19.6% |

All metrics are PR-AUC (primary metric for imbalanced fraud detection)


Research Contributions

1. Data-Dependent Transfer Learning

  • First quantification of transfer learning effectiveness across 5 data regimes (5% → 100%)
  • Inverse data-efficiency curve: the transfer benefit peaks at 10% of the target data and degrades as the training split grows to 100%
  • Interaction effect: Imbalance × Domain Mismatch × Data Scale

2. Cross-Schema Transfer Learning

  • First fraud detection study to transfer between different feature representations
  • Text conversion bridges 434-feature IEEE → 30-feature European datasets
  • Validates text-based approach for heterogeneous fraud data
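
As an illustration of this bridge, the sketch below shows one way a transaction row could be rendered as a sentence. The repo's text_conversion.py defines the actual templates; the column names and wording here are illustrative assumptions only.

def row_to_text(row: dict) -> str:
    """Render one transaction record as a sentence.

    Serialising rows as text is what lets a single BERT vocabulary cover
    both the 434-feature IEEE rows and the 30-feature European PCA rows.
    """
    parts = [f"{name} is {value}" for name, value in row.items()]
    return "Transaction with " + ", ".join(parts) + "."

# Example with a European-style row (PCA components plus Time/Amount)
print(row_to_text({"Time": 406, "V1": -2.31, "V2": 1.95, "Amount": 0.0}))
# -> Transaction with Time is 406, V1 is -2.31, V2 is 1.95, Amount is 0.0.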

3. Negative Transfer Discovery

  • Documented negative transfer in tabular-to-text fraud domain
  • Domain mismatch amplified by extreme class imbalance (581:1) at scale
  • Provides cautionary evidence: more pretraining ≠ always better

4. Practical Decision Framework

def choose_model(fraud_samples: int) -> str:
    """Pick a model family from the number of labelled fraud samples available."""
    if fraud_samples < 50:
        return "fraud-bert"    # +15% improvement in the low-data regime
    elif fraud_samples < 200:
        return "vanilla-bert"  # moderate performance
    else:
        return "xgboost"       # best performance (0.87 PR-AUC)

Repository Structure

fraud-detection-transfer-learning/
├── scripts/                       # Training and analysis scripts
│   ├── train_xgboost.py          # XGBoost training
│   ├── tune_xgboost.py           # Hyperparameter tuning
│   ├── text_conversion.py        # Tabular → text conversion
│   ├── pretrain_mlm.py           # BERT MLM pretraining
│   ├── train_ft_transformer_v2.py # FT-Transformer training
│   └── validate_conversions.py   # Text validation
│
├── configs/                       # Model configurations
│   ├── mlm_config.yaml           # MLM pretraining config
│   ├── fraud_bert_config.yaml    # Fraud-BERT config
│   └── ft_transformer_config.yaml # FT-Transformer configs
│
├── notebooks/                     # Jupyter notebooks
│   ├── 02_xgboost_baseline.ipynb  # XGBoost training
│   ├── 03_text_conversion_analysis.ipynb  # Text conversion analysis
│   ├── 04a_vanilla_bert_SIMPLE.ipynb     # Vanilla BERT
│   └── european_finetune_on_fraudbert.ipynb  # Transfer learning (Colab)
│
├── results/                       # Sample results structure
│   └── README.md                 # Results will be generated here
│
├── requirements.txt               # Python dependencies
├── LICENSE                        # MIT License
└── README.md                     # This file

Note: Datasets and trained models are not included due to size constraints. Download instructions below.


Quick Start

Prerequisites

pip install -r requirements.txt

Datasets

  1. European Credit Card Dataset (Target)

  2. IEEE-CIS Fraud Dataset (Source)

Train Models

1. XGBoost Baseline (2 minutes)

python scripts/train_xgboost.py --split 050

2. XGBoost Hyperparameter Tuning (30 minutes)

python scripts/tune_xgboost.py --split 050

3. Convert to Text (5 minutes)

python scripts/text_conversion.py --dataset european --template natural

4. BERT MLM Pretraining (8 hours on GPU)

python scripts/pretrain_mlm.py --config configs/mlm_config.yaml
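
For orientation, here is a minimal, hedged sketch of what this stage amounts to with Hugging Face Transformers. The file name ieee_fraud_text.txt and the hyperparameters are assumptions for illustration; pretrain_mlm.py together with configs/mlm_config.yaml is the authoritative implementation.

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One IEEE transaction sentence per line (produced by the text-conversion step)
dataset = load_dataset("text", data_files={"train": "ieee_fraud_text.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="fraud-bert-mlm", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()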

5. Transfer Learning (Google Colab recommended)

  • Open: notebooks/european_finetune_on_fraudbert.ipynb
  • Run all cells (2-3 hours per split on T4 GPU)

Experimental Methodology

Datasets

  • European Credit Card: 284,807 transactions, 0.172% fraud (581:1 imbalance)
  • IEEE-CIS Fraud: 590,540 transactions, different feature schema (434 features)

Data Splits

  • 5 training regimes: 5%, 10%, 25%, 50%, 100%
  • Fixed test set (20%) across all experiments
  • Stratified sampling to maintain fraud ratio
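
A minimal sketch of this split protocol (creditcard.csv and the Class label column follow the public Kaggle release; the repo's scripts perform the actual splitting):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")                      # European credit-card dataset
X, y = df.drop(columns=["Class"]), df["Class"]

# Fixed, stratified 20% test set shared by every experiment
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Stratified sub-samples of the training pool for the 5% -> 100% regimes
splits = {}
for frac in [0.05, 0.10, 0.25, 0.50, 1.00]:
    if frac < 1.0:
        X_frac, _, y_frac, _ = train_test_split(
            X_pool, y_pool, train_size=frac, stratify=y_pool, random_state=42)
    else:
        X_frac, y_frac = X_pool, y_pool
    splits[frac] = (X_frac, y_frac)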

Models

  1. XGBoost: Gradient boosting with scale_pos_weight for imbalance (see the sketch after this list)
  2. Vanilla BERT: bert-base-uncased with no fraud-specific pretraining
  3. Fraud-BERT: bert-base-uncased + MLM pretraining on IEEE fraud text
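
As a hedged illustration of the imbalance handling in item 1, the snippet below shows how scale_pos_weight is typically derived from the training labels; train_xgboost.py holds the project's actual configuration.

import numpy as np
from xgboost import XGBClassifier

def make_xgb(y_train: np.ndarray) -> XGBClassifier:
    # scale_pos_weight ~ (# negative samples) / (# positive samples); roughly 581 here
    spw = float((y_train == 0).sum()) / float((y_train == 1).sum())
    return XGBClassifier(n_estimators=300, eval_metric="aucpr", scale_pos_weight=spw)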

Evaluation

  • Primary Metric: PR-AUC (appropriate for 581:1 imbalance)
  • Secondary Metrics: ROC-AUC, F1, Precision, Recall
  • Early Stopping: Based on validation PR-AUC (patience=3)
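
For concreteness, these metrics map onto standard scikit-learn calls as sketched below; the arrays are placeholders and the notebooks hold the actual evaluation code.

import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1])              # placeholder test labels
y_score = np.array([0.1, 0.2, 0.8, 0.3, 0.6])   # placeholder fraud probabilities

pr_auc = average_precision_score(y_true, y_score)        # primary metric (PR-AUC)
roc_auc = roc_auc_score(y_true, y_score)                 # secondary
f1 = f1_score(y_true, (y_score >= 0.5).astype(int))      # secondary, at a 0.5 threshold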

Key Takeaways for Practitioners

1. When to Use Transfer Learning

- Use Fraud-BERT when: fraud_samples < 50  (+15% improvement)
- Consider when: 50 < fraud_samples < 200  (modest gains)
- Avoid when: fraud_samples > 200  (may hurt performance)

2. For Production Deployment

  • Recommended: XGBoost (0.87 PR-AUC, 2-min training, robust)
  • Cold-Start: Fraud-BERT (0.77 PR-AUC with 10% data, enables rapid deployment)
  • Trade-off: BERT 25× slower to train than XGBoost

3. XGBoost Universal Configuration

Validated across a 20× range of training-set sizes (the 5% → 100% splits):

{
    'max_depth': 6,
    'learning_rate': 0.05,
    'n_estimators': 300,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'min_child_weight': 5,
    'gamma': 0.2,
    'reg_alpha': 0.5,
    'reg_lambda': 0.1,
}
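
As a usage sketch, these settings can be passed directly to XGBClassifier; the training data below is a placeholder standing in for one of the stratified splits, and scale_pos_weight would normally be computed from the labels as shown earlier.

import numpy as np
from xgboost import XGBClassifier

universal_params = {
    'max_depth': 6, 'learning_rate': 0.05, 'n_estimators': 300,
    'subsample': 0.9, 'colsample_bytree': 0.7, 'min_child_weight': 5,
    'gamma': 0.2, 'reg_alpha': 0.5, 'reg_lambda': 0.1,
}

# Placeholder data (1,000 rows, 30 features, ~0.5% positives)
X_train = np.random.default_rng(0).normal(size=(1000, 30))
y_train = np.zeros(1000, dtype=int)
y_train[:5] = 1

clf = XGBClassifier(**universal_params, scale_pos_weight=581, eval_metric="aucpr")
clf.fit(X_train, y_train)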

Citations

Key references for this work:

  1. BERT: Devlin et al. (2019) - "BERT: Pre-training of Deep Bidirectional Transformers"
  2. XGBoost: Chen & Guestrin (2016) - "XGBoost: A Scalable Tree Boosting System"
  3. European Dataset: Dal Pozzolo et al. (2015) - "Calibrating Probability with Undersampling"
  4. IEEE Dataset: IEEE Computational Intelligence Society (2019) - Kaggle Competition
  5. PR-AUC: Saito & Rehmsmeier (2015) - "The Precision-Recall Plot Is More Informative"

Future Work

  1. Domain Adaptation: Mitigate negative transfer at full data scale

    • Adversarial domain alignment
    • Multi-task learning
    • Progressive fine-tuning
  2. Real-Time Deployment:

    • XGBoost for production (2-min training, 0.87 PR-AUC)
    • Fraud-BERT for cold-start scenarios (<50 fraud samples)
  3. Cross-Industry Validation:

    • Healthcare fraud
    • Insurance fraud
    • Payment fraud
  4. Explainability:

    • Attention visualization for BERT
    • SHAP values for XGBoost

License

This project was developed for academic purposes as part of the CS5787 Deep Learning course at Cornell University. The code is released under the MIT License (see LICENSE).


Acknowledgments

  • Course: CS5787 Deep Learning, Cornell University (Fall 2025)
  • Instructor: [Course Instructor Name]
  • Datasets: Kaggle European Credit Card & IEEE-CIS Fraud Detection datasets
  • Compute: Google Colab (T4 GPU), RunPod (A100 GPU)

Last Updated: November 29, 2025
Status: Transfer Learning Pipeline Complete
