A machine learning project comparing supervised and unsupervised fraud detection across two datasets: automobile insurance claims (Fraud Oracle) and Medicare provider billing (Healthcare Fraud). Models include Logistic Regression, Neural Network, XGBoost, and Isolation Forest.
ML-Project/
├── dataset/
│ ├── fraud_oracle_preprocessed.csv # Preprocessed Oracle dataset
│ └── healthcare_fraud_preprocessed.csv # Aggregated provider-level dataset
├── models/
│ ├── LogisticRegression.py # Healthcare — Logistic Regression
│ ├── NeuralNetworks.py # Healthcare — Neural Network (k-fold tuned)
│ ├── XGBoost.py # Healthcare — XGBoost (k-fold tuned)
│ └── IsolationForest.py # Healthcare — Isolation Forest (unsupervised)
├── utils/
│ ├── preprocess_fraud_oracle.py # Oracle feature engineering
│ ├── preprocess_healthcare_fraud.py # Healthcare join + aggregation pipeline
│ └── visualize.py # Generate report figures → figures/
├── fraud_oracle_results.txt # Full Oracle results for all models
├── healthcare_results.txt # Full Healthcare results for all models
├── requirements.txt
└── README.md
Source: Vehicle Insurance Fraud Detection — Figshare
Rows: 15,420 insurance claims | Target: FraudFound_P (binary)
Fraud rate: ~6% (highly imbalanced)
Data split: 80/10/20 stratified — 11,102 train / 1,234 validation / 3,084 test (185 fraud in test)
Preprocessing (utils/preprocess_fraud_oracle.py):
- Ordinal string ranges (age, price, vehicle age) mapped to numeric midpoints
- Cyclical sin/cos encoding for month and day-of-week features to preserve periodicity
- Derived features:
Driver_Vehicle_Age_Diff,PolicyHolder_Driver_Age_Diff,Days_Accident_to_Claim,Claim_Before_Accident,Has_Past_Claims,New_Vehicle,Young_Driver,High_Value_Vehicle - Cyclical weekend indicators:
Accident_Weekend,Claim_Weekend - Binary encoding for
PoliceReportFiled,WitnessPresent - Rep-level claim count (
Rep_Claim_Count) as a network frequency feature - Redundant binary flags removed (duplicates of categorical columns)
- Final feature set: 34 numerical + 9 categorical features
Source: Healthcare Provider Fraud Detection — Kaggle Structure: 4 relational CSV files joined by foreign keys
| File | Rows | Description |
|---|---|---|
| Train labels | 5,410 | Provider ID + PotentialFraud (Yes/No) |
| Beneficiary data | 138,556 | Patient demographics, chronic conditions |
| Inpatient claims | 40,474 | Hospital admissions with diagnosis/procedure codes |
| Outpatient claims | 517,737 | Outpatient visits with diagnosis codes |
Fraud rate: ~9.4% | Target: Provider-level fraud (one row per provider)
Data split (supervised models): 80/20 stratified — 4,328 train+val / 1,082 test (101 fraud in test); within train+val, 5-fold CV is used for hyperparameter tuning, then a 90/10 split for final training/early-stopping.
Preprocessing (utils/preprocess_healthcare_fraud.py):
The multi-file structure is handled as a relational join pipeline:
- Beneficiary engineering — compute patient age from DOB (reference date 2009-12-31), recode chronic conditions (1/2 → 1/0), flag deceased patients, compute
Chronic_Count, recodeRenalDiseaseIndicator(Y/0 → 1/0) - Claim-level engineering — compute
ClaimDuration,HospitalStay(inpatient only),DiagnosisCount,ProcedureCount,SamePhysician(attending == operating, a self-referral flag) - JOIN — attach beneficiary demographics to each claim via
BeneID(left join) - GROUP BY Provider — aggregate 558,211 combined claims into one row per provider with ~40 features
Key aggregated features include:
- Volume:
total_claims,ip_claims,op_claims,ip_ratio,claims_per_bene,unique_benes - Billing:
total_reimbursed,avg_claim_reimbursed,max_claim_reimbursed,total_deductible,reimbursed_per_bene - Temporal:
avg_claim_duration,max_claim_duration,avg_hospital_stay,max_hospital_stay - Network:
unique_attending,unique_operating,pct_same_physician,physicians_per_bene,avg_diagnosis_count,avg_procedure_count - Patient profile:
avg_bene_age,pct_deceased,avg_chronic_count,pct_renal_disease,avg_ip_annual_reimb,avg_op_annual_reimb,avg_months_partA,avg_months_partB, per-condition prevalence rates (12 chronic condition columns)
| Model | Architecture / Key Details |
|---|---|
| Logistic Regression | PyTorch nn.Linear(n_features, 1), weighted BCE loss (pos_weight = n_neg/n_pos), Adam lr=0.001, early stopping on val ROC-AUC (patience=50), threshold optimised for F2 on validation set |
| Neural Network | Variable hidden layers + BatchNorm + ReLU + Dropout, weighted BCE, Adam with weight decay, early stopping on val F2 (patience=30/50), architecture and hyperparameters selected by 5-fold CV maximising F2 |
| XGBoost | binary:logistic, scale_pos_weight = n_neg/n_pos, tree_method=hist, early stopping on val AUC (30 rounds during CV, 50 rounds for final model), threshold optimised for F2 on validation set; hyperparameters selected by 5-fold CV maximising F2 |
| Model | Key Details |
|---|---|
| Isolation Forest | Trained on full training set (fraud and non-fraud), n_estimators=200, contamination=0.35 (high value chosen to maximise recall — flags ~35% of samples as anomalous), anomaly score = -score_samples() for ROC AUC evaluation |
Note on the Autoencoder: An Autoencoder (PyTorch encoder-decoder trained on non-fraud only, evaluated via reconstruction error) was explored as an additional unsupervised baseline on the Fraud Oracle dataset. It achieved ROC AUC 0.560 on Oracle, marginally above Isolation Forest (0.507). Neither method detected meaningful anomaly structure in the flat, per-claim Oracle rows. Both models underperformed significantly on that dataset and the Autoencoder was not carried forward to the Healthcare pipeline. Results are documented in
fraud_oracle_results.txt.
Neural Network (selected by 5-fold CV, metric: mean F2):
| Hyperparameter | Best Value |
|---|---|
| Hidden sizes | (64, 32, 16) |
| Dropout | 0.2 |
| Learning rate | 0.001 |
| Weight decay | 1e-4 |
| Best CV mean F2 | 0.7509 |
XGBoost (selected by 5-fold CV over 72 combinations, metric: mean F2):
| Hyperparameter | Best Value |
|---|---|
| max_depth | 6 |
| learning_rate | 0.05 |
| subsample | 0.9 |
| min_child_weight | 3 |
| reg_lambda | 3.0 |
| n_estimators | 1000 (early stopped at 33) |
| Best CV mean F2 | 0.7670 |
- Split: 80% train+val / 20% test (stratified), held-out test never touched during tuning
- Scaling:
StandardScalerfit on training fold only, applied to val/test — prevents leakage - Class imbalance:
pos_weight = n_neg / n_posin BCE loss;scale_pos_weightin XGBoost - Thresholds: Single decision threshold selected by maximising F2 on the validation set. F2 weights recall twice as heavily as precision (
β=2), appropriate for fraud detection where missing a fraudster is more costly than a false alarm. Threshold search over 50 values in [0.05, 0.95]. - Tuning metric: F2 (Healthcare); F1 was used in earlier Oracle experiments before the project focus shifted
ROC curves generated on a shared test split (random_state=42). Supervised model AUCs are slightly lower than the table values below because
visualize.pyuses an unstratified split to match the original Isolation Forest evaluation. Relative ordering and the supervised vs unsupervised gap are preserved.
Models were evaluated on the held-out test set of 3,084 claims (185 fraud, ~6%). Thresholds were chosen to maximise F1 (Oracle experiments). XGBoost results are shown for the recall-priority threshold (target recall ≥ 0.91) as it best captures fraud in a high-class-imbalance setting.
| Model | ROC AUC | Recall | Precision | F1 | F2 |
|---|---|---|---|---|---|
| Logistic Regression | 0.783 | 0.935 | 0.120 | 0.212 | 0.391 |
| Neural Network | 0.810 | 0.481 | 0.192 | 0.274 | 0.373 |
| XGBoost (recall-priority) | 0.848 | 0.946 | 0.137 | 0.240 | 0.434 |
| XGBoost (max-F1 threshold) | 0.848 | 0.400 | 0.220 | 0.284 | 0.344 |
| Isolation Forest (unsupervised) | 0.507 | 0.568 | 0.068 | 0.121 | 0.230 |
| Autoencoder (unsupervised) | 0.560 | 0.481 | 0.071 | 0.124 | 0.223 |
Oracle confusion matrices (test set: 2,899 non-fraud / 185 fraud):
| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| Logistic Regression | 1,625 | 1,274 | 12 | 173 |
| Neural Network | 2,524 | 375 | 96 | 89 |
| XGBoost (recall-priority) | 1,798 | 1,101 | 10 | 175 |
| XGBoost (max-F1) | 2,636 | 263 | 111 | 74 |
| Isolation Forest | 1,457 | 1,442 | 80 | 105 |
Models evaluated on the held-out test set of 1,082 providers (101 fraud, ~9.4%). Thresholds chosen to maximise F2.
| Model | ROC AUC | Recall | Precision | F1 | F2 | Threshold |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.9618 | 0.8515 | 0.5513 | 0.6693 | 0.7679 | 0.7112 |
| Neural Network | 0.9636 | 0.8416 | 0.5862 | 0.6911 | 0.7741 | 0.7663 |
| XGBoost | 0.9703 | 0.9406 | 0.5108 | 0.6620 | 0.8051 | 0.1418 |
| Isolation Forest (unsupervised) | 0.7918 | 0.7714 | 0.2104 | 0.3306 | 0.5031 | — |
Healthcare confusion matrices (test set: 981 non-fraud / 101 fraud):
| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| Logistic Regression | 911 | 70 | 15 | 86 |
| Neural Network | 921 | 60 | 16 | 85 |
| XGBoost | 890 | 91 | 6 | 95 |
| Isolation Forest | 673 | 304 | 24 | 81 |
-
Provider-level aggregation unlocks signals that claim-level data cannot. The healthcare dataset groups 558k claims into 5,410 provider rows, exposing billing patterns (volume, reimbursement rates, self-referral frequency) that are invisible at the individual claim level. Combined with temporal features (claim duration, hospital stay) and network features (unique physicians, patients per provider), this gives models a fundamentally richer picture of fraud than Oracle's flat, unlinked claim rows — and is the primary reason healthcare models achieve ROC AUC ~0.97 vs ~0.85 on Oracle.
-
Fraud Oracle hit a structural ceiling. All supervised models clustered between 0.78–0.85 ROC AUC regardless of architecture or tuning. The dataset has no linking IDs across claims, so it is impossible to aggregate a claimant's billing history or build temporal/network features — every claim is evaluated in isolation with no context about past behaviour.
-
XGBoost consistently outperforms neural networks on tabular data. Tree models are optimised for axis-aligned feature splits, which is how fraud manifests in structured records. The NN performance gap was smaller on the healthcare dataset (0.007 ROC AUC gap) than on Oracle (0.038 gap), as richer provider-level features reduce the NN's relative disadvantage by giving it more discriminative signal.
-
Unsupervised methods reflect the data's anomaly structure. On Oracle, Isolation Forest ROC AUC was 0.507 (effectively random) — fraud cases are not statistical outliers in a flat per-claim row. On Healthcare it achieved 0.792 — fraudulent providers genuinely bill anomalously when their full claim history is aggregated. The unsupervised performance gap directly measures how "detectable without labels" the fraud is in each dataset's feature space.
-
Feature engineering from relational joins is the highest-leverage action.
total_reimbursedalone accounted for ~30.8% of XGBoost feature importance on the healthcare dataset. This feature required joining 4 CSV files and aggregating 558k claims — it cannot be derived from a flat single-table dataset. No amount of modelling sophistication compensates for missing this aggregation step.
-
Class weights matter less when fraud is easier to separate. On Healthcare (9.4% fraud rate, strong billing signals),
pos_weighttuning had minimal impact on ROC AUC (+0.003). On Oracle (6% fraud rate, weak signals), class weighting had a more meaningful effect on recall. -
XGBoost's low decision threshold on Healthcare (0.14) reflects extreme recall optimisation. The F2-maximising threshold of 0.14 means XGBoost flags a provider as fraudulent whenever its predicted probability exceeds 14%. This results in 91 false positives (non-fraudulent providers incorrectly flagged) against only 6 missed fraud providers in the 1,082-provider test set — an appropriate trade-off when the cost of missing fraud far exceeds the cost of a false investigation.
-
The Autoencoder provides marginal uplift over Isolation Forest on Oracle (ROC AUC 0.560 vs 0.507) but both are near-random. An ensemble of the two achieves ROC AUC 0.509, offering no meaningful gain. When the underlying feature space contains no anomaly structure, combining weak anomaly detectors does not recover discriminative power.
-
Oracle models are historical. The model scripts in
models/target only the Healthcare dataset. Oracle results infraud_oracle_results.txtwere produced from earlier scripts that have since been removed. Oracle results cannot be reproduced from the current codebase without re-implementing those scripts. -
No cross-dataset generalisation testing. Models trained on one dataset were not evaluated on the other. It is unknown whether billing-behaviour features learned from Medicare generalise to vehicle insurance, or vice versa.
-
Isolation Forest contamination is manually set. The
contamination=0.35parameter for Healthcare IF was chosen to maximise recall rather than matching the actual fraud rate (~9.4%). This inflates recall/F2 at the expense of precision and makes thepredict()output incomparable to the supervised models without threshold adjustment. The ROC AUC (computed from continuousscore_samples) is unaffected by this choice. -
Oracle feature ceiling. The Fraud Oracle dataset provides no shared identifiers between claims, making cross-claim aggregation impossible. The ~0.85 ROC AUC ceiling likely reflects this structural limitation rather than model inadequacy.
-
Single train/test split. Final test results are reported on a single 20% stratified split (random_state=42). With small minority-class counts (101 fraud in Healthcare test, 185 in Oracle test), reported metrics carry meaningful variance. Bootstrap confidence intervals were not computed.
-
Temporal leakage not fully controlled. The Oracle dataset has a
Yearcolumn but the split is not time-based. Claims from later years may appear in training when earlier-year claims are in the test set. The Healthcare data is from a single year (2009), so this does not apply there.
pip install -r requirements.txt# Step 1: Preprocess raw Oracle CSV
python utils/preprocess_fraud_oracle.pyThe Oracle model scripts are not in the current repository. Results are documented in
fraud_oracle_results.txt.
# Step 1: Preprocess — joins 4 CSV files and aggregates to provider level
python utils/preprocess_healthcare_fraud.py
# Step 2: Run individual models
python models/LogisticRegression.py
python models/NeuralNetworks.py
python models/XGBoost.py
python models/IsolationForest.py
# Step 3 (optional): Generate report figures → saved to figures/
python utils/visualize.py- Python 3.10+
- PyTorch ≥ 2.0
- pandas ≥ 2.0
- NumPy ≥ 1.24
- scikit-learn ≥ 1.3
- xgboost ≥ 2.0


