ECG Safety Auditor


Most ECG papers only report accuracy. This project asks what happens when the model is wrong.


The Problem

Lots of ECG AI papers report very high accuracy on benchmark data. But accuracy counts every error the same. Mixing up two similar rhythms is treated the same as missing a heart attack.

In real hospitals, those mistakes are not the same.

This project adds a danger-weighted error rate (DWE). I score errors by how bad they are for the patient, not just how often they happen. I train a 12-lead ECG model on PTB-XL, then check its mistakes against a simple danger table. I also look at who gets the worst errors (e.g. by age or sex).


Main Results

  • 4.1% of test ECGs (88 recordings) had a critical error: either a heart attack (MI) read as normal, or a normal ECG read as MI. For a hospital doing 500 ECGs per day, that is about 20 such errors per day.
  • Danger goes up with age. DWE rises from 0.31 (under 40) to 0.55 (over 75).
  • The model misses real heart attacks more often than it raises false alarms: 61 real MIs were called normal, while 27 normal ECGs were called MI.
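The headline numbers can be cross-checked with a few lines of arithmetic. This is only a sanity check on the figures quoted in this README; the variable names are my own.

```python
# Figures quoted in this README.
missed_mi = 61          # real MIs called normal
false_alarm = 27        # normal ECGs called MI
critical_rate = 0.041   # share of test ECGs with a critical error
ecgs_per_day = 500      # example hospital volume

# Total critical errors, expected daily count, and share that are missed MIs.
critical_errors = missed_mi + false_alarm
expected_per_day = critical_rate * ecgs_per_day
missed_share = missed_mi / critical_errors

print(critical_errors)             # 88
print(round(expected_per_day, 1))  # 20.5
print(round(missed_share, 2))      # 0.69
```

So roughly two out of three critical errors are missed heart attacks rather than false alarms.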

The Danger Matrix

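I use a simple table: not all wrong answers are equally bad. Rows are the true class and columns are the predicted class, using the five PTB-XL diagnostic superclasses: NORM (normal), MI (myocardial infarction), STTC (ST/T change), CD (conduction disturbance) and HYP (hypertrophy).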

True / Pred  NORM  MI    STTC  CD    HYP
NORM         -     3     2     1     1
MI           3     -     2     1     1
STTC         2     2     -     1     1
CD           1     1     1     -     1
HYP          1     1     1     1     -
  • 3 = Critical (missed MI or sending a healthy person to the cath lab)
  • 2 = Moderate (confusing ST/T changes with NORM or MI, e.g. missed ischemia; needs quick follow-up)
  • 1 = Minor (wrong but lower immediate risk)

Danger-weighted error rate:

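DWE = sum(confusion[i][j] * danger[i][j]) / N

Here confusion[i][j] is the number of test ECGs with true class i that were predicted as class j, danger[i][j] is the matching entry of the table above (correct predictions on the diagonal count as 0), and N is the total number of test ECGs.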
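A minimal sketch of the formula in plain Python, assuming single-label predictions and a diagonal danger of 0. The function name `danger_weighted_error` and the toy confusion matrix are made up for illustration; this is not necessarily how `src/safety_audit.py` does it.

```python
# Danger table from this README, in the order NORM, MI, STTC, CD, HYP.
# Rows are the true class, columns the predicted class.
# Assumption: correct predictions (the diagonal) carry zero danger.
CLASSES = ["NORM", "MI", "STTC", "CD", "HYP"]
DANGER = [
    [0, 3, 2, 1, 1],
    [3, 0, 2, 1, 1],
    [2, 2, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]


def danger_weighted_error(confusion, danger=DANGER):
    """Return sum(confusion[i][j] * danger[i][j]) / N, where N is all samples."""
    n = sum(sum(row) for row in confusion)
    if n == 0:
        raise ValueError("confusion matrix is empty")
    weighted = sum(
        confusion[i][j] * danger[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / n


# Toy confusion matrix with made-up counts (NOT the project's results).
toy_confusion = [
    [80, 2, 5, 3, 0],
    [4, 40, 3, 2, 1],
    [6, 2, 30, 1, 1],
    [2, 1, 1, 20, 1],
    [1, 0, 1, 1, 5],
]
toy_dwe = danger_weighted_error(toy_confusion)
print(round(toy_dwe, 3))
```

A perfect model scores 0. A model whose every prediction is a critical error would score 3, so DWE lives on a 0 to 3 scale rather than 0 to 1.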

Results

Confusion matrix with danger overlay

Red borders = critical errors (danger 3).

[Figure: confusion matrix with danger overlay]

Danger-weighted error by class

MI and NORM have the highest DWE.

[Figure: per-class DWE]

Summary table

Model               Accuracy  Macro F1  AUC    DWE    Critical errors
ResNet1D (12-lead)  0.703     0.600     0.874  0.475  88 (4.1%)

DWE by age group

Older patients get more dangerous errors.

[Figure: DWE by age group]
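A subgroup DWE like this can be sketched by grouping test records before averaging the danger scores. Everything here is illustrative: the `danger` rule is a compact restatement of the danger table, the bin edges between 40 and 75 are my assumption (this README only names "under 40" and "over 75"), and the records are toy data.

```python
from collections import defaultdict


def danger(true_label, pred_label):
    """Danger score for one prediction, following the table in this README."""
    if true_label == pred_label:
        return 0
    pair = {true_label, pred_label}
    if pair == {"NORM", "MI"}:
        return 3  # critical: missed MI or false MI alarm
    if "STTC" in pair and pair & {"NORM", "MI"}:
        return 2  # moderate: STTC confused with NORM or MI
    return 1      # minor: every other mix-up


def age_group(age):
    """Hypothetical age bins; only '<40' and '>75' are named in the README."""
    if age < 40:
        return "<40"
    if age < 60:
        return "40-59"
    if age <= 75:
        return "60-75"
    return ">75"


def dwe_by_group(records):
    """records: iterable of (age, true_label, pred_label). Returns DWE per bin."""
    totals = defaultdict(lambda: [0, 0])  # bin -> [danger sum, sample count]
    for age, true_label, pred_label in records:
        group = age_group(age)
        totals[group][0] += danger(true_label, pred_label)
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


# Toy records (made up, NOT taken from PTB-XL).
toy_records = [
    (35, "NORM", "NORM"),
    (38, "MI", "STTC"),
    (52, "NORM", "NORM"),
    (80, "MI", "NORM"),
    (82, "CD", "CD"),
    (79, "STTC", "HYP"),
]
toy_groups = dwe_by_group(toy_records)
print(toy_groups)
```

The same function works for sex or any other attribute if `age_group` is swapped for a different grouping function.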

Critical errors: missed MI vs false alarm

[Figure: critical error breakdown]
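The two kinds of critical error can be counted directly from true and predicted labels. This is a small sketch with a hypothetical helper name and toy labels, not the repository's own code.

```python
from collections import Counter


def critical_error_breakdown(y_true, y_pred):
    """Count the two danger-3 cases: MI called NORM, and NORM called MI."""
    counts = Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == "MI" and pred_label == "NORM":
            counts["missed_mi"] += 1
        elif true_label == "NORM" and pred_label == "MI":
            counts["false_alarm"] += 1
    return {"missed_mi": counts["missed_mi"], "false_alarm": counts["false_alarm"]}


# Toy labels (made up). The STTC -> MI case is danger 2, so it is not counted.
toy_true = ["MI", "MI", "NORM", "NORM", "STTC", "MI"]
toy_pred = ["NORM", "MI", "MI", "NORM", "MI", "NORM"]
toy_breakdown = critical_error_breakdown(toy_true, toy_pred)
print(toy_breakdown)
```

Keeping the two counts separate matters because they have different costs: a missed MI delays treatment, while a false alarm triggers an unnecessary work-up.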

DWE by sex

[Figure: DWE by sex]


Streamlit app screenshot

[Figure: screenshot of the Streamlit app]


How to run it

1. Clone and install

git clone https://github.com/youssof20/ecg-safety-auditor.git
cd ecg-safety-auditor
pip install -r requirements.txt

2. Get the data (PTB-XL)

Download from: https://physionet.org/content/ptb-xl/1.0.3/

Put the contents in a data/ folder so you have:

data/
  ptbxl_database.csv
  scp_statements.csv
  records100/
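Before running the pipeline, a quick check that the three entries above are in place can save a failed run. This is a hypothetical helper, not part of the repository; it only tests that the paths exist, not that the files are complete.

```python
from pathlib import Path

# Entries this README expects directly under data/.
EXPECTED = ["ptbxl_database.csv", "scp_statements.csv", "records100"]


def missing_data_entries(data_dir="data"):
    """Return the expected entries that are not present under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


missing = missing_data_entries()
if missing:
    print("Missing from data/:", ", ".join(missing))
else:
    print("data/ layout looks complete")
```

Run it from the repository root so that the relative `data/` path resolves correctly.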

3. Run the pipeline

python -m src.data_pipeline   # Step 1: load and split data
python -m src.train           # Step 2: train the model (takes a while on CPU)
python -m src.safety_audit    # Step 3: compute DWE and subgroups
python -m src.visualize       # Step 4: save the figures
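The four steps are meant to run in order, so they can also be chained from one script that stops at the first failure. This is a hypothetical convenience wrapper (the repository does not ship one); it runs the same four commands listed above.

```python
import subprocess
import sys

# The four modules from this README, in pipeline order.
STEPS = ["src.data_pipeline", "src.train", "src.safety_audit", "src.visualize"]


def pipeline_commands(python=sys.executable):
    """Build one `python -m <module>` command per pipeline step."""
    return [[python, "-m", module] for module in STEPS]


def run_pipeline():
    """Run each step in order; check=True raises CalledProcessError on failure."""
    for cmd in pipeline_commands():
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Call run_pipeline() from the repository root to execute all four steps.
```

Using `sys.executable` keeps every step inside the same Python environment that has the requirements installed.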

4. Open the app

python -m streamlit run app.py

Project layout

ecg-safety-auditor/
  app.py                  - Streamlit app (viewer, results, subgroups)
  requirements.txt
  README.md
  LICENSE
  .gitignore
  src/
    data_pipeline.py      - Load PTB-XL, labels, train/val/test split
    models.py             - ResNet1D 12-lead model
    train.py              - Training with class weights and early stop
    safety_audit.py       - DWE and subgroup analysis
    visualize.py          - Script that makes the 5 figures
  outputs/
    figures/              - PNGs (in git)
    results/              - JSON/CSV (in git; no .npy)
  data/                   - PTB-XL goes here (not in git)

Why this matters

Accuracy alone is not enough to judge clinical AI. A 70% accurate model that often misses MIs in older men is worse in practice than a 65% accurate model that mostly makes small mistakes. DWE is one simple way to see that difference.

The age pattern (DWE 0.31 to 0.55) suggests the model works less well for older patients, possibly because their ECGs are often more complex. Looking at subgroups (age, sex) makes that visible. Overall accuracy would not.

Before using any ECG model in a real clinic, it should be checked not only for accuracy but for where and how badly it fails.


Limitations

  • PTB-XL comes from a single source (the Physikalisch-Technische Bundesanstalt in Germany). It may not match other hospitals or devices.
  • The danger table is based on clinical judgment, not a formal study. Others might use different numbers.
  • The HYP class has only 83 test samples, so its F1 (0.22) is uncertain.
  • I only evaluated one model. Comparing DWE across several models would be more informative.

Citation

Youssof Sallam. ECG Safety Auditor (2025).
https://github.com/youssof20/ecg-safety-auditor


Thanks

ECG data from PhysioNet: PTB-XL (Wagner et al., Scientific Data, 2020).
