Most ECG papers only report accuracy. This project asks what happens when the model is wrong.
Many ECG AI papers report very high accuracy on benchmark data, but accuracy counts every error the same: mixing up two similar rhythms is scored identically to missing a heart attack.
In real hospitals, those mistakes are not the same.
This project adds a danger-weighted error rate (DWE). I score errors by how bad they are for the patient, not just how often they happen. I train a 12-lead ECG model on PTB-XL, then check its mistakes against a simple danger table. I also look at who gets the worst errors (e.g. by age or sex).
- 4.1% of test ECGs had a critical error - either a missed heart attack or a false alarm. For a hospital doing 500 ECGs per day, that is about 20 of these errors per day.
- Danger rises with age: DWE climbs from 0.31 (under 40) to 0.55 (over 75).
- The model misses real heart attacks more often than it raises false alarms: 61 real MIs were labeled normal, while 27 normal ECGs were labeled MI.
I use a simple table: not all wrong answers are equally bad.
| True / Pred | NORM | MI | STTC | CD | HYP |
|---|---|---|---|---|---|
| NORM | - | 3 | 2 | 1 | 1 |
| MI | 3 | - | 2 | 1 | 1 |
| STTC | 2 | 2 | - | 1 | 1 |
| CD | 1 | 1 | 1 | - | 1 |
| HYP | 1 | 1 | 1 | 1 | - |
- 3 = Critical (missed MI or sending a healthy person to the cath lab)
- 2 = Moderate (missed ischemia, needs quick follow-up)
- 1 = Minor (wrong but lower immediate risk)
Danger-weighted error rate, where `confusion` is the 5×5 confusion matrix (rows = true class, columns = predicted class), `danger` is the table above, and `N` is the number of test ECGs:

```
DWE = sum(confusion[i][j] * danger[i][j] for all i, j) / N
```
Red borders in the figure mark critical errors (danger 3). Per-class DWE is highest for MI and NORM.
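The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's `safety_audit.py`; the `dwe` function name and the toy confusion matrix are made up for the example, but the danger table is the one from the README:

```python
import numpy as np

# Danger table from the README (rows = true class, cols = predicted class).
# Class order: NORM, MI, STTC, CD, HYP. Diagonal is zero: correct predictions cost nothing.
DANGER = np.array([
    [0, 3, 2, 1, 1],
    [3, 0, 2, 1, 1],
    [2, 2, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])

def dwe(confusion: np.ndarray, danger: np.ndarray = DANGER) -> float:
    """Danger-weighted error rate: danger-weighted sum of all cells over N."""
    return float((confusion * danger).sum() / confusion.sum())

# Toy confusion matrix: mostly correct, plus 5 missed MIs and 5 minor CD/STTC mix-ups.
cm = np.diag([90, 20, 20, 20, 10])
cm[1, 0] = 5   # 5 true MIs predicted NORM (critical, danger 3)
cm[3, 2] = 5   # 5 true CDs predicted STTC (minor, danger 1)
print(dwe(cm))  # (5*3 + 5*1) / 170 ≈ 0.118
```

Note that correct predictions still count in `N`, so DWE is comparable across models evaluated on the same test set.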
| Model | Accuracy | Macro F1 | AUC | DWE | Critical errors |
|---|---|---|---|---|---|
| ResNet1D (12-lead) | 0.703 | 0.600 | 0.874 | 0.475 | 88 (4.1%) |
Older patients get more dangerous errors.
```
git clone https://github.com/youssof20/ecg-safety-auditor.git
cd ecg-safety-auditor
pip install -r requirements.txt
```

Download PTB-XL from: https://physionet.org/content/ptb-xl/1.0.3/
Put the contents in a `data/` folder so you have:

```
data/
  ptbxl_database.csv
  scp_statements.csv
  records100/
```
```
python -m src.data_pipeline   # Step 1: load and split data
python -m src.train           # Step 2: train the model (takes a while on CPU)
python -m src.safety_audit    # Step 3: compute DWE and subgroups
python -m src.visualize       # Step 4: save the figures
```

To launch the interactive viewer:

```
python -m streamlit run app.py
```

Project layout:

```
ecg-safety-auditor/
  app.py               - Streamlit app (viewer, results, subgroups)
  requirements.txt
  README.md
  LICENSE
  .gitignore
  src/
    data_pipeline.py   - Load PTB-XL, labels, train/val/test split
    models.py          - ResNet1D 12-lead model
    train.py           - Training with class weights and early stop
    safety_audit.py    - DWE and subgroup analysis
    visualize.py       - Script that makes the 5 figures
  outputs/
    figures/           - PNGs (in git)
    results/           - JSON/CSV (in git; no .npy)
  data/                - PTB-XL goes here (not in git)
```
Accuracy alone is not enough to judge clinical AI. A 70% accurate model that often misses MIs in older men is worse in practice than a 65% model that mostly makes small mistakes. DWE is one simple way to see that difference.
The age pattern (DWE 0.31 to 0.55) suggests the model works less well for older patients, whose ECGs are often more complex. Breaking results down by subgroup (age, sex) makes that visible; overall accuracy alone would hide it.
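The subgroup audit amounts to computing DWE separately per group. Since DWE is just the mean danger weight of each sample's (true, predicted) pair, no per-group confusion matrix is needed. A minimal sketch, assuming per-sample label arrays and hypothetical age-bin strings (the function name and data are illustrative, not the repo's API):

```python
import numpy as np

# Danger table from the README (rows = true, cols = predicted); order NORM, MI, STTC, CD, HYP.
DANGER = np.array([
    [0, 3, 2, 1, 1],
    [3, 0, 2, 1, 1],
    [2, 2, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])

def dwe_per_group(y_true, y_pred, groups, danger=DANGER):
    """DWE per subgroup: mean danger weight of each sample's (true, pred) pair."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {g: float(danger[y_true[groups == g], y_pred[groups == g]].mean())
            for g in np.unique(groups)}

# Toy data with hypothetical age bins (0=NORM, 1=MI, 2=STTC).
y_true = np.array([1, 1, 0, 0, 2, 2])
y_pred = np.array([0, 1, 0, 1, 2, 0])   # one missed MI, one MI false alarm, one missed STTC
ages   = np.array(["<40", "<40", "<40", "75+", "75+", "75+"])
print(dwe_per_group(y_true, y_pred, ages))
```

The same function works for any grouping variable (sex, device, recording site) by swapping the `groups` array.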
Before using any ECG model in a real clinic, it should be checked not only for accuracy but for where and how badly it fails.
- PTB-XL is from one center (Leipzig). It may not match other hospitals or devices.
- The danger table is based on clinical judgment, not a formal study. Others might use different numbers.
- The HYP class has only 83 test samples, so its F1 (0.22) is uncertain.
- I only evaluated one model. Comparing DWE across several models would be more informative.
Youssof Sallam. ECG Safety Auditor (2025).
https://github.com/youssof20/ecg-safety-auditor
ECG data from PhysioNet - PTB-XL (Wagner et al., Scientific Data, 2020).