Most ECG papers only report accuracy. This project asks what happens when the model is wrong.
Many ECG AI papers report very high accuracy on benchmark data, but accuracy counts every error the same: mixing up two similar rhythms is scored identically to missing a heart attack.
In real hospitals, those mistakes are not the same.
This project adds a danger-weighted error rate (DWE). I score errors by how bad they are for the patient, not just how often they happen. I train a 12-lead ECG model on PTB-XL, then check its mistakes against a simple danger table. I also look at who gets the worst errors (e.g. by age or sex).
- 4.1% of test ECGs had a critical error - either a missed heart attack or a false alarm. For a hospital doing 500 ECGs per day, that is about 20 of these errors per day.
- Danger rises with age: DWE climbs from 0.31 (under 40) to 0.55 (over 75).
- The model misses real heart attacks more often than it raises false alarms: 61 real MIs were labeled normal, while 27 normal ECGs were labeled MI.
I use a simple table: not all wrong answers are equally bad.
| True / Pred | NORM | MI | STTC | CD | HYP |
|---|---|---|---|---|---|
| NORM | - | 3 | 2 | 1 | 1 |
| MI | 3 | - | 2 | 1 | 1 |
| STTC | 2 | 2 | - | 1 | 1 |
| CD | 1 | 1 | 1 | - | 1 |
| HYP | 1 | 1 | 1 | 1 | - |
- 3 = Critical (missed MI or sending a healthy person to the cath lab)
- 2 = Moderate (missed ischemia, needs quick follow-up)
- 1 = Minor (wrong but lower immediate risk)
Danger-weighted error rate, where `confusion` is the 5×5 confusion matrix (rows = true class, columns = predicted class), `danger` is the table above, and `N` is the number of test ECGs:

```
DWE = sum(confusion[i][j] * danger[i][j] for all i, j) / N
```
Red borders in the figure mark critical errors (danger 3). Per-class DWE is highest for MI and NORM.
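The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's `safety_audit.py`; the `dwe` function name and the toy confusion matrix are made up for the example, but the danger table is the one from the README:

```python
import numpy as np

# Danger table from the README (rows = true class, cols = predicted class).
# Class order: NORM, MI, STTC, CD, HYP. Diagonal is zero: correct predictions cost nothing.
DANGER = np.array([
    [0, 3, 2, 1, 1],
    [3, 0, 2, 1, 1],
    [2, 2, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])

def dwe(confusion: np.ndarray, danger: np.ndarray = DANGER) -> float:
    """Danger-weighted error rate: danger-weighted sum of all cells over N."""
    return float((confusion * danger).sum() / confusion.sum())

# Toy confusion matrix: mostly correct, plus 5 missed MIs and 5 minor CD/STTC mix-ups.
cm = np.diag([90, 20, 20, 20, 10])
cm[1, 0] = 5   # 5 true MIs predicted NORM (critical, danger 3)
cm[3, 2] = 5   # 5 true CDs predicted STTC (minor, danger 1)
print(dwe(cm))  # (5*3 + 5*1) / 170 ≈ 0.118
```

Note that correct predictions still count in `N`, so DWE is comparable across models evaluated on the same test set.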
| Model | Accuracy | Macro F1 | AUC | DWE | Critical errors |
|---|---|---|---|---|---|
| ResNet1D (12-lead) | 0.703 | 0.600 | 0.874 | 0.475 | 88 (4.1%) |
Older patients get more dangerous errors.
```
git clone https://github.com/youssof20/ecg-safety-auditor.git
cd ecg-safety-auditor
pip install -r requirements.txt
```

Download PTB-XL from: https://physionet.org/content/ptb-xl/1.0.3/
Put the contents in a `data/` folder so you have:

```
data/
  ptbxl_database.csv
  scp_statements.csv
  records100/
```
```
python -m src.data_pipeline   # Step 1: load and split data
python -m src.train           # Step 2: train the model (takes a while on CPU)
python -m src.safety_audit    # Step 3: compute DWE and subgroups
python -m src.visualize       # Step 4: save the figures
```

To launch the interactive viewer:

```
python -m streamlit run app.py
```

Project layout:

```
ecg-safety-auditor/
  app.py               - Streamlit app (viewer, results, subgroups)
  requirements.txt
  README.md
  LICENSE
  .gitignore
  src/
    data_pipeline.py   - Load PTB-XL, labels, train/val/test split
    models.py          - ResNet1D 12-lead model
    train.py           - Training with class weights and early stop
    safety_audit.py    - DWE and subgroup analysis
    visualize.py       - Script that makes the 5 figures
  outputs/
    figures/           - PNGs (in git)
    results/           - JSON/CSV (in git; no .npy)
  data/                - PTB-XL goes here (not in git)
```
Accuracy alone is not enough to judge clinical AI. A 70% accurate model that often misses MIs in older men is worse in practice than a 65% model that mostly makes small mistakes. DWE is one simple way to see that difference.
The age pattern (DWE 0.31 to 0.55) suggests the model works less well for older patients, whose ECGs are often more complex. Breaking results down by subgroup (age, sex) makes that visible; overall accuracy alone would hide it.
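The subgroup audit amounts to computing DWE separately per group. Since DWE is just the mean danger weight of each sample's (true, predicted) pair, no per-group confusion matrix is needed. A minimal sketch, assuming per-sample label arrays and hypothetical age-bin strings (the function name and data are illustrative, not the repo's API):

```python
import numpy as np

# Danger table from the README (rows = true, cols = predicted); order NORM, MI, STTC, CD, HYP.
DANGER = np.array([
    [0, 3, 2, 1, 1],
    [3, 0, 2, 1, 1],
    [2, 2, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])

def dwe_per_group(y_true, y_pred, groups, danger=DANGER):
    """DWE per subgroup: mean danger weight of each sample's (true, pred) pair."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {g: float(danger[y_true[groups == g], y_pred[groups == g]].mean())
            for g in np.unique(groups)}

# Toy data with hypothetical age bins (0=NORM, 1=MI, 2=STTC).
y_true = np.array([1, 1, 0, 0, 2, 2])
y_pred = np.array([0, 1, 0, 1, 2, 0])   # one missed MI, one MI false alarm, one missed STTC
ages   = np.array(["<40", "<40", "<40", "75+", "75+", "75+"])
print(dwe_per_group(y_true, y_pred, ages))
```

The same function works for any grouping variable (sex, device, recording site) by swapping the `groups` array.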
Before using any ECG model in a real clinic, it should be checked not only for accuracy but for where and how badly it fails.
- PTB-XL is from one center (Leipzig). It may not match other hospitals or devices.
- The danger table is based on clinical judgment, not a formal study. Others might use different numbers.
- The HYP class has only 83 test samples, so its F1 (0.22) is uncertain.
- I only evaluated one model. Comparing DWE across several models would be more informative.
Youssof Sallam. ECG Safety Auditor (2025).
https://github.com/youssof20/ecg-safety-auditor
ECG data from PhysioNet - PTB-XL (Wagner et al., Scientific Data, 2020).