Two-Stage Domain Adversarial Learning to Identify Chagas Disease from ECG and Patient Demographic Data

Team members: Xiaoyu Wang, Aron Syversen, Zixuan Ding, James Battye, Sharon Yuen Shan Ho, David C Wong

Introduction

This repository contains the code for the submission by our team, CinCo Amigos, to the George B. Moody PhysioNet Challenge 2025. Our goal was to develop an automated, open-source algorithm to detect Chagas disease using electrocardiograms (ECGs) and patient demographic data.

Chagas disease is widely underdiagnosed due to limited serological test coverage. Large-scale automated ECG screening offers a promising solution. However, this task presents significant challenges, including:

- Significant Label Noise: The largest dataset (CODE-15%) contains unreliable self-reported labels, whereas the smaller datasets provide reliable annotations.
- Extreme Class Imbalance: The prevalence of the positive class is only 2%.
- Substantial Domain Shift: There is a noticeable performance drop between internal testing and public scoring metrics, indicating differences between data sources.

To address these challenges, we proposed a Two-Stage Domain Adversarial Learning approach. This framework combines a custom neural network architecture with noise-robust learning techniques, domain-adversarial methods, and advanced class-imbalance handling strategies.

Methodology Overview

Our approach follows a two-stage training paradigm, illustrated in Fig. 2 of the paper:

Stage 1: Pre-training with Noise and Domain Adaptation
- A custom neural network (an encoder based on ECGNeXt or SEResNet, plus a Meta Net for demographic covariates) is pre-trained on the large, noisy CODE-15% dataset.
- LMFLoss, a combination of Focal Loss and Label-Distribution-Aware Margin (LDAM) Loss, is used to handle class imbalance.
- Early Learning Regularization (ELR) is integrated to counteract label noise.
- Domain-Adversarial Neural Network (DANN) is employed, incorporating several external datasets (e.g., CSPC, PTB, etc., ignoring their diagnostic labels) as distinct domains to learn domain-invariant features. The encoder is trained to confuse a domain classifier, forcing it to learn domain-agnostic representations.
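
The gradient reversal idea behind DANN can be illustrated with a toy example. The sketch below is not the repository's implementation: it assumes a one-parameter encoder (feature = w * x), a one-parameter domain classifier, and hand-derived gradients, and all names (dann_step, domain_loss, lam, lr) are hypothetical. It only shows that the domain classifier descends the domain loss while the encoder receives the sign-flipped gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(w, v, x, domain):
    """Binary cross-entropy of the toy domain classifier on one sample."""
    p = sigmoid(v * (w * x))
    return -(domain * math.log(p) + (1 - domain) * math.log(1 - p))


def dann_step(w, v, x, domain, lam=1.0, lr=0.1):
    """One manual SGD step with gradient reversal.

    w: encoder weight, v: domain-classifier weight (both scalars, toy setup).
    lam: reversal strength (an assumed hyperparameter).
    """
    feature = w * x
    p = sigmoid(v * feature)
    d_logit = p - domain              # dLoss/dlogit for sigmoid + BCE
    grad_v = d_logit * feature        # domain classifier: ordinary gradient
    grad_feature = d_logit * v        # gradient flowing back into the encoder
    grad_w = -lam * grad_feature * x  # gradient reversal: flip the sign
    return w - lr * grad_w, v - lr * grad_v


w, v, x, domain = 1.0, 1.0, 1.0, 1
new_w, new_v = dann_step(w, v, x, domain)
print("loss before:", domain_loss(w, v, x, domain))
print("encoder update only:", domain_loss(new_w, v, x, domain))
print("classifier update only:", domain_loss(w, new_v, x, domain))
```

In this toy setting the encoder update alone raises the domain loss and the classifier update alone lowers it, which is the adversarial tension that pushes the encoder towards domain-agnostic features.
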

Stage 2: Fine-tuning with Preservation of Domain Generalisation
- The model is adapted using smaller, high-quality datasets (e.g., SaMi-Trop, PTB-XL).
- Feature Distillation is used, where the pre-trained encoder acts as a frozen "teacher" model guiding the "student" encoder during fine-tuning to retain domain generalisation capabilities.
- Alternatively, the pre-trained encoder is frozen, and only the classifier head is fine-tuned.
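
As a rough illustration of the feature distillation option, the fine-tuning objective can be thought of as the task loss plus a penalty for drifting away from the frozen teacher's features. The sketch below assumes a mean squared error penalty and a single weighting factor; the distance measure, the weight, and the function names are assumptions for illustration, not taken from the paper or team_code.py.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_objective(task_loss, student_feat, teacher_feat, weight=1.0):
    """Task loss plus a weighted penalty on the student/teacher feature gap.

    teacher_feat comes from the frozen pre-trained encoder, so only the
    student encoder would receive gradients from this term.
    """
    return task_loss + weight * mse(student_feat, teacher_feat)


# Identical features add no penalty; diverging features are penalised.
print(distillation_objective(0.5, [1.0, 2.0], [1.0, 2.0]))
print(distillation_objective(0.5, [1.0, 2.0], [1.0, 4.0]))
```

A larger weight keeps the student closer to the pre-trained representation, at the cost of less adaptation to the fine-tuning data.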

How to Run the Code

1. Environment Setup

You can set up the environment in one of two ways:

Using Docker (Recommended):
Build the Docker image:
```
docker build -t cinc2025-image .
```
This command uses the Dockerfile to build an image containing all dependencies. The Dockerfile also automatically downloads the required external datasets into the /challenge/downloaded_data directory within the image.

Using a Python Virtual Environment:
Create and activate a virtual environment (e.g., using venv or conda), then install the required dependencies:
```
pip install -r requirements.txt
```
With this option, the datasets are not downloaded for you: you must obtain them yourself and make them accessible to the scripts. The Dockerfile lists the HDF5 files that are required.
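
When working outside Docker, it can help to confirm that the HDF5 files are where you expect them before starting a long training run. The sketch below is a hypothetical helper (find_hdf5_files is not part of this repository) that lists HDF5 files under a directory. The .hdf5 and .h5 suffixes are an assumption, so check the Dockerfile for the actual file names.

```python
from pathlib import Path


def find_hdf5_files(data_dir):
    """Return the sorted paths of all .hdf5/.h5 files below data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{data_dir} is not a directory")
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in {".hdf5", ".h5"}
    )


if __name__ == "__main__":
    # "." is a placeholder; point this at your own data directory.
    for path in find_hdf5_files("."):
        print(path)
```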

2. Data Preparation (If not using pre-downloaded data in Docker)

Download Datasets: Obtain the required ECG datasets (CODE-15%, SaMi-Trop, PTB-XL, CSPC, PTB, etc.) from the PhysioNet Challenge 2025 website and other sources cited in the paper.
Preprocessing: Run the relevant data preparation scripts to convert the raw data into HDF5 or WFDB format. Scripts provided include prepare_code15_data.py, prepare_samitrop_data.py, prepare_ptbxl_data.py, and Prepare_External_data.py.
- Example for CODE-15% data to WFDB:
```
python prepare_code15_data.py -i <path_to_input_hdf5_files> -d <path_to_demographics_csv> -l <path_to_labels_csv> -o <output_wfdb_folder>
```
- Example for preparing external datasets into HDF5:
```
python Prepare_External_data.py --data_dir <raw_data_directory> --output_path <output_hdf5_file_path>
```

3. Training the Model

Use the train_model.py script to train the model. If running inside a Docker container, make sure to mount your local data and model folders.

```
python train_model.py -d <path_to_training_data_folder> -m <path_to_model_output_folder> -v
```

<path_to_training_data_folder>: Directory containing the training data files (likely the preprocessed .hdf5 files, judging by the Dockerfile and dataset.py).
<path_to_model_output_folder>: Directory where the trained model(s) will be saved.
-v: (Optional) Enable verbose output.

This script executes the train_model function in team_code.py, which implements the two-stage training strategy described in the paper.

4. Running the Model for Prediction

Use the run_model.py script to make predictions on new data.

```
python run_model.py -d <path_to_test_data_folder> -m <path_to_model_folder> -o <path_to_output_folder> -v
```

<path_to_test_data_folder>: Directory containing the data files for prediction (expects WFDB format .hea/.dat or .mat files).
<path_to_model_folder>: Directory containing the trained model(s) saved by train_model.py.
<path_to_output_folder>: Directory where the model's predictions (one .txt file per record) will be saved.
-v: (Optional) Enable verbose output.

This script calls the load_model and run_model functions from team_code.py.
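
Since run_model.py is expected to write one .txt file per record, a quick completeness check can catch records that failed silently. The sketch below is a hypothetical helper, not part of this repository. It assumes that records are identified by their .hea header files and that each output file shares the record's base name, which you should verify against your own output folder.

```python
from pathlib import Path


def missing_predictions(data_dir, output_dir):
    """Return record names with a .hea header but no matching .txt output."""
    records = {p.stem for p in Path(data_dir).rglob("*.hea")}
    outputs = {p.stem for p in Path(output_dir).rglob("*.txt")}
    return sorted(records - outputs)


if __name__ == "__main__":
    # Placeholder paths; replace with your own test data and output folders.
    missing = missing_predictions("test_data", "test_outputs")
    print(f"{len(missing)} record(s) without a prediction")
```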

5. Evaluating Model Performance

Use the official evaluate_model.py script (or the version included in this repository) to evaluate the model's performance.

```
python evaluate_model.py -d <path_to_labeled_data_folder> -o <path_to_model_output_folder> -s <path_to_scores_file>
```

<path_to_labeled_data_folder>: Directory containing the ground truth label files for the test data.
<path_to_model_output_folder>: Directory containing the model's predictions generated by run_model.py.
<path_to_scores_file>: (Optional) Path to save the evaluation scores in a CSV file.

The script computes and outputs the challenge metrics, such as the Challenge score, AUROC, and AUPRC.
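
For intuition about one of these metrics, AUROC can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is an illustrative pairwise implementation, not the code used by evaluate_model.py, and it is quadratic in the number of records, so it is only suitable for small sanity checks.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```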

Results

Our approach achieved a mean Challenge score of 0.250 on the official hidden test sets of the PhysioNet Challenge 2025, ranking 7th out of 40 competing teams. Notably, our model ranked 1st on the ELSA-Brasil test set.

Citation

If you use this code or methodology in your research, please cite our paper:

Wang, X., Syversen, A., Ding, Z., Battye, J., Ho, S. Y. S., & Wong, D. C. (2025). Two-Stage Domain Adversarial Learning to Identify Chagas Disease from ECG and Patient Demographic Data. Computing in Cardiology.

Please also cite the relevant PhysioNet Challenge 2025 papers.

Contact

For any questions, please contact Xiaoyu Wang (wmqn2362@leeds.ac.uk).
