- Description
- Fed-FDR workflow
- Repository layout
- Software requirements
- Reproducing the simulation studies
- Reproducing the real data analysis
- Notes on output and reproducibility
- Support
This repository contains code to reproduce the simulation studies and the real data analysis reported in the manuscript. All code is written in R. Results are written to .rds files and figures are produced from dedicated plotting scripts.
In this repository, we include a synthetic dataset, sample_data_to_run.csv, generated to approximate the structure of the real-world COVID-19 pediatric dataset. It contains 3,990 patients across 34 clinical sites, with 243 binary covariates and one binary outcome. The marginal distributions and correlation structure of the features were designed to resemble those of the original EHR dataset, so that the synthetic data are suitable for testing and reproducing the analysis pipeline.
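As a quick sanity check, the dimensions quoted above can be verified once the file is loaded. The sketch below is not part of the repository. It assumes the data frame has one site identifier column and one outcome column; the names `site` and `outcome` are hypothetical and should be replaced with the actual column names in the file.

```r
# Summarise a patient-level data frame: patients, sites, covariates.
# `site_col` and `outcome_col` are hypothetical names (assumptions);
# every other column is treated as a covariate.
summarise_dataset <- function(df, site_col = "site", outcome_col = "outcome") {
  stopifnot(site_col %in% names(df), outcome_col %in% names(df))
  covariates <- setdiff(names(df), c(site_col, outcome_col))
  is_binary <- function(x) all(x %in% c(0, 1))
  list(
    n_patients   = nrow(df),
    n_sites      = length(unique(df[[site_col]])),
    n_covariates = length(covariates),
    all_binary   = all(vapply(df[c(covariates, outcome_col)], is_binary, logical(1)))
  )
}

# Toy data frame standing in for the real file.
toy <- data.frame(
  site    = c("A", "A", "B"),
  x1      = c(0, 1, 1),
  x2      = c(1, 0, 0),
  outcome = c(0, 1, 0)
)
toy_summary <- summarise_dataset(toy)
```

If the column-name assumptions hold and the file has no extra columns, calling `summarise_dataset(read.csv("sample_data_to_run.csv"))` should report the patient, site, and covariate counts given in the description.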
- Stage I:
  - Each collaborating site $k \in \{1, \ldots, K\}$ fits a GLM–Lasso to obtain its support $\hat{S}^{(k)}$, which is then shared with all other sites.
  - Each collaborating site fits a refined de-sparsified Lasso using the aggregated support $\hat{S}^{(-k)} = \bigcup_{j \neq k} \hat{S}^{(j)}$.
  - Each collaborating site transfers the resulting estimator $\hat{\beta}_{\hat{S}^{(-k)}}$ to the central site.
- Stage II:
  - The central site constructs mirror statistics to select the final support while controlling the FDR.
- NOTE: Privacy-Preserving Distributed Algorithms (PDA) is a framework of statistical and machine learning methods that enables secure analysis across multiple institutions without sharing individual patient data (IPD). In this document, we use PDA to refer to the central site.
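To make the two stages of the workflow more concrete, here is a minimal sketch in base R. It is illustrative only and is not the code in Fed_simulation_functions.R. The function names are hypothetical, and the mirror statistic shown (a generic sign-and-magnitude form built from two independent estimates, with a data-driven cutoff) is an assumption; the manuscript defines the statistic actually used.

```r
# Stage I (sharing step): leave-one-out union of site-level supports,
# i.e. S^(-k) = union of S^(j) over all sites j != k.
# `supports` is a list with one vector of selected indices per site.
aggregate_support <- function(supports, k) {
  sort(unique(unlist(supports[-k])))
}

# Stage II (assumed generic form): mirror statistic from two independent
# estimates of the same coefficients. Large positive values suggest signal;
# null coefficients give values roughly symmetric around zero.
mirror_statistic <- function(b1, b2) {
  sign(b1 * b2) * (abs(b1) + abs(b2))
}

# Pick the smallest cutoff whose estimated false discovery proportion,
# #{M_j <= -cutoff} / max(#{M_j >= cutoff}, 1), is at most q.
select_by_mirror <- function(m, q = 0.1) {
  candidates <- sort(unique(abs(m[m != 0])))
  for (cutoff in candidates) {
    fdp_hat <- sum(m <= -cutoff) / max(sum(m >= cutoff), 1)
    if (fdp_hat <= q) return(which(m >= cutoff))
  }
  integer(0)  # no cutoff meets the target level
}

# Toy example with three sites and five coefficients.
supports  <- list(c(1, 3), c(2, 3), c(3, 5))
s_minus_1 <- aggregate_support(supports, k = 1)

b1 <- c(2.0, 1.5, 0.1, -0.2, 1.8)
b2 <- c(1.8, 1.2, -0.1, 0.1, 2.1)
m  <- mirror_statistic(b1, b2)
selected <- select_by_mirror(m, q = 0.2)
```

In the toy example, site 1 receives the union of the supports from sites 2 and 3, and the selection step keeps the three coefficients whose two estimates agree in sign and are large in magnitude.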
Folder: simulation_result
Scripts to run:
- simulation_n500p500.R
- simulation_n500p1000.R
- simulation_n1000p500.R
- simulation_scalebility.R
Folder: use case
Main script:
Table1.R
Support file loaded by the main script:
Fed_simulation_functions.R
Sample dataset:
sample_data_to_run.csv
- R version 4.4.1 or later.
- RStudio is recommended for interactive work.
- Base R packages only, unless a script prompts you to install an additional package.
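The version requirement above can be checked from the R console before running anything. This is a small convenience sketch, not something the repository scripts do themselves; the variable names are illustrative.

```r
# Compare the running R version with the minimum stated above.
required <- "4.4.1"
meets_requirement <- getRversion() >= required

if (!meets_requirement) {
  warning("R ", required, " or later is recommended; found ",
          as.character(getRversion()))
}
```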
- Open R or RStudio.
- Set the working directory to the repository root.
- Run one or more of the simulation scripts listed above. For example:
    source("simulation_result/simulation_n500p500.R")
    source("simulation_result/simulation_n500p1000.R")
    source("simulation_result/simulation_n1000p500.R")
    source("simulation_result/simulation_scalebility.R")
- Each script writes its outputs as .rds files inside simulation_result.
- To recreate the figures in the manuscript, run:
    source("simulation_result/Figure1.R")
    source("simulation_result/Figure2.R")
    source("simulation_result/Figure3.R")
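If you prefer to run the simulation scripts in one go, the steps above can be wrapped in a small helper. This is a sketch under the assumption that the working directory is the repository root; `run_scripts` is a hypothetical helper, not a function provided by the repository.

```r
# Source each script in order, skipping any path that does not exist.
# Returns the paths that were actually run.
run_scripts <- function(paths) {
  ran <- character(0)
  for (p in paths) {
    if (!file.exists(p)) {
      message("Skipping missing script: ", p)
      next
    }
    message("Running ", p)
    source(p)
    ran <- c(ran, p)
  }
  ran
}

simulation_scripts <- file.path("simulation_result", c(
  "simulation_n500p500.R",
  "simulation_n500p1000.R",
  "simulation_n1000p500.R",
  "simulation_scalebility.R"
))

# Not executed here, since the simulations may take a long time.
# From the repository root, uncomment to run:
# run_scripts(simulation_scripts)
```

Running the figure scripts afterwards in the same way keeps the order explicit: simulations first, then plotting.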
- Open R or RStudio.
- Set the working directory to the folder use case.
- Ensure the sample dataset sample_data_to_run.csv is present in the same folder.
- Run the main script:
    source("use case/Table1.R")
  The file Fed_simulation_functions.R is sourced automatically by Table1.R.
- Outputs are written as .rds files inside use case.
- To produce the ROC figure from the manuscript, run:
source("use case/Figure4.R")
- All scripts set their own random seeds when applicable. If you require exact replication, do not modify those seeds.
- Figures are regenerated from the .rds result files. If you delete or relocate those files, recreate them by rerunning the corresponding simulation or analysis script.
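To inspect the saved results directly, the .rds files in an output folder can be read into a named list. The helper below is a sketch, not part of the repository; `load_results` is a hypothetical name, and the structure of each stored object depends on the script that wrote it. The demonstration uses a temporary folder so that it runs without the real outputs.

```r
# Read every .rds file in `dir` into a list named by file name.
load_results <- function(dir) {
  files <- list.files(dir, pattern = "\\.rds$", full.names = TRUE,
                      ignore.case = TRUE)
  results <- lapply(files, readRDS)
  names(results) <- basename(files)
  results
}

# Demonstration with a temporary folder and a made-up result object.
demo_dir <- file.path(tempdir(), "rds_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(list(value = 1), file.path(demo_dir, "example.rds"))

demo_results <- load_results(demo_dir)
```

For the actual outputs, `load_results("simulation_result")` from the repository root would return whatever objects the simulation scripts saved there.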
For questions about the code or the study design, please open an issue in the repository.
