This repository contains simple supporting files for a paper on Sustainable Cloud Operations for Research (SCORE). The paper describes a practical framework for choosing cloud-based data ingestion approaches in research settings where cost, technical capacity, and sustainability matter.
The repository is intentionally small. It is meant to help a reader understand and reproduce the main evaluation steps, not to provide a production cloud deployment.
    data/
        paper_results.csv                 Main results reported in the paper
        dataset_manifest_template.csv     Template for listing source dataset files
    docs/
        reproduction_notes.md             Plain-language notes on how the experiment was run
        gui_approaches.md                 Notes for the GUI/no-code parts of the work
    notebooks/
        01_code_ingestion_example.ipynb   Example notebook for the Python/code approach
        02_summarise_results.ipynb        Simple summary of the reported results
    scripts/
        summarise_results.py              Command-line version of the results summary
    requirements.txt
    LICENSE
    CITATION.cff
The paper compared three ways of ingesting the same large public dataset into cloud storage:
- Synapse Pipelines: a no-code or GUI-based approach.
- Synapse Notebook: a Python notebook/code-based approach.
- Azure Functions: a serverless approach.
The comparison focused on:
- pipeline cost,
- execution time,
- estimated carbon emissions,
- storage/write costs, and
- geo-replication transfer costs.
The reported values are stored in data/paper_results.csv.
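As a rough illustration of how a results file like this could be summarised, the sketch below averages every numeric column per approach using only the standard library. It is not the repository's `scripts/summarise_results.py`. The column names (`approach`, `cost_usd`, `minutes`) and the sample values are assumptions made up for the example, not the actual layout or contents of `data/paper_results.csv`.

```python
import csv
import io
from collections import defaultdict


def summarise(csv_text, group_col="approach"):
    """Average every numeric column per distinct value of group_col.

    group_col is an assumed column name; adjust it to the real header.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row[group_col]
        for col, value in row.items():
            if col == group_col:
                continue
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip empty or non-numeric cells
            sums[group][col] += number
            counts[group][col] += 1
    return {
        g: {c: sums[g][c] / counts[g][c] for c in sums[g]}
        for g in sums
    }


# Made-up placeholder rows, NOT values from the paper.
SAMPLE = """approach,cost_usd,minutes
A,1.0,10
A,3.0,20
B,2.0,5
"""

if __name__ == "__main__":
    for approach, means in summarise(SAMPLE).items():
        print(approach, means)
```

To try it on the real file, one could pass `Path("data/paper_results.csv").read_text()` to `summarise` and set `group_col` to whichever column identifies the ingestion approach.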
The paper used the NIH Chest X-ray dataset as a large public data ingestion workload. The raw dataset is not included in this repository because it is large and should be downloaded from the official source by each user.
Use data/dataset_manifest_template.csv as a simple template for recording the files used in a reproduction run.
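One way to fill such a manifest is to walk the local download directory and record each file's path, size, and checksum. The sketch below does this with the standard library. The column names (`relative_path`, `size_bytes`, `sha256`) and the function name `build_manifest` are hypothetical and may not match the headers in the template, so align them before use.

```python
import csv
import hashlib
from pathlib import Path

# Hypothetical column names; the real template may use different headers.
FIELDS = ["relative_path", "size_bytes", "sha256"]


def build_manifest(data_dir, out_csv):
    """Write one CSV row per file under data_dir and return the rows."""
    data_dir = Path(data_dir)
    rows = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            # Hash in 1 MiB chunks so large archives are not loaded into memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        rows.append({
            "relative_path": path.relative_to(data_dir).as_posix(),
            "size_bytes": path.stat().st_size,
            "sha256": digest.hexdigest(),
        })
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Writing the manifest outside the dataset directory keeps it out of later runs of the same scan.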
Install the small Python requirements:
    pip install -r requirements.txt
Summarise the reported results:
    python scripts/summarise_results.py
Open the notebooks:
notebooks/01_code_ingestion_example.ipynb
notebooks/02_summarise_results.ipynb
The first notebook shows the structure of the Python/code approach. It defaults to a dry-run style example and does not download or upload the full dataset.
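The dry-run idea can be sketched as follows. This is an illustrative pattern only, not the notebook's actual code: the function name `ingest`, the `upload` callable, and the example file name are all hypothetical. With `dry_run=True` (the default here) nothing is transferred and the function only reports what it would do.

```python
from pathlib import Path


def ingest(files, upload, dry_run=True):
    """Upload each file via `upload`, or only describe the plan in a dry run.

    `upload` is any callable taking a Path; a real run would wrap a cloud
    storage client here (not shown, and not assumed by this sketch).
    """
    actions = []
    for path in map(Path, files):
        if dry_run:
            actions.append(f"DRY RUN: would upload {path.name}")
        else:
            upload(path)
            actions.append(f"uploaded {path.name}")
    return actions


if __name__ == "__main__":
    # Hypothetical file name; no file is read or uploaded in a dry run.
    for line in ingest(["example_archive.tar.gz"], upload=lambda p: None):
        print(line)
```

Keeping the dry run as the default means a real upload requires an explicit `dry_run=False`, which guards against accidental cost.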
- Do not commit raw medical images or cloud credentials.
- Keep cloud cost exports separate unless they have been reviewed for sharing.
- If you rerun the experiment, record the cloud region, date, resource type, and any changes to the dataset or replication settings.
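For reruns, the details listed above could be captured in a small JSON file stored next to the results. The sketch below is one possible shape. The field names, the `RunRecord` class, and the example values are assumptions for illustration, not a format defined by the paper or this repository.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class RunRecord:
    """Metadata for one reproduction run (hypothetical field names)."""
    cloud_region: str
    resource_type: str
    run_date: str = field(default_factory=lambda: date.today().isoformat())
    dataset_changes: str = "none"
    replication_settings: str = "unchanged"


def save_record(record, path):
    """Write the record as indented JSON and return the dict that was written."""
    payload = asdict(record)
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2)
    return payload


if __name__ == "__main__":
    # Placeholder values; replace with the details of the actual run.
    record = RunRecord(cloud_region="example-region", resource_type="example-resource")
    print(save_record(record, "run_record.json"))
```

A record like this contains no credentials or cost exports, so it should be safe to commit alongside the results.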