Central management system for the ASAP CRN Cloud data infrastructure. Initially this will use Python scripts to maintain the source-of-truth archives for ASAP CRN Cloud entities: Datasets, Collections, Releases, and Common Data Elements (CDE). In the future we would like to use configuration files and GitHub Actions to perform the maintenance.
| Repository | Purpose |
|---|---|
| ASAP-CRN/cloud-datasets | Source-of-truth archive for all team-contributed datasets |
| ASAP-CRN/cloud-collections | Curated collections of datasets, versioned for VWB Data Collections |
| ASAP-CRN/cloud-releases | Release records tying datasets and collections to versioned snapshots |
| ASAP-CRN/cloud-cde | Common Data Element definitions and versioning |
The functionality involves several steps, each of which has effects in the Datasets, Collections, Releases, and CDE repositories.
As contributions to the ASAP CRN Cloud are accepted, one of the first steps is to create a Zenodo Dataset DOI. Datasets are accepted as "v0.1". When a Dataset is first released, its version is bumped to "v1.0". Any later changes to a Dataset result in a major or minor version bump, depending on the revisions being made. The reference covering all versions of a Dataset is kept in the "dataset.doi" file. References for individual release versions are kept in "version.doi" files, organized by version.
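The versioning rules above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the orchestrator's actual code: the function name `bump_version` and the `change` labels ("release", "minor", "major") are hypothetical, and it assumes Dataset versions always have the two-part form "vMAJOR.MINOR".

```python
import re

# Assumed format: "v<major>.<minor>", e.g. "v0.1" or "v1.0".
_VERSION_RE = re.compile(r"^v(\d+)\.(\d+)$")


def bump_version(version: str, change: str) -> str:
    """Return the next Dataset version string.

    `change` is a hypothetical label: "release" (first release),
    "minor", or "major".
    """
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))

    if change == "release":
        # First release: an accepted "v0.x" Dataset becomes "v1.0".
        if major != 0:
            raise ValueError(f"{version} has already been released")
        return "v1.0"
    if change == "minor":
        return f"v{major}.{minor + 1}"
    if change == "major":
        # A major bump resets the minor component.
        return f"v{major + 1}.0"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("v0.1", "release"))  # v1.0
print(bump_version("v1.0", "minor"))    # v1.1
print(bump_version("v1.1", "major"))    # v2.0
```

Whether a revision counts as major or minor is a curation decision. The helper only applies the arithmetic once that decision is made.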
functions:
- `create_dataset`, `make_wip_dataset`, `create_dataset_doi`
- `publish_dataset`, `publish_dataset_doi`
- `update_dataset`, `update_dataset_doi`, `update_dataset_version`
Scripts named by tranches of datasets will use these functions to compose configurations for each dataset (new_dataset.json), and then either update or create and then publish those datasets.
Regular "Urgent" Releases to ASAP CRN Cloud are made for newly Accepted but uncurated Datasets. Less regular "Major" or "Minor" Releases are made to release new or updated Curated Datasets. Curated Datasets are organized into Collections which share common Curation workflows/pipelines.
- `define_release`: Enumerates the release number, the release type (Urgent, Minor, or Major), and which individual Datasets and Collections belong to the Release.
- `perform_release`: Creates `release.json`, manages the release archive, and produces `releases.json`.
Scripts named by versions will use these functions to compose configurations for each release update (new_release.json), and then create those releases. Note that new_collections.json may be defined using the functions detailed below.
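As an illustration of what a release configuration might hold, the sketch below builds one in memory and serializes it to JSON. It is only a sketch: the class name `ReleaseConfig`, its field names, and the example values are assumptions, not the actual schema of `new_release.json`. The only parts taken from the text are the three release types and the fact that a release lists its Datasets and Collections.

```python
import json
from dataclasses import asdict, dataclass, field

# Release types named in this README.
RELEASE_TYPES = ("Urgent", "Minor", "Major")


@dataclass
class ReleaseConfig:
    """Hypothetical in-memory form of a release definition."""

    release_version: str
    release_type: str
    datasets: list[str] = field(default_factory=list)
    collections: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject release types other than the three listed above.
        if self.release_type not in RELEASE_TYPES:
            raise ValueError(
                f"release_type must be one of {RELEASE_TYPES}, "
                f"got {self.release_type!r}"
            )

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
config = ReleaseConfig(
    release_version="v9.9.9",
    release_type="Minor",
    datasets=["teamx-brain-rnaseq_example"],
    collections=["example-collection"],
)
print(config.to_json())
```

Validating the release type at construction time means a misspelled type fails before any archive is touched.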
"Major" or "Minor" Releases are made to release new or updated Curated Datasets. Curated Datasets are organized into Collections which share common Curation workflows/pipelines.
- `define_collection`: Reads the Collection update from the `define_release` described above.
- `update_collection`: Reads the Collection details created by the `define_collection` described above and updates the details.
The asap_orchestrator Python package coordinates operations across all managed repositories.
`import asap_orchestrator as ao`

TBC
datasets.json                  # Master index of all datasets
├── WIP/                       # Triage area for WIP Datasets not yet released
│   └── <dataset_name>/
│       ├── wip_files
:   :
└── <dataset_name>/            # format: <team>-<tissue>-<modality>_<unique_name>
    ├── dataset.json           # Metadata: DOI, GCS buckets, releases, CDE version
    ├── DOI/                   # Zenodo deposition files and DOI references
    ├── refs/                  # Reference files for current version
    └── archive/<version>/     # Immutable snapshots of past versions
        └── DOI/               # Version-specific DOI files
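The dataset directory naming format `<team>-<tissue>-<modality>_<unique_name>` can be checked mechanically. The following is a minimal sketch, assuming that the team, tissue, and modality parts contain neither hyphens nor underscores, and that everything after the first underscore is the unique name. The names `DatasetName` and `parse_dataset_name`, and the example dataset name, are hypothetical.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    team: str
    tissue: str
    modality: str
    unique_name: str


def parse_dataset_name(name: str) -> DatasetName:
    """Split '<team>-<tissue>-<modality>_<unique_name>' into its parts."""
    # Everything before the first underscore is the hyphenated prefix.
    prefix, sep, unique_name = name.partition("_")
    parts = prefix.split("-")
    if not sep or not unique_name or len(parts) != 3 or not all(parts):
        raise ValueError(
            f"{name!r} does not match <team>-<tissue>-<modality>_<unique_name>"
        )
    return DatasetName(*parts, unique_name)


# Placeholder name for illustration only.
parsed = parse_dataset_name("teamx-brain-rnaseq_pilot_cohort")
print(parsed.team, parsed.tissue, parsed.modality, parsed.unique_name)
```

If real team or tissue identifiers can contain hyphens, the split rule would need to change. That question is left open here.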
collections.json               # Master index of all collections
└── <collection-name>/
    ├── collection.json        # Metadata: DOI versions, datasets per release version
    └── archive/<version>/     # Immutable snapshots of past versions
        └── collection.json    # Version-specific metadata snapshot
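One way to keep `archive/<version>/` snapshots immutable is to refuse to write a snapshot that already exists. The sketch below illustrates that idea for the collection layout shown above. The function name `snapshot_collection` is hypothetical, and how the real scripts enforce immutability is not stated in this README.

```python
import json
import shutil
import tempfile
from pathlib import Path


def snapshot_collection(collection_dir: Path, version: str) -> Path:
    """Copy collection.json into archive/<version>/, never overwriting."""
    source = collection_dir / "collection.json"
    target_dir = collection_dir / "archive" / version
    target = target_dir / "collection.json"
    if target.exists():
        # Existing snapshots are treated as immutable.
        raise FileExistsError(f"snapshot already exists: {target}")
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return target


# Demonstration in a throwaway directory; names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    collection_dir = Path(tmp) / "example-collection"
    collection_dir.mkdir()
    (collection_dir / "collection.json").write_text(
        json.dumps({"name": "example-collection"})
    )
    snapshot = snapshot_collection(collection_dir, "v1.0")
    print(snapshot.relative_to(collection_dir))
```

Calling the function a second time with the same version raises `FileExistsError`, so an accidental re-run fails instead of altering history.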
releases.json                  # Master index of all releases
└── <release-version>/
    ├── release.json           # Snapshot: all datasets, new_datasets, collections, CDE
    └── *README*.pdf           # Release-specific README
cdes.json                      # Index of all Common Data Elements with versions
└── <cde-version>/
    ├── cde.json               # CDE date, version, list of tables
    ├── cde.csv                # Snapshot CDE schema table
The bootstrap/ directory contains scripts, tools, and templates used to create the historical (pre-Release v4.0.1) archive of Datasets, Collections, and Releases. An initial, experimental draft of Claude-generated code for the asap_orchestrator also lives here.