Python package that creates an inventory for HuBMAP datasets.
The inventory is composed of three files:
- a TSV with all file level features
- a JSON file with basic metadata information and file manifest
- a compressed JSON file
Read this
- this package needs access to the file system.
- protected and public published datasets can be processed on HIVE by the `hive` user
- public published datasets can be processed on Bridges2 by any user (data is public)
- there is a bottleneck associated with the maximum number of files that can be processed at once. The magic number is `ncores = 25`.
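The bottleneck above amounts to a bounded worker pool. A minimal sketch, assuming a thread pool capped at 25 workers (the `file_size` helper and `process_files` name are hypothetical, not part of this package's API):

```python
import os
from concurrent.futures import ThreadPoolExecutor

NCORES = 25  # the magic number: maximum files processed at once


def file_size(fullpath):
    """Hypothetical per-file worker returning (path, size in bytes)."""
    return fullpath, os.path.getsize(fullpath)


def process_files(paths):
    """Fan out over a dataset's files, never more than NCORES at a time."""
    with ThreadPoolExecutor(max_workers=NCORES) as pool:
        return list(pool.map(file_size, paths))
```

A process pool would work the same way; a thread pool is shown because the per-file work (stat and checksum calls) is I/O bound.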
The JSON file is a dictionary style structure with dataset and file level information. The keys of this dictionary are
- `data_type` - CODEX, AF, etc.
- `directory` - directory path on HIVE
- `doi_url` - the DOI URL, if any
- `frequencies` - frequencies of file extensions in this dataset. Useful for building histograms
- `hubmap_id` - dataset HuBMAP ID
- `is_protected` - True if protected, False otherwise
- `manifest` - a dictionary with file level statistics for each file in this dataset
- `number_of_files`
- `pretty_size` - an easy to read string representing the size of the data directory
- `size` - size in bytes of the data directory
- `status` - Published, etc.
- `uuid` - dataset UUID
The manifest key in the dictionary above is itself a list of dictionaries. Each dictionary holds file level information about one file in the dataset, so the list has one entry per file. The keys of each dictionary in the list are
- `download_url` - Globus direct download URL. Does not apply for protected datasets.
- `extension`
- `filename`
- `filetype` - image, sequence or other
- `fullpath`
- `md5` - checksum
- `mime-type`
- `modification_time`
- `sha256` - checksum
- `size` - size in bytes
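Since each manifest entry carries both an `md5` and a `sha256` checksum, a downloaded or local copy of a file can be validated against its entry. A minimal sketch (the `verify_file` name is illustrative, not part of this package):

```python
import hashlib


def verify_file(fullpath, md5_expected, sha256_expected):
    """Recompute both checksums and compare with the manifest values.

    `md5_expected` and `sha256_expected` come from the `md5` and
    `sha256` keys of a manifest entry.
    """
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(fullpath, "rb") as f:
        # Stream in 1 MiB chunks so large image/sequence files fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return (md5.hexdigest() == md5_expected
            and sha256.hexdigest() == sha256_expected)
```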
See examples folder for Jupyter Notebooks and simple scripts.
Copyright © 2020-2023 HuBMAP.