Data management guide #65
Merged
7 commits
56b01bf
Add data storage, formats, and access information
vmartinez-cu bb46bad
update purpose and intro sections
vmartinez-cu 4c9018c
Move guide out of file formats folder and add links to relevant guides
vmartinez-cu 1196dac
Consolidate purpose section with intro section. Add acronym definitions
vmartinez-cu 88ace0a
Fix typo
vmartinez-cu 7ae4979
Add images from confluence page
vmartinez-cu c914ba2
Bold sub-headings for improved readability
vmartinez-cu
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,126 @@ | ||
| # Managing Data: Storage, Format, Access | ||
|
|
||
| This guide covers storage options, data format choices, and access protocols for datasets, so that they are managed in | ||
| ways that meet open science principles and institutional requirements. It supports Data Systems workflows by | ||
| improving data accessibility, promoting reproducibility, and enhancing usability and compliance with funder and | ||
| publisher requirements. | ||
|
|
||
| ## Storage | ||
|
|
||
| Here we assume that a dataset must be publicly accessible: machine readable on a public network in a manageable format. | ||
| Storage on local disks or external hard drives is not considered open and accessible. | ||
|
|
||
| Storage choice depends on many factors, including dataset size, access and performance needs, authentication | ||
| requirements, and other data management needs (e.g., the need for a DOI). | ||
|
|
||
| **LASP resources** | ||
|
|
||
| - LISIRD, for solar irradiance and related data products | ||
| - Other LASP storage options (e.g., dsapps, lasp-store) | ||
| - LEMR for metadata | ||
|
|
||
| **CU Boulder resources** | ||
|
|
||
| - CU Scholar: suitable for small datasets; it provides a DOI, a generically styled landing page, and some storage. | ||
| - CU Boulder PetaLibrary: see the PetaLibrary page for information about CU Research Computing's storage resources. | ||
|
|
||
| **External repositories** | ||
|
|
||
| - Dryad: a curated resource that makes research data discoverable, freely reusable, and citable. | ||
| - Figshare: a web-based interface designed for academic research data management and research data dissemination. | ||
| It accepts all file types (with in-browser viewing). | ||
| - Simple storage in the cloud: AWS Glacier, S3 | ||
| - Fedora, Zenodo, Open Science Framework, Dataverse | ||
|
|
||
| ## Data Formats | ||
|
|
||
| To maximize the potential application of tools to data, data should be provided in common, open, self-describing, | ||
| machine readable formats such as: | ||
|
|
||
| - netCDF/HDF | ||
| - CDF | ||
| - ASCII (e.g., CSV with headers) | ||
| - FITS | ||
| - JSON | ||
|
|
||
| What is considered 'common' may vary somewhat by science domain. The use of proprietary data formats, such as IDL | ||
| ".sav" files, is discouraged because a license is required to use the data. | ||
|
|
||
| "Self-describing" means that metadata about the data is included in the file or package. Self-describing formats | ||
| include netCDF, HDF, and CSV with headers. ASCII or CSV data without header information, and raw binary data, are not | ||
| self-describing. | ||
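
As a minimal sketch of the difference, the Python snippet below reads the same two rows with and without a header row, using only the standard library. The column names and values are hypothetical and are not taken from any LASP dataset.

```python
import csv
import io

# Hypothetical sample: the header row names each column and its unit.
SELF_DESCRIBING = """time_utc,irradiance_w_m2
2020-01-01T00:00:00Z,1361.2
2020-01-02T00:00:00Z,1361.3
"""

# The same values with no header: nothing in the file says what the columns mean.
NOT_SELF_DESCRIBING = """2020-01-01T00:00:00Z,1361.2
2020-01-02T00:00:00Z,1361.3
"""


def read_records(text):
    """Parse CSV text into dicts keyed by the names found in the first row."""
    return list(csv.DictReader(io.StringIO(text)))


# With a header, a generic reader recovers the column names from the file itself.
records = read_records(SELF_DESCRIBING)
print(records[0]["irradiance_w_m2"])

# Without a header, the same generic reader mistakes the first data row
# for column names, so one record is silently lost.
misread = read_records(NOT_SELF_DESCRIBING)
print(len(records), len(misread))
```

Formats such as netCDF and HDF go further than a header row by storing units, descriptions, and other attributes alongside each variable.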
|
|
||
| ## Machine Readability | ||
|
|
||
| Any dataset can be made machine readable by writing one-off code to read it. However, datasets should be | ||
| interoperable, which means users should not have to create or rely on specialized code. | ||
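
To illustrate, here is a small Python sketch that contrasts a one-off parser for an ad hoc layout with the same content stored as JSON. The layout, field names, and values are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical ad hoc layout: semicolons between fields, pipes between records.
AD_HOC = "2020-01-01;1361.2|2020-01-02;1361.3"


def parse_ad_hoc(text):
    """One-off parser that only works for this particular layout."""
    rows = []
    for record in text.split("|"):
        date, value = record.split(";")
        rows.append({"date": date, "irradiance": float(value)})
    return rows


# The same content as JSON: any standard JSON library can read it unchanged.
AS_JSON = (
    '[{"date": "2020-01-01", "irradiance": 1361.2},'
    ' {"date": "2020-01-02", "irradiance": 1361.3}]'
)

from_ad_hoc = parse_ad_hoc(AD_HOC)
from_json = json.loads(AS_JSON)
print(from_ad_hoc == from_json)
```

Both paths yield the same records, but only the ad hoc one requires every consumer to obtain or rewrite `parse_ad_hoc`.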
|
|
||
| Machine readability isn't a binary condition of a dataset; it's a continuum. The issue is whether one can use a dataset | ||
| in an existing tool that conforms to current standards. Use of a common data format or metadata schema is the first | ||
| step in achieving machine readability. | ||
|
|
||
| There are ways to structure data that foster machine readability, and ways that do not. | ||
|
|
||
| To improve machine readability: | ||
|
|
||
| - Use consistent naming conventions and delimiters | ||
| - Represent missing values clearly | ||
| - Include metadata in self-describing formats | ||
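
The first two points can be sketched in Python as follows, assuming a hypothetical CSV file that uses snake_case column names, a single delimiter, and one missing-value marker (`NaN`) throughout. The file contents are made up for illustration.

```python
import csv
import io
import math

# Hypothetical data: consistent column names, one delimiter (comma),
# and a single missing-value marker ("NaN") used everywhere.
CLEAN = """time_utc,irradiance_w_m2
2020-01-01T00:00:00Z,1361.2
2020-01-02T00:00:00Z,NaN
2020-01-03T00:00:00Z,1361.3
"""


def read_irradiance(text):
    """Return irradiance values as floats; the "NaN" marker becomes float nan."""
    reader = csv.DictReader(io.StringIO(text))
    return [float(row["irradiance_w_m2"]) for row in reader]


values = read_irradiance(CLEAN)

# Because missing values are marked consistently, filtering them is trivial.
valid = [v for v in values if not math.isnan(v)]
print(len(values), len(valid))
```

Had the file mixed markers such as `-999`, blank fields, and `n/a`, the reader would need special cases for each, which is the kind of specialized code the guidance above aims to avoid.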
|
|
||
| ### Examples of what to do and what not to do with your data | ||
|
|
||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|  | ||
|
|
||
| ## Useful Links | ||
|
|
||
| - [Data Stewardship at LASP](data_stewardship.md) | ||
| - [FAIR Principles](fair_principles.md) | ||
| - [netCDF](file_formats/netcdf.md) | ||
| - [LISIRD](https://lasp.colorado.edu/lisird/) | ||
| - [Zenodo](https://zenodo.org/) | ||
| - [CU Scholar](https://scholar.colorado.edu/about) | ||
| - [CU PetaLibrary](https://www.colorado.edu/rc/resources/petalibrary) | ||
| - [Dryad](https://datadryad.org/) | ||
| - [Figshare](https://figshare.com/) | ||
| - [AWS Glacier](https://aws.amazon.com/glacier/) | ||
| - [AWS S3](https://aws.amazon.com/s3/) | ||
| - [Fedora](https://duraspace.org/fedora/) | ||
| - [Open Science Framework](https://osf.io/) | ||
| - [Dataverse](https://dataverse.org/) | ||
|
|
||
| ## Acronyms | ||
|
|
||
| Acronyms used in this guide: | ||
|
|
||
| - **AWS** = Amazon Web Services | ||
| - **CDF** = Common Data Format | ||
| - **CSV** = Comma-Separated Values | ||
| - **DOI** = Digital Object Identifier | ||
| - **FITS** = Flexible Image Transport System | ||
| - **HDF** = Hierarchical Data Format | ||
| - **IDL** = Interactive Data Language | ||
| - **LEMR** = LASP Environmental Metadata Repository | ||
| - **LISIRD** = LASP Interactive Solar Irradiance Data Center | ||
| - **MOU** = Memorandum of Understanding | ||
| - **netCDF** = Network Common Data Form | ||
|
|
||
| Credit: Content taken from a Confluence guide written by Anne Wilson and Shawn Polson | ||
Wow, lots of acronyms in this one 😅