
Proposal: Modeling Data subdirectories are a Drop Box #52

@dwr-psandhu

Description


Dropbox

Proposal 1: Modeling Data subdirectories are a Drop Box

In this proposal, you can put things anywhere in Modeling_Data as long as:
• you can point to a reader that reads the data, applies provider flags the way you want, and transforms it into a dataframe,
• the filenames sort lexicographically, and
• you make a small entry in recipes/data_recipes.yaml describing how to read the data plus a few pieces of metadata.
A checker will be provided to validate entries.
Nightly, the files will be swept into /formatted, after which they are safe; whether the raw files are safe is largely up to users.
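For concreteness, here is a minimal sketch of what the nightly sweep could look like. Everything in it is illustrative: the READERS registry, the default location, and pandas read_csv standing in for a real reader such as the read_ts named in the recipe example below.

import glob
import os

import pandas as pd
import yaml

# Stand-in registry; the real one would map names like "read_ts" to readers
# that also apply provider flags before producing a dataframe.
READERS = {"read_ts": pd.read_csv}

def sweep_dropbox(recipes_path, formatted_dir):
    """Sweep every recipe-described file into the formatted area."""
    with open(recipes_path) as f:
        recipes = yaml.safe_load(f)
    for recipe in recipes:
        base = recipe.get("location", "Modeling_Data/repo_dropbox/data")
        reader = READERS[recipe["reader"]]
        # Lexicographic sort is the contract: later filenames supersede earlier ones.
        for path in sorted(glob.glob(os.path.join(base, recipe["file_pattern"]))):
            # Assumes reader_params, when present, is a mapping of keyword arguments.
            df = reader(path, **(recipe.get("reader_params") or {}))
            df.to_csv(os.path.join(formatted_dir, recipe["name"] + ".csv"))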

Use cases:

  1. Mokelumne. Populated by USGS data and two types of EBMUD data.
  2. Daily data that can be downloaded: this would not be included if a daily downloader can handle it in roughly the same way we now handle the continuous data.
  3. CCF gates. This is provided to us in a subfolder automatically by the SCADA people. It is continuous but not regular. I derive a simpler series that is even more irregular but sparser and distills the information in a useful way (one possible distillation is sketched after this list).
  4. Banks pumping. This is grabbed opportunistically and considerably transformed, from pumping switches to flow in CFS.
  5. Unofficial data from official sources: Often we get data from the flow/WQ groups at NCRO that they don't want to publish officially but that we can describe as a short-term station. Often these are "cross-program" collections – for instance, stage data collected by the flow group. They are acquired during projects or over email. They may or may not be maintained long term.
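To make use case 3 concrete, here is one plausible reading of "distills the information": keep only the samples at which the gate record actually changes. The helper below is hypothetical, not the actual CCF derivation.

import pandas as pd

def distill_changes(ts: pd.Series) -> pd.Series:
    """Reduce a continuous-but-irregular record to its change points.

    The result is even more irregular but much sparser, and it still
    reconstructs the original as a step (forward-filled) series.
    """
    # True at the first sample and wherever the value differs from its predecessor.
    keep = ts.ne(ts.shift())
    return ts[keep]

For a gate position that changes a few times a day, this drops almost all of a 15-minute record while losing nothing that step interpolation cannot recover.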

The proposal is that these can be put in /dropbox/data but also anywhere on Modeling_Data:
Modeling_Data

  • repo
    • continuous
      • formatted
    • daily
      • formatted (or should it be repo->formatted?)
    • repo_staging
      • continuous
      • daily
  • repo_dropbox
    • data
    • recipes
      • data_recipes.yaml
    • mokelumne

The crux is data_recipes.yaml, the purpose of which is to do the following:

  1. Make sure we know what is or has been swept into our repository.
  2. Make it easy to update stray data by adding more.
  3. Connect the entries to enough metadata and standardization (station, agency, variable, unit).
  4. Address how possibly-overlapping updates work.
  5. Allow the user to launch a checker, as in the example entry below:

- name: pcnb_elev
  file_pattern: SomeWeird_MOKE_golf_name.csv
  location: Modeling_Data/repo_dropbox/data    # This could be understood as a default 
  reader: read_ts        # Names, pointers to code etc. To be fleshed out
  reader_params: ... 
  freq: 15min   # None for irregular, "infer" to infer it from the data.
  metadata:              # Anything can be added, but the items below are required for a well formed entry
     station: pcnb       # Entry required in station_dbase.csv
     sublocation: None   # Optional; if provided, checked against a list
     agency: dwr_ncro    # Robust to aliases like "ncro" or "dwr-ncro"
     variable: elev      # Checked against data dictionary, some translations for common terms
     unit: stage         # Checked against data dictionary, some translations for common terms (e.g. cfs to ft^3/s)
- name: etc  
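Item 5 calls for a checker. Below is a sketch of what a well-formedness check might do, assuming station_dbase.csv and the data dictionary have already been loaded into simple lookup sets; the alias tables and all function names are hypothetical.

import yaml

# Hypothetical alias tables; the real checker is "robust to aliases" such as
# "ncro"/"dwr-ncro" and translates common unit spellings like cfs -> ft^3/s.
AGENCY_ALIASES = {"ncro": "dwr_ncro", "dwr-ncro": "dwr_ncro"}
UNIT_ALIASES = {"cfs": "ft^3/s"}

REQUIRED = ("station", "agency", "variable", "unit")

def check_entry(entry, stations, agencies, variables, units):
    """Return a list of problems with one recipe entry; empty means well formed."""
    meta = entry.get("metadata", {})
    errors = [f"{entry['name']}: missing metadata '{k}'" for k in REQUIRED if k not in meta]
    if meta.get("station") not in stations:  # entry required in station_dbase.csv
        errors.append(f"{entry['name']}: unknown station {meta.get('station')}")
    if AGENCY_ALIASES.get(meta.get("agency"), meta.get("agency")) not in agencies:
        errors.append(f"{entry['name']}: unknown agency {meta.get('agency')}")
    if meta.get("variable") not in variables:  # checked against the data dictionary
        errors.append(f"{entry['name']}: unknown variable {meta.get('variable')}")
    if UNIT_ALIASES.get(meta.get("unit"), meta.get("unit")) not in units:
        errors.append(f"{entry['name']}: unknown unit {meta.get('unit')}")
    return errors

def check_recipes(path, **reference_sets):
    """Load data_recipes.yaml and check every entry against the reference sets."""
    with open(path) as f:
        recipes = yaml.safe_load(f)
    return [err for entry in recipes for err in check_entry(entry, **reference_sets)]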
