porting from slack:

re hubverse: the task id spec indicates anything goes for columns (tbd: how to handle a sample/match/stochastic_run/whatever specification), but recommends `scenario_id` for indicating the scenario element of a task. what actually determines the task id spec for any particular hub is that hub's corresponding config file.
aside: unclear if what they're actually doing here is a json schema file for config, but it kinda looks that way. they definitely need to do better on their schema-of-schemas for the configs spec, tho.
the hubverse output directory structure would be just-doable after "Backend templated paths" (ACCIDDA/flepimop2#46).
notably, however, the hubverse format does not "just" read with arrow partitioning - arrow >8.0 supports picking partition info out of file names in addition to directory structure, but it doesn't do so automatically.
my intent is for the flepimop default arrow-oriented backends to just-work with arrow. the hubverse output should be an (easily requestable) non-default - something one might use for final post-processing outputs, but unlikely to be wanted unless specifically needing hubverse tools.
in terms of the sample/match/whatever id, my read of the documentation is that hubverse wants `output_type_id` (Model output — Hubverse). but that seems inconsistent w/ the most recent SMH experience.
aside: hubverse seems to have boffed it (formally: failed to adopt proper normal form) on `output_type` / `output_type_id` - might be solvable by pulling the output type definition into the partitioning via file paths / names?
update: this is indeed inconsistent w/ SMH, since apparently SMH uses a customization of hubverse for validation. that does effectively generate a pseudo `output_type_id`, which is then used in an ad hoc post-processing chain for ensembles, etc.
thinking about addressing the normalization defect in hubverse & its intersection with how arrow behaves:
arrow partitioning doesn't want to be "reused" - as in: one can't (automatically, practically) have the partitioning done by file structure and then, at the bottom, have one set of files for, say, a proportion output and another for a count output AND keep type-based guarantees. this is most obviously a problem with character outputs, where your numerical outputs get "promoted" to strings.
the hubverse standard has two type-promotion problems: with the output value directly, AND with the `output_type_id` type (which mixes ordinals, strings, numerics, and probabilities across the current standard offerings).
it's possible to solve the arrow problem generally by having a result-distinguishing type parameter at the outermost layer. if `target` distinguishes value type, then `target=X/...other partition info.../data.csv` can naturally be written by arrow (with properly typed tabular data objects per target type), and reading can then occur one layer down with type preserved. it's mildly annoying to have to reach in when reading, but that's probably balanced against having to cast if one doesn't do this sort of thing. probably the most annoying version is when most-but-not-all targets are of one type, and it would be nice to read all the same kinds in at once (because you're probably going to do the same operations on all of them, even if you aren't necessarily going to use them together in calculations). that can be addressed by better specification design: have `target_type` distinguish value type if necessary, and then you can grab e.g. all of the count outcomes at once from `target_type=count/target=(X|Y|Z)/.../data.csv` instead of having to reach into each target directory, read, then combine.
to avoid leaking the hubverse normalization problem: one could use multiple outer layers, one per type needing normalization - so two layers here (one for the target value type and one for the `output_type_id` type). alternatively, the internal spec could have `output_type_id` be unambiguously natural-number ids (since these are actually always counting indicators, tho potentially either ordinal or cardinal), with a join table for translation when necessary.
the first approach is easiest to just roll up to the hubverse spec (slurp-merge, or, if they support arrow-style directory partitioning in the future, literally just send the root), but it makes the reasonable solution from the previous para for mostly-same-type targets more annoying for any non-hubverse activity (i think; haven't worked thru all the details).
the second approach preserves simplicity for non-hubverse use and can be substantially more memory efficient (shorts are cheaper than floats or strings), but it requires a join operation to get to the hubverse spec. also, post hoc extension (to add more quantile targets, say) makes no guarantee that the ids' ordinality still matches the translated objects' ordinality.
performance aside: having categorical indicators manifest as strings (or really anything other than shorts) can be a real memory drag, especially when there are lots of columns. when doing more than slightly complicated scenario analysis, the combinatorial explosion on keying information can get you pretty quickly if you inflate the fully-represented objects in memory - it's easy to lower your problem-size ceiling by 10-100x.
summarizing:

- there isn't much standard in the hubverse standard - lots of flexibility afforded to hubs (generally a good thing imo)
- but using approximately-hubverse-standard elements as a baked-in answer for timeseries data is reasonable - we can aim for that as a built-in pipeline backend module.
- getting precisely to the hubverse-now-standard from that would be "one click" (similar for SMH) - seems like a standard-but-external post-processing module (n.b. this would compartmentalize hubverse dependencies, as that post-processing module probably also wants to pull in their tooling for validation, e.g.)
- here, "backend" is the nomenclature for "thing that interacts with disk", and "process" is the nomenclature for ad hoc / bespoke ETL operations