Releases: spiraldb/raincloud
v0.2.0 - lightweight loader, publish CLI, wheel-installable build pipeline
Added
- raincloud loader package. A new importable package
(separate from the scripts/ build pipeline) for loading
already-prepared artefacts. raincloud.load("<slug>") (alias
load_dataset) returns a lazy Dataset handle: nothing is fetched
until you call .path() / .to_arrow() / .scan() / .to_pandas().
Resolution order is local cache → mirror → local build: a cache
hit short-circuits, otherwise it pulls from the configured mirror,
and only on a cache+mirror miss does it shell out to
scripts.pipeline.build. Configured via env vars: RAINCLOUD_MIRROR
(an fsspec base such as s3://bucket/prefix or file:///path;
a private/internal artefact store, not a public Raincloud endpoint),
RAINCLOUD_CACHE (cache dir override), RAINCLOUD_OFFLINE
(cache-only; mirror/build misses raise), RAINCLOUD_STRICT_CHECKSUM
(opt-in hard integrity gate; see below). When the snapshot records a
checksum, a drift from it warns-and-adopts by default (see "Drift is an
alert"); where no checksum is recorded yet (most of the catalog today)
the pinned byte size is used as a cheap corruption check instead.
- scripts.pipeline.publish mirror-sync CLI.
python -m scripts.pipeline.publish <slugs|--all> --mirror <url>
uploads built outputs/v1/... artefacts to a mirror, gated on each
artefact's on-disk sha256 matching docs/v1/snapshot.json (slugs with
no recorded sha are uploaded ungated). Each upload streams to a
<key>.<uuid>.part temp key and renames into place, so a mid-stream
crash never leaves a truncated object at the canonical key. The
snapshot is resolved via the same RAINCLOUD_SNAPSHOT → checkout →
wheel precedence the loader uses, not a hardcoded checkout path.
--dry-run previews the upload plan.
- parquet_sha256 / vortex_sha256 in docs/v1/snapshot.json:
per-slug artefact checksums, used by both the loader (download
integrity) and publish (the snapshot-match gate).
- examples/use_loader.py: runnable walkthrough of the loader API
(metadata access, .to_arrow / .scan / .to_pandas materialization,
format override, env-var configuration, the full
RaincloudError hierarchy). Runs against the packaged catalog with
no network; --materialize exercises the full resolution path.
- Code-path example scripts in examples/. Single-file demos that
load() a real catalog dataset and run a query: nyc_taxi_tip_rate.py
(DuckDB over .scan() on 48.7M yellow-cab trips: what share left no
recorded tip, by payment_type), kepler_exoplanets.py (pandas
disposition counts + smallest confirmed planet), wine_quality_correlations.py
(feature-quality correlations), and olympic_medals.py (medals by NOC /
decade). tests/test_examples.py byte-compiles every example and (under
--run-network) runs the kepler one end-to-end.
- New agent skills: raincloud-load and raincloud-publish. Wrap
the loader API and the publish CLI respectively, matching the
existing raincloud-* skill conventions (name-only,
disable-model-invocation: true).
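The loader's resolution order described above (local cache, then mirror, then local build) can be pictured with a small sketch. This is not the raincloud implementation: resolve, OfflineMiss, the dict-backed cache, and the two callables are hypothetical stand-ins, chosen only to show the short-circuit order and the cache-only behaviour of offline mode.

```python
from typing import Callable, Dict, Optional, Tuple


class OfflineMiss(Exception):
    """Hypothetical error: cache miss while offline mode is on."""


def resolve(
    slug: str,
    cache: Dict[str, bytes],
    fetch_from_mirror: Callable[[str], Optional[bytes]],
    build_locally: Callable[[str], bytes],
    offline: bool = False,
) -> Tuple[str, bytes]:
    """Return (origin, payload), trying cache, then mirror, then build."""
    # 1. A cache hit short-circuits everything else.
    if slug in cache:
        return "cache", cache[slug]
    # 2. Offline mode is cache-only: a miss raises instead of fetching.
    if offline:
        raise OfflineMiss(slug)
    # 3. Try the mirror; None stands for "not on the mirror".
    payload = fetch_from_mirror(slug)
    origin = "mirror"
    if payload is None:
        # 4. Only a cache+mirror miss falls through to a local build.
        payload = build_locally(slug)
        origin = "build"
    cache[slug] = payload  # later calls are served from cache
    return origin, payload


if __name__ == "__main__":
    demo_cache: Dict[str, bytes] = {}
    demo_mirror = {"on-mirror": b"mirrored"}
    for name in ("on-mirror", "on-mirror", "local-only"):
        print(name, resolve(name, demo_cache, demo_mirror.get, lambda s: b"built"))
```

Passing the mirror and build steps in as callables keeps the ordering logic testable without any network or build toolchain.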
Changed
- examples/ is now runnable demos; authoring templates moved to
templates/. minimal_spec.json and streaming_handler.py.tmpl (config
templates for adding a source) live under the new top-level templates/;
examples/ is reserved for code-path scripts that use the raincloud.load
API. Doc and skill references updated accordingly.
- Packaging: the project is now a hatchling-built, installable
package (installed from GitHub: pip install "raincloud @ git+https://github.com/spiraldb/raincloud", not PyPI). The wheel force-includes
docs/v1/snapshot.json and sources.json as packaged data under
raincloud/_data/, so the catalog resolves with no repo checkout.
- BREAKING (install): the heavy build toolchain moved out of the base
dependency set into the [build] extra. A bare uv sync --inexact
(or a pip install from the GitHub repo) now installs only the lightweight loader
(pyarrow, numpy, vortex-data, fsspec); building datasets
requires uv sync --extra build --inexact (duckdb, pandas, osmium,
pyreadstat, openpyxl, py7zr, unlzw3, zstandard, jsonschema). Transport
backends are per-scheme extras ([s3] → s3fs, [http] → aiohttp;
file:// needs neither); [duckdb] / [pandas] back
Dataset.scan() / .to_pandas(). This does not change the
no-redistribution posture in DISCLAIMER.md.
- Drift is an alert, not a blocker. When a slug's sha256 is pinned
in the snapshot and the mirror or local build produces different
bytes, the loader now prints [raincloud] WARN: <slug> from <origin> sha256 drifted ... to stderr and adopts the new bytes anyway.
Upstream content changes are common and benign; the build should
still work, with the user informed. The loader's mirror-fetch path is
the strict-capable caller: under RAINCLOUD_STRICT_CHECKSUM it passes
_cache.adopt(..., strict=True) so a mirror mismatch raises
ChecksumMismatch. (scripts.pipeline.publish is a separate gate: it
refuses to upload via its own PublishMismatch, not through adopt.)
Adopted bytes are recorded in a .<name>.pin sidecar, which stores the
snapshot sha reconciled against, the on-disk size, and the origin
(mirror/build), so later loads serve them straight from cache; a genuine
snapshot revision (the pinned sha changed) still re-fetches, and a
post-adoption size change still falls through to a fresh fetch. In the
default (non-strict) mode a sha-present cache hit is served on a byte-size
match without rehashing (the multi-GB rehash-avoidance fast path), so a
same-size on-disk content swap isn't caught until strict mode forces a
rehash.

  Set RAINCLOUD_STRICT_CHECKSUM=1 to opt the loader into a hard gate:
  for a sha-pinned slug from the mirror, a mismatch on download AND on a
  cache hit raises ChecksumMismatch (the cached file is rehashed each
  load, catching even same-size tampering). Sha-less slugs have nothing to
  rehash against, so strict leaves their size/pin corruption check
  unchanged. The local build path is never strict-gated against the
  maintainer's sha: a client's rebuild legitimately differs (columnar
  output is rarely bit-reproducible), so the built artefact is tagged
  origin=build in its pin and served from cache by that provenance,
  even under strict, instead of being rebuilt every load. It is rebuilt
  only when the snapshot pin it was built against changes (the source of
  truth moved) or the cached file is corrupted. For full cryptographic
  integrity, point strict deployments at a mirror.
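The warn-and-adopt default and the strict gate can be illustrated with a short sketch. The names adopt and ChecksumMismatch echo the notes above, but this function is a hypothetical stand-in, not the real _cache.adopt: its signature, the pin dictionary keys, and the exception message are assumptions, and the warning text stops where the notes elide it.

```python
import hashlib
import sys
from typing import Optional


class ChecksumMismatch(Exception):
    """Stand-in for the loader's typed error of the same name."""


def adopt(data: bytes, pinned_sha: Optional[str], slug: str,
          origin: str, strict: bool = False) -> dict:
    """Reconcile fetched or built bytes against the snapshot sha."""
    actual = hashlib.sha256(data).hexdigest()
    if pinned_sha is not None and actual != pinned_sha:
        # Strict mode gates mirror bytes only; local builds are never gated.
        if strict and origin == "mirror":
            raise ChecksumMismatch(f"{slug}: expected {pinned_sha}, got {actual}")
        # Default: warn on stderr and adopt the new bytes anyway.
        print(f"[raincloud] WARN: {slug} from {origin} sha256 drifted",
              file=sys.stderr)
    # Pin contents (key names are assumptions): the snapshot sha reconciled
    # against, the on-disk size, and the origin.
    return {"sha256": pinned_sha, "size": len(data), "origin": origin}


if __name__ == "__main__":
    pinned = hashlib.sha256(b"original bytes").hexdigest()
    print(adopt(b"changed bytes", pinned, "demo-slug", "mirror"))
```

Recording the pinned sha rather than the observed one in the sidecar is what lets a later snapshot revision be detected: the pin no longer matches the snapshot, so the loader re-fetches.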
Fixed
- Wheel-install path crashes after long-running work.
scripts/pipeline/hydrate.py (success-log + the FileNotFoundError
message), scripts/pipeline/tighten_variant.py (workdir +
three log lines), scripts/pipeline/overnight_profile.py
(LOG_PATH, STATE_PATH, _slug_already_built, _wipe_slug,
manifest loader), scripts/pipeline/list_datasets.py, and
scripts/pipeline/browse.py all routed through
REPO_ROOT / "outputs/..." or .relative_to(REPO_ROOT), which is fragile
under wheel installs and any RAINCLOUD_HOME / RAINCLOUD_OUTPUTS /
RAINCLOUD_WORKDIR redirect, where it raised ValueError at the
tail of a multi-hour build. All call sites now use the env-aware
display_path() / outputs_root() / raw_downloads_root() /
workdir_root() helpers. New
tests/test_pipeline_path_hermeticity.py greps the pipeline package
on every test run and fails on regressions.
- Build-availability now distinguishes a missing extra from a broken
install. _resolve captures why scripts.pipeline.build can't be
imported: a plain ImportError / ModuleNotFoundError (the [build]
extra isn't installed) still yields the "install raincloud[build]"
hint, but any other module-init failure (a handler raising at top
level, a malformed packaged manifest) now surfaces the real exception
in the BuildToolingMissing message instead of misdirecting the user
to a pip install they've already done.
- Local build failures are typed. A non-zero
scripts.pipeline.build subprocess now raises BuildFailed
(a RaincloudError) instead of leaking a raw
subprocess.CalledProcessError past the loader's typed-error contract.
- read_pin rejects non-object JSON. A torn/partial or tampered
.pin sidecar containing valid-but-non-dict JSON (42, [...]) now
returns None rather than a value whose later .get(...) would raise
AttributeError inside resolve().
- Sha-less cache files are size-checked, not trusted on existence.
For a slug with no pinned sha (most of the catalog), a cache hit serves
via the adoption pin or a snapshot-byte-size match; a cached file whose
size diverges from the snapshot with no pin vouching for it is treated
as corruption and re-fetched, rather than served on mere existence.
- Catalog parquet visibility for snapshot-only slugs. When a slug
is in docs/v1/snapshot.json but absent from sources.json
(legacy / deprecated entry still on a mirror), Catalog.entry() now
exposes the parquet format via the same
snap.get('parquet_bytes') is not None clause that already covered
vortex. Loadable now; previously raised FormatUnavailable.
- Dataset.scan() stderr note. When a slug was loaded as vortex
but scan() needs the parquet sibling (DuckDB has no Vortex
reader), the loader prints [raincloud] scan() needs parquet but <slug> was loaded as 'vortex'; resolving parquet sibling ... before
the resolve, so an implicit mirror fetch isn't a surprise.
- .part tmp race fixed. _resolve.resolve now writes to
f".<name>.<pid>-<uuid8>.part" (per-process unique) and sweeps
stale .part siblings older than six hours on each resolve, so
concurrent loaders no longer clobber ea...
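The read_pin fix listed above amounts to checking the decoded JSON type before anyone calls .get(...) on it. A minimal sketch, assuming the sidecar is a UTF-8 JSON file: parse_pin is a hypothetical helper introduced for illustration, and treating unparsable or unreadable pins as None is an assumption (the notes only specify the valid-but-non-dict case).

```python
import json
from pathlib import Path
from typing import Optional


def parse_pin(text: str) -> Optional[dict]:
    """Return the pin as a dict, or None if it is not a JSON object."""
    try:
        value = json.loads(text)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None  # assumption: unparsable pins are treated as absent
    # Valid JSON such as 42 or [1, 2] has no .get(); reject it here.
    return value if isinstance(value, dict) else None


def read_pin(path: Path) -> Optional[dict]:
    """Read a sidecar file; a missing or unreadable file yields None."""
    try:
        return parse_pin(path.read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError):
        return None


if __name__ == "__main__":
    for sample in ("42", "[1, 2]", '{"origin": "mirror"}'):
        print(repr(sample), "->", parse_pin(sample))
```

Returning None for every malformed case lets the caller treat a bad pin the same way as a missing one and fall through to its normal cache checks.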
v0.1.5 - fix ColumnsModal crash on duplicate column names
Fixed
- TUI Columns modal crash on slugs with duplicate column names.
Eleven slugs ship parquet schemas with legitimately repeated
top-level column names: the osmi-mental-health-in-tech-* survey
series (2016 through 2023) repeats "Why or why not?" follow-ups
under each yes/no item, and uci-spambase, uci-parkinsons, and
uk-price-paid each have one or more repeated headers. The new
Columns modal used the bare column name as the Textual DataTable
row key, so the second occurrence crashed with DuplicateKey.
Repeated names are now suffixed with (2), (3), etc. for
display + lookup; the by-name stats dict no longer silently
collapses entries either. The underlying parquet's column names
are unchanged.
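The suffixing scheme can be sketched in a few lines. This is an illustration, not the TUI's code: the function name disambiguate and the exact suffix spacing ("name (2)") are assumptions, and the sketch does not guard against a real column that is already literally named "x (2)".

```python
from collections import Counter
from typing import List


def disambiguate(names: List[str]) -> List[str]:
    """Give each repeated column name a unique display/lookup key."""
    seen: Counter = Counter()
    keys: List[str] = []
    for name in names:
        seen[name] += 1
        count = seen[name]
        # First occurrence keeps the bare name; later ones get (2), (3), ...
        keys.append(name if count == 1 else f"{name} ({count})")
    return keys


if __name__ == "__main__":
    print(disambiguate(["id", "Why or why not?", "Why or why not?"]))
```

Because the keys are derived only for display and lookup, the column names stored in the parquet file stay untouched.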
v0.1.4 - catalog discoverability + per-column profiles for all 249 specs
Added
- Catalog discoverability: new ways to navigate the 249-spec
catalog without scrolling the full docs/v1/datasets.md.
- TUI faceted side panel (browse.py): filter groups for showcase,
domain tags, size, shape traits, license, fetch type. View-preset bar
(encoding, stress) on top, selectable from the View row. Counts
header shows N of 249.
- TUI search: / focuses a search input above the table. Bare
tokens match any field (substring, case-insensitive); qualified
clauses (slug:foo desc:bar tag:enums col:lat lic:cc0 handler:… reader:… fetch:…) scope to one field. Clauses AND together and AND
with the facet selection. Aliases: name / desc[ription] /
tag[s] / col[umn][s] / lic[ense].
- TUI Columns-modal rendering refresh: pessimistic per-codepoint
cell-width accounting fixes Sinhala / CJK / Arabic content overflow.
Block-glyph histograms scale with pane width; new x-axis tick labels
(lo / mid / hi) under each numeric histogram and horizontal bars
for top-K string distributions. Modal widened (90% → 95%); the
unreliable yellow border replaced by $surface background contrast.
- Per-column profiles: new opt-in stage
python -m scripts.pipeline.profile [<slug>] produces
outputs/v1/<slug>/profile.json with per-dtype stats (numeric
histograms, string NDV + top-K, bool T/F/null, date/timestamp ranges,
list/map length stats). Surfaced in the TUI's detail pane and via
list_datasets --inspect <slug>. Auto-promotes the result into
docs/v1/profiles/<slug>.json so fresh clones can render sparklines
without rebuilding; --no-promote opts out.
- promote_profiles tooling: new
python -m scripts.pipeline.promote_profiles mirrors built
profiles into the tracked docs/v1/profiles/ directory. Idempotent
(byte-identical destinations are skipped); --check for CI audits.
- List-element dtypes in profiles. profile.py now renders list,
large_list, and fixed_size_list element types recursively in the
dtype label (list<float>, fixed_size_list<float>[100],
list<struct>). Downstream consumers (e.g. autotag) can
distinguish embedding-shaped columns from lists-of-structs without
re-opening the parquet.
- Editorial metadata in sources.json: optional tags (closed
vocab, 13 data-kind entries grouped by content axis:
string → urls / prose / enums / identifiers / code-strings;
numeric → timestamps / embeddings / counts / monetary / measurements;
payload → coordinates / binary-payload / nested-json) and
showcase (closed vocab, 2 tiers: encoding / stress) per
DatasetSpec. scripts.pipeline.autotag proposes tags from each
slug's profile + handler/slug-name fallbacks; hand-edit in
sources.json after that like any other manifest field.
- Public BI workload descriptions. All 46 bi-* slugs in the
Public BI Benchmark now carry per-workload descriptions grounded in
actual column names rather than the workbook label, with a
data-shape lead (N rows × M cols, dtype-family mix, notable
columns) and a Background: note. Many workbook names mislead about
contents, e.g. bi-romance is Instagram social posts;
bi-physicians is CMS Medicare payment records; bi-iglocations1
is US Census geographic codes; bi-eixo and bi-uberlandia share a
schema with bi-mulheresmil (a Brazilian education program). Two
slugs (bi-arade, bi-wins) retain a generic description because
their columns are anonymised beyond recognition.
- Derived signals in docs/snapshot.json: per-slug shape_traits
(has_nested, has_timestamp, has_variant, string_heavy, wide_row,
high_cardinality_present) and size_bucket (xs/s/m/l/xl), derived by
docs.py from on-disk parquets.
- CLI parity: list_datasets gains --tag, --showcase, --size,
--trait (with ! negation), --view, --inspect, --tags-help,
--showcase-help. --inspect falls back from the built-parquet
profile to the tracked docs/v1/profiles/<slug>.json mirror, so a
fresh clone can inspect any slug in the catalog without rebuilding.
- Curated-picks header in docs/v1/datasets.md: one block per
showcase tier, regenerated from sources.json.
- README "Discover" subsection: directs newcomers at the TUI first.
- Skills: new raincloud-profile, new raincloud-discover; updated
raincloud-list-datasets, raincloud-build.
- Tracked profiles for all 249 specs. docs/v1/profiles/ ships a
per-slug profile for every entry in the manifest, including the
multi-hour heavyweights (clickbench-hits, fineweb-sample-10bt,
wikipedia-structured-contents, jsonbench-bluesky-100m,
osm-germany-nodes, the OpenLibrary dumps, etc.). A fresh clone can
render the TUI Columns pane and use list_datasets --inspect <slug>
on any slug without building anything locally.
Changed
- autotag enums classifier tightened. A string column counts as
enum-shaped only when ndv ≤ 32 AND mean_len ≤ 24, or when
ndv ≤ 256 AND ndv/rows ≤ 0.001 AND mean_len ≤ 24 for very wide
datasets. The slug-level enums tag additionally requires ≥2
qualifying columns, so a single class-label column no longer
promotes the whole dataset to enum-shaped.
- autotag embeddings detection now reads the list-element dtype
written by profile.py and recognises list<float> / list<double>
/ fixed_size_list<float> columns as embeddings without relying on
slug-name heuristics. The remaining slug-name fallback uses
word-boundary matching (\b(embeddings?|word vectors?|dense vector| glove|word2vec|fasttext|encoder output)\b) so unrelated copy like
"sensors embedded in …" no longer matches.
Removed
- DatasetSpec.family field and --family CLI flag. The field was
used to invoke batched builds (python -m scripts.pipeline.build --family uci); each slug is now invoked by name, and --all remains
available for whole-catalog passes. Pass multiple slugs space-separated
to build / convert for ad-hoc batches.
- Subject-matter TAG_VOCAB (12 entries: geospatial / nlp-text /
web-analytics / e-commerce / finance / social / scientific /
healthcare / sports / transportation / government / benchmark)
replaced by the 13 data-kind vocab above.
- curation.json + scripts/pipeline/curate.py + tests/test_curate.py
removed. Tags now sit inline in sources.json alongside
description / license / showcase. The curate apply bridge is
gone.
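The tightened autotag rules listed under Changed for this release (the enum thresholds, the two-column requirement, and word-boundary matching for the slug-name fallback) can be sketched as follows. This is an illustrative reading of the thresholds, not the autotag source: the function names are hypothetical, the regex keeps only a subset of the alternatives quoted in the notes, and the case-insensitive flag is an assumption.

```python
import re
from typing import Iterable, Tuple


def is_enum_shaped(ndv: int, rows: int, mean_len: float) -> bool:
    """Apply the two threshold branches to one string column."""
    if mean_len > 24:  # both branches require short values
        return False
    if ndv <= 32:  # few distinct values
        return True
    # Larger vocabularies qualify only when tiny relative to the row count.
    return rows > 0 and ndv <= 256 and ndv / rows <= 0.001


def slug_gets_enums_tag(columns: Iterable[Tuple[int, int, float]]) -> bool:
    """The slug-level tag needs at least two qualifying columns."""
    return sum(is_enum_shaped(*col) for col in columns) >= 2


# Subset of the quoted alternatives; \b keeps "embedded" from matching.
EMBEDDING_HINT = re.compile(
    r"\b(embeddings?|word vectors?|word2vec|fasttext)\b", re.IGNORECASE
)

if __name__ == "__main__":
    print(is_enum_shaped(ndv=200, rows=1_000_000, mean_len=10.0))
    print(slug_gets_enums_tag([(5, 100, 8.0)]))
    print(bool(EMBEDDING_HINT.search("sensors embedded in walls")))
```

Each column is passed as an (ndv, rows, mean_len) tuple, which is a convenience of this sketch rather than the profile format.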
Fixed
- profile.py DECIMAL overflow in histogram-bucket SQL. DuckDB was
inferring DECIMAL types from inlined lo_f / hi_f Python repr
(e.g. 0.26851799179226266 → DECIMAL(18,17)); (value - lo) * 10
then overflowed. All histogram-bucket literals are now ::DOUBLE-cast.
- profile.py zero-length identifier on empty column names. Some
upstream CSVs ship an unnamed pandas-index column whose Arrow field
has name == ""; DuckDB rejects empty delimited identifiers. Skip
with a placeholder __unnamed_column__ entry.
- profile.py TIME-of-day column cast. DuckDB doesn't implement
CAST(time AS TIMESTAMP); standalone TIME columns now route through
the string profile (null_count + NDV + top-K of rendered HH:MM:SS).
- profile.py fixed_size_list columns were silently profiled as
null because the dispatcher only checked is_list / is_large_list.
They now route through the list profile and pick up the new
element-type rendering, so e.g. glove-6b-100d's
vector: fixed_size_list<float>[100] is fully described.
- WDI re-enabled. The upstream redirect target
databankfiles.worldbank.org serves an expired TLS cert, so Python's
default urllib refused the connection. The new fetch.verify_tls
field (boolean, default true) lets a slug bypass verification when
its expected_sha256 provides independent integrity. WDI ships at
395,276 rows × 70 columns (70 MB parquet).
Schema
- sources.schema.json adds three optional fields, all additive
(existing manifests are accepted unchanged):
  - DatasetSpec.tags (array of TAG_VOCAB strings, default []).
  - DatasetSpec.showcase (array of SHOWCASE_TIERS strings, default []).
  - DatasetSpec.fetch.verify_tls (boolean, default true): escape
    hatch for upstreams whose TLS certs have rotted but whose payload
    integrity is gated by expected_sha256.
- New profile.schema.json (Draft 2020-12) for the per-slug profile
output format.
v0.1.3 - validate warns-not-raises; schema_hash prefix match
Changed
- Validate stage no longer hard-fails on row/schema_hash drift by
default. A mismatch now emits a [WARN] line to stderr and the build
continues. Users invoking python -m scripts.pipeline.build <slug> have
already opted into "fetch whatever is upstream now"; an upstream
Arrow-conversion bump or a slightly-grown row count shouldn't turn that
into a failed build. Pass --strict (new flag on scripts.pipeline.build) to
upgrade warnings to errors; this is recommended for CI / pre-release gates.
- The previous --loose flag has been removed; its behaviour (warn, don't
raise) is now the default. Migrate --loose invocations by dropping the
flag entirely; replace any "default-strict" CI invocations with
--strict.
Fixed
- validate.py now compares expect.schema_hash as a prefix when the
manifest value is shorter than the full 64-char SHA-256. All 37 slugs
with schema_hash set in sources.json use a 12-char short hash
(matching the [validate] schema_hash= print convention, akin to git
short SHAs); the previous full-string equality made every one of them
fail validation on rebuild. Equal-length values still use strict
equality, so full hashes remain enforceable for callers that prefer
them.
- sources.schema.md updated to document the prefix-match rule and the
new warn vs. --strict semantics for the expect block.
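The prefix-match rule and the warn-versus-strict behaviour described for this release can be sketched together. This is a hedged illustration rather than the validate.py code: both function names are hypothetical, and raising ValueError under strict is an assumption standing in for whatever error the pipeline actually uses.

```python
import hashlib
import sys


def schema_hash_matches(expected: str, actual: str) -> bool:
    """Prefix match for short manifest hashes, equality otherwise."""
    if len(expected) < len(actual):
        # A short hash (e.g. 12 chars) must be a prefix of the full digest.
        return actual.startswith(expected)
    # Equal-length values are compared strictly.
    return expected == actual


def check_schema_hash(expected: str, actual: str, strict: bool = False) -> bool:
    """Warn on drift by default; raise only when strict is requested."""
    ok = schema_hash_matches(expected, actual)
    if not ok:
        message = f"schema_hash drift: expected {expected}, got {actual}"
        if strict:
            raise ValueError(message)  # assumption: the real error type may differ
        print(f"[WARN] {message}", file=sys.stderr)
    return ok


if __name__ == "__main__":
    full = hashlib.sha256(b"example schema").hexdigest()
    print(check_schema_hash(full[:12], full))
    print(check_schema_hash("0" * 64, full))
```

Keeping the comparison separate from the reporting policy means the same matcher serves both the default warn path and the strict CI gate.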
v0.1.2 - uv sync --inexact + TUI auto-installs upstream extras
Fixed
- All uv sync instructions across the docs (README, AGENTS, CONTRIBUTING,
SKILLS, in-code install hints, and skill files) now pass --inexact so
installing one extra no longer uninstalls the others. Without this, the
documented sequential setup (uv sync --extra tui → bare uv sync →
uv sync --extra huggingface) silently left the user with only the last
extra installed, and subsequent builds of HF/Kaggle slugs failed with
ImportError. uv has no project-level toggle for this (--inexact is
per-command), so the fix is documentation-wide.
Changed
- TUI build action (python -m scripts.pipeline.browse, then b on a row)
now runs uv sync --extra <kaggle|huggingface> --inexact automatically
before the build subprocess when the dataset's fetch.type requires an
upstream-fetch backend. Sync output streams into the same RichLog as the
build; sync failure aborts the build with a visible exit code. Pure-HTTP
and custom-fetch slugs see the same flow as before (no extra sync).
BuildConfirmModal surfaces the sync command line above the build command
line so the user sees both before confirming.
v0.1.1 - convert streaming, docs/snapshot fallback
Added
- README badges (CI status, latest release, license, citation).
Changed
- Convert stage now streams parquet batches via
pf.iter_batches() → RecordBatchReader → vxio.write instead of materialising whole tables.
Resolves ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs from pyarrow on slugs whose nested columns
(list<struct>, struct<bytes,…>) would need to be chunked across multiple
Arrow arrays. Re-enables Vortex output for osm-germany-ways,
ultrachat-200k, mmmu, websight-v01, peoples-speech-clean-validation.
- code-contests Vortex skip re-diagnosed: not the chunked-array path; a
separate upstream FSST i32-offset overflow on list<string> >2 GB.
- open-food-facts description aligned with shipped output (currently a
single raw_json: string column via jsonl_as_string_parse; VARIANT
promotion deferred).
- PR template: dropped the "Test plan" checklist (CI runs the same gates on
every PR; CONTRIBUTING.md documents them once).
- Agent-tooling docs (AGENTS.md, SKILLS.md, raincloud-docs skill) now flag
docs/snapshot.json as load-bearing: TUI fallback and the
row-count / file-size fallback for datasets.md regen. Stale "six derived
docs" reference in AGENTS.md cleaned up to three.
Fixed
- docs/datasets.md regeneration now falls back to docs/snapshot.json
(top-level scratch, then docs/v{schema_version}/snapshot.json on a fresh
clone) for slugs whose parquet isn't built locally. Previously,
partial-build regen would silently dash-out row counts and file sizes for
any slug not on disk, destroying ground truth in the v1 snapshot. Snapshot
regen now also captures last_built_row_groups. Five regression tests
added in tests/test_docs.py.
v0.1.0 - initial public release
Initial public release.
Raincloud is a client-reproducible pipeline for building a curated catalog
of public datasets as analytics-ready Parquet + Vortex files. See
README.md for the user-facing overview,
AGENTS.md for the architecture, and
SKILLS.md for procedural playbooks.
This release bundles:
- The 7-stage build pipeline (fetch → extract → parse → transform → write
→ validate → convert) plus the optional opt-in hydrate stage.
- 249 dataset specs across 5 families (direct, kaggle-upstream,
nyc-tlc, public-bi, uci).
- 24 named transform handlers covering CSV / Parquet / JSONL / XML / PBF /
custom-format upstreams plus streaming variants for memory-constrained
shapes.
- A read-only Textual TUI for browsing the catalog
(python -m scripts.pipeline.browse, requires --extra tui).
- Per-dataset Vortex conversion via the convert.vortex flag.
- Apache License 2.0, with SPDX file headers on all Python sources.
- Governance: SECURITY.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md
(Contributor Covenant 2.1), DISCLAIMER.md (AS IS posture, content
and license disclaimers, dataset-removal reporting), and
HYDRATING.md (policy for the optional hydrate stage).
- Tooling: ruff lint (rules E, F, W, I) + GitHub Actions CI
(.github/workflows/ci.yml) running lint, manifest validation, and
pytest on every push and PR to develop.
- Dataset-removal issue template
(.github/ISSUE_TEMPLATE/dataset-removal.yml): structured form for
the channel DISCLAIMER.md points readers at.
- Pull-request template (.github/pull_request_template.md) prompting
for summary, test-plan checkbox list against the standard pre-PR gate,
and change-type tags.
- CITATION.cff: GitHub-native citation metadata; surfaces the "Cite
this repository" button in the repo sidebar with BibTeX / APA / Chicago
exports.