Merged
2 changes: 1 addition & 1 deletion .devague/current
Original file line number Diff line number Diff line change
@@ -1 +1 @@
data-refinery-cli-ships-the-storage-data-quality-i
data-refinery-now-owns-store-file-migration-a-cons
Original file line number Diff line number Diff line change
@@ -0,0 +1,181 @@
{
"slug": "data-refinery-now-owns-store-file-migration-a-cons",
"title": "data-refinery now owns store-file migration: a consumer upgrades an on-disk store to the current Envelope format by supplying only a transform, never constructing a filesystem write path \u2014 files granularity first",
"schema_version": 1,
"status": "exported",
"created": "2026-06-21T14:46:10Z",
"updated": "2026-06-21T14:57:43Z",
"claims": [
{
"id": "c1",
"kind": "announcement",
"text": "data-refinery now owns store-file migration: a consumer upgrades an on-disk store to the current Envelope format by supplying only a transform, never constructing a filesystem write path \u2014 files granularity first",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h4",
"text": "the endpoint ships BOTH importable (store.migrate) and as a CLI verb (store migrate), both documented in the pinnable contract.md with a version bump",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c2",
"kind": "audience",
"text": "eidetic-cli (first consumer over the import + subprocess boundary) and any future consumer of data-refinery's store boundary",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h5",
"text": "eidetic can reach the endpoint over BOTH the import boundary (callable transform) and the subprocess boundary (self-canonicalize); no third component is needed",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c3",
"kind": "after_state",
"text": "a consumer upgrades a populated legacy on-disk store to the current Envelope-JSONL format by calling data_refinery.store.migrate(transform) (import) or 'data-refinery store migrate' (subprocess), supplying only a transform/target format \u2014 data-refinery resolves the store root and owns the atomic per-file rewrite",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h6",
"text": "in the eidetic call site, the only argument eidetic supplies is a transform callable (and optionally the store root it already owns) \u2014 never a constructed per-file *.jsonl.tmp path",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c4",
"kind": "before_state",
"text": "eidetic's migrate_store.py globs the operator-supplied store dir, writes *.jsonl.tmp then os.replace; SonarCloud flags that consumer-side write sink as pythonsecurity:S2083 BLOCKER, which is structurally unsatisfiable for a local CLI and fails eidetic's gate",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h7",
"text": "the S2083 BLOCKER is on eidetic's write sink in migrate_store.py and is unsatisfiable there because writing into the operator's chosen dir IS the feature",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c5",
"kind": "why_it_matters",
"text": "the path-construction concern (and the S2083 sink) belongs to the component that OWNS the storage layout; moving it behind data-refinery's boundary lets eidetic delete migrate_store.py and go green without any in-repo rule suppression",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h8",
"text": "after the cutover eidetic's gate clears with zero in-repo suppression (no # NOSONAR, no sonar exclusion entry for migrate_store.py)",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c6",
"kind": "success_signal",
"text": "eidetic deletes migrate_store.py + its tests and replaces 'eidetic migrate store' with a thin call into data-refinery; eidetic's S2083 BLOCKER disappears and its gate goes green with no rule suppression; re-running migrate converts nothing (idempotent); an interrupted run is safe to resume (atomic per file)",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h9",
"text": "all four issue-#8 acceptance criteria are demonstrably met by a live test: upgrade-without-path, idempotent, atomic-per-file, eidetic deletes the module",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c7",
"kind": "boundary",
"text": "no eidetic Record/memory semantics leak into data-refinery; files backend granularity FIRST (mongo/vectors then neo4j/graph are later granularities); not a general ETL framework",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h10",
"text": "mongo/neo4j migration get a clean extension seam (a backend-level hook) but only the files backend actually rewrites now; data-refinery never imports eidetic's Record schema",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c8",
"kind": "decision",
"text": "the importable store.migrate(transform) takes a Python callable Callable[[dict], Envelope|None]; the 'data-refinery store migrate' CLI verb canNOT cross a callable over argv, so it only re-canonicalizes data-refinery's OWN Envelope-JSONL (re-validate + re-fill hash + atomic rewrite) \u2014 a self-heal/format-version bump, never a consumer transform",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [],
"hard_questions": [],
"links": []
},
{
"id": "c9",
"kind": "requirement",
"text": "the rewrite is atomic per file (tmp sibling in the same dir + os.replace) and idempotent (a file already in target format is left byte-identical; a re-run converts nothing)",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h1",
"text": "running migrate twice over the same store yields a byte-identical store on the second run, and killing the process mid-rewrite leaves either the old or the new file intact (never a partial/truncated file), because os.replace is atomic on POSIX",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c10",
"kind": "requirement",
"text": "data-refinery resolves and validates the store root internally (canonicalize via os.path.realpath + containment-check via os.path.commonpath against an owner-controlled root); the consumer supplies a root directory or a transform, never a constructed per-file write path",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h2",
"text": "a migrate call whose resolved per-file path escapes the canonicalized store root (e.g. via a symlink) is refused with a structured code-2 CliError, and Sonar's S2083 taint is satisfiable here because the sink reasons against an owner-canonicalized root rather than a raw consumer arg",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
},
{
"id": "c11",
"kind": "requirement",
"text": "every transformed line is validated against the Envelope shape and the public/private scope no-leak (can_serve) before being written; an unparseable/invalid legacy line fails the migration with a structured CliError, never a traceback",
"origin": "user",
"status": "confirmed",
"honesty_conditions": [
{
"id": "h3",
"text": "a legacy line that does not transform into a valid Envelope (bad shape, or an unrecognised scope.visibility) aborts the file's migration before any os.replace, leaving the original file untouched, and emits error:/hint: on stderr with no traceback",
"status": "confirmed"
}
],
"hard_questions": [],
"links": []
}
],
"open_vagueness": []
}
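Claim c10 describes the store-root check as canonicalising with `os.path.realpath` and testing containment with `os.path.commonpath`. The following is a minimal sketch of that check, not data-refinery's actual code (which this diff does not show). The helper name `resolve_inside` is hypothetical, and a plain `ValueError` stands in for the structured code-2 `CliError`.

```python
import os


def resolve_inside(root: str, relative_name: str) -> str:
    """Return the canonical path of relative_name under root, or raise ValueError.

    Hypothetical helper: the name and the ValueError are illustrative only.
    """
    canonical_root = os.path.realpath(root)
    # realpath follows symlinks, so a link pointing outside the root is exposed here.
    candidate = os.path.realpath(os.path.join(canonical_root, relative_name))
    # commonpath equals the root only when the candidate lies inside it.
    if os.path.commonpath([canonical_root, candidate]) != canonical_root:
        raise ValueError(f"{relative_name!r} escapes the store root")
    return candidate
```

The point of the design is that the write sink reasons against a root the owner has already canonicalised, rather than against a raw caller argument.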
2 changes: 2 additions & 0 deletions .gitignore
@@ -234,3 +234,5 @@ skills.local.yaml

# data-refinery stack bind-mount data (docker-compose volumes)
.data/

.devague/reviews/
9 changes: 8 additions & 1 deletion AGENTS.colleague.md
@@ -22,7 +22,14 @@ files/mongo/neo4j `Backend`, also importable as `data_refinery.store`), and a
(`{id, hash, content, scope, metadata}`) and never interprets them as "memories"
— that semantics stays in eidetic, the first consumer over a
subprocess-not-import boundary. Waves 1 (stack) and 2 (store + quality) are
built; Wave 3 (the pinned verb contract + eidetic consumption) is open.
built; Wave 3's first slice (issue #8) — the **store-migration endpoint**
`data_refinery.store.migrate(transform, …)` + `data-refinery store migrate` — is
built: a consumer upgrades a populated store to the current Envelope format by
supplying only a *transform* (never a filesystem write path), so the rewrite —
and its path-construction concern — lives behind data-refinery's boundary. It is
**atomic per file** (temp sibling + `os.replace`) and **idempotent**
(byte-identical 2nd run); **files granularity only** today (mongo/neo4j raise).
The rest of Wave 3 (the pinned verb contract + eidetic consumption) is open.
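The "temp sibling + `os.replace`" pattern named above can be sketched as follows. This is an illustration under stated assumptions, not the project's `_atomic_write`: the function name `atomic_write_text`, the use of `tempfile.mkstemp`, and the `fsync` call are all choices made for this sketch.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to path so a reader sees the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file is created in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # assumption: flush to disk before the swap
        os.replace(tmp_path, path)  # swap the finished temp file into place
    except BaseException:
        # On any failure, remove the temp sibling and leave the original untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the target is only ever replaced by a fully written file, a process killed mid-write leaves at worst an orphan temp file beside an intact original.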

## Names (keep them straight)

14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,20 @@ All notable changes to this project will be documented in this file.
Format follows [Keep a Changelog](https://keepachangelog.com/). This project
adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0] - 2026-06-21

### Added

- store migration endpoint (issue #8): importable `data_refinery.store.migrate(transform, *, backend, base_dir, dry_run)` mirrored by the `data-refinery store migrate` CLI verb. A consumer upgrades a populated on-disk store to the current Envelope format by supplying only a transform, never constructing a filesystem write path (the rewrite, and its path-construction concern, lives behind data-refinery's boundary).
- `store migrate` is atomic per file (temp sibling + `os.replace`) and idempotent (a second run rewrites nothing, leaving the store byte-identical); files granularity only today (`mongo`/`neo4j` raise a structured `CliError` as the files-first seam).

### Changed

- `FilesBackend` writes are now atomic via a shared `_atomic_write` helper (temp sibling + `os.replace`), hardening the day-to-day upsert/delete path against truncate-on-crash, not just migration.
- `store migrate` validation is now whole-store (was per-file): every scope file is transformed and validated before any write, so a corrupt line, an invalid transform output, or a symlink escape in any file aborts the whole migration before it touches disk. Orphan temp-file reaping moved to the start of the run. (Folded from a colleague review pass.)
- store I/O faults now obey the exit-code contract. An unreadable/unwritable scope file (permissions, full disk, a failed `os.replace`) surfaces as a structured `CliError` with exit code 2 and an actionable remediation. A valid-JSON-but-non-object line, or a record missing its `id`, surfaces as a code-2 "corrupt line". Previously these were a raw `OSError`/`AttributeError`/`KeyError` that the dispatcher wrapped as a generic code-1 "unexpected" error. Applied to both the migration and the day-to-day load path via a shared `_atomic_write` / `_corrupt_line`. (Folded from a Qodo review pass on PR #9.)
- `docs/contract.md` is now contract version 3 (adds the store-migration endpoint).
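The whole-store rule in the entries above (validate everything, then write) can be sketched as a two-phase loop. This is an assumption-laden illustration, not the shipped `migrate.py`. The name `migrate_all`, the flat `*.jsonl` layout, the sorted-key output, and the plain `.tmp` suffix are all hypothetical, and `ValueError` stands in for the structured `CliError`.

```python
import json
import os


def migrate_all(root: str, transform) -> int:
    """Transform every *.jsonl file under root; write only if every file passes."""
    planned = {}
    # Phase 1: read, transform and validate everything in memory.
    for name in sorted(os.listdir(root)):
        if not name.endswith(".jsonl"):
            continue
        path = os.path.join(root, name)
        with open(path, encoding="utf-8") as handle:
            lines = [line for line in handle if line.strip()]
        out = []
        for number, line in enumerate(lines, start=1):
            record = json.loads(line)  # JSONDecodeError is a ValueError
            if not isinstance(record, dict) or "id" not in record:
                raise ValueError(f"{name}:{number}: corrupt line")
            out.append(json.dumps(transform(record), sort_keys=True))
        planned[path] = "".join(item + "\n" for item in out)
    # Phase 2: nothing failed, so commit each file with a temp sibling + os.replace.
    for path, text in planned.items():
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    return len(planned)
```

A failure anywhere in phase 1 raises before any file is opened for writing, so a single bad line in one scope file leaves every other file untouched.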

## [0.5.2] - 2026-06-21

### Changed
34 changes: 27 additions & 7 deletions CLAUDE.md
@@ -27,10 +27,21 @@ Protocol; plus the **data-quality verbs** (`validate` / `dedup` / `integrity` /
lazy-imported behind the optional `[store]` extra. The remaining code is the
inherited *agent-first introspection scaffold* (`whoami` / `learn` / `explain` /
`overview` / `doctor` + a `cli` noun), cited from
[teken](https://github.com/agentculture/teken)'s `python-cli` reference. **Wave
3** (the full pinnable verb-JSON contract + eidetic consuming the surface over
the subprocess boundary) is still open; the build order lives in **issue #1** /
**issue #3** (see "Domain roadmap").
[teken](https://github.com/agentculture/teken)'s `python-cli` reference. *Wave 3,
first slice* (issue #8): the **store-migration endpoint** — the importable
`data_refinery.store.migrate(transform, *, backend, base_dir, dry_run)`
(`data_refinery/store/migrate.py`) mirrored by `data-refinery store migrate`, so a
consumer (eidetic-cli) upgrades a populated store to the current Envelope format
by supplying only a *transform* and **never constructing a filesystem write
path** — moving the path-construction (and eidetic's `S2083`) sink to the
storage owner. The rewrite is atomic per file (temp sibling + `os.replace`, a
shared `_atomic_write` that also hardened the day-to-day `upsert`) and idempotent
(byte-identical 2nd run). **Files granularity only** today; `mongo` (vectors) /
`neo4j` (graph) raise a structured `CliError` (the files-first seam). The CLI
verb self-canonicalises data-refinery's own format (no callable crosses argv).
The rest of **Wave 3** (freezing the full pinnable verb-JSON contract + eidetic
consuming the surface over the subprocess boundary) is still open; the build
order lives in **issue #1** / **issue #3** / **issue #8** (see "Domain roadmap").
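The self-canonicalising behaviour of the CLI verb (re-validate, re-fill a missing hash, skip files that are already canonical) can be sketched as below. The details are assumptions: the canonical form is taken to be sorted-key JSON, the hash is taken to be the hex SHA-256 of the UTF-8 content, and the names `canonical_line` and `plan_rewrite` are hypothetical.

```python
import hashlib
import json
from typing import Optional


def canonical_line(raw: str) -> str:
    """Return one JSONL line in the assumed canonical form, filling a missing hash."""
    record = json.loads(raw)
    if not record.get("hash"):
        digest = hashlib.sha256(record["content"].encode("utf-8")).hexdigest()
        record["hash"] = digest
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def plan_rewrite(text: str) -> Optional[str]:
    """Return the canonical file text, or None when the file is already canonical."""
    canonical = "".join(
        canonical_line(line) + "\n" for line in text.splitlines() if line.strip()
    )
    # Comparing against the input means a canonical file is never rewritten.
    return None if canonical == text else canonical
```

Returning `None` for an already-canonical file is what makes a second run a no-op: there is nothing to write, so the file stays byte-identical.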

## Names: there are three, and they differ on purpose

@@ -254,11 +265,20 @@ the default), the importable `data_refinery.store.put/get/list` mirrored by the
(`data_refinery/quality/`). Idempotent dedup + the public/private scope no-leak
(`can_serve`, enforced by every backend's `get`/`list`) are the load-bearing
invariants. README, `AGENTS.colleague.md`, `learn`, `overview`, the explain
catalog, and `docs/contract.md` (now contract version 2) were updated for the
surface. What is still open:
catalog, and `docs/contract.md` (now contract version 3) were updated for the
surface. *Wave 3, first slice* (issue #8) = the **store-migration endpoint**:
`data_refinery.store.migrate(transform, *, backend, base_dir, dry_run)` +
`data-refinery store migrate`, so a consumer upgrades a populated store to the
current Envelope format supplying only a transform (never a write path), behind
data-refinery's boundary — which lets eidetic delete its path-constructing
`migrate_store.py` and clears its `S2083` BLOCKER. Atomic per file
(`_atomic_write`, also applied to `upsert`) + idempotent (byte-identical 2nd run)
are the new load-bearing invariants; files granularity only (mongo/neo4j raise).
What is still open:

1. **Wave 3 — the full pinnable verb contract + eidetic consumption** over the
subprocess boundary (eidetic drops/thins `neo4j`+`pymongo`). The verb-JSON
subprocess boundary (eidetic drops/thins `neo4j`+`pymongo`, and replaces
`eidetic migrate store` with a thin call into `store.migrate`). The verb-JSON
shapes are documented in `docs/contract.md`; Wave 3 freezes them as the pinned
surface eidetic consumes process-to-process.
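The "thin call into `store.migrate`" relies on a consumer-side transform, which this diff does not define. A hedged sketch of what such a callable might look like follows. `EnvelopeSketch` is a hypothetical stand-in for `data_refinery.store.Envelope`, the legacy field names (`key`, `body`, `tags`) are invented for illustration, and treating a `None` return as "skip this record" is an assumption about the `Envelope | None` return type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EnvelopeSketch:
    """Hypothetical stand-in for data_refinery.store.Envelope (illustration only)."""

    id: str
    content: str
    metadata: dict = field(default_factory=dict)


def record_to_envelope(legacy: dict) -> Optional[EnvelopeSketch]:
    """Map one legacy record to an envelope; the legacy shape here is invented."""
    if not legacy.get("body"):
        return None  # assumption: None signals "nothing to carry over"
    return EnvelopeSketch(
        id=str(legacy["key"]),
        content=legacy["body"],
        metadata={"tags": legacy.get("tags", [])},
    )
```

The transform handles records only; it never sees or builds a file path, which is the property the endpoint is meant to guarantee.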

6 changes: 6 additions & 0 deletions README.md
@@ -45,6 +45,7 @@ uv run teken cli doctor . --strict # the agent-first rubric gate CI runs
echo '{"id":"a","content":"hello"}' | uv run data-refinery store put --json
uv run data-refinery store get a --json
uv run data-refinery store list --json
uv run data-refinery store migrate --json # re-canonicalise own JSONL (idempotent)
uv run data-refinery integrity --json # hash matches content?
uv run data-refinery dedup --json # collapse same-hash dups (idempotent)
echo '{"id":"a","content":"x"}' | uv run data-refinery validate --json
@@ -57,6 +58,10 @@ import data_refinery.store as store
store.put(store.Envelope(id="a", content="hello"))
store.get("a") # -> Envelope | None
store.list() # -> list[Envelope]

# Upgrade a populated legacy store to the current Envelope format — the consumer
# supplies only a transform, never a filesystem write path (data-refinery owns it):
store.migrate(record_to_envelope, base_dir="/path/to/store")
```

## CLI
@@ -65,6 +70,7 @@ store.list() # -> list[Envelope]
|------|--------------|
| `stack up\|down\|status` | Manage the storage substrate (mongo + neo4j) via docker compose. |
| `store put\|get\|list` | Put/get/list opaque envelopes (`--backend files\|mongo\|neo4j`). |
| `store migrate` | Re-canonicalise the store's own Envelope-JSONL (atomic, idempotent); consumers import `store.migrate(transform)` to upgrade a legacy store. |
| `validate` | Check envelope shape for JSON piped on stdin. |
| `dedup` | Collapse same-hash-same-scope duplicates in the store (idempotent). |
| `integrity` | Check every stored hash matches `sha256(content)`. |
40 changes: 40 additions & 0 deletions data_refinery/cli/_commands/store.py
@@ -19,6 +19,7 @@
from data_refinery.cli._errors import EXIT_USER_ERROR, CliError
from data_refinery.cli._output import emit_result
from data_refinery.store import get_backend
from data_refinery.store import migrate as store_migrate
from data_refinery.store.envelope import DEFAULT_SCOPE, Envelope, Scope

_BACKENDS = ("files", "mongo", "neo4j")
@@ -125,6 +126,31 @@ def cmd_store_list(args: argparse.Namespace) -> int:
    return 0


def cmd_store_migrate(args: argparse.Namespace) -> int:
    """Re-canonicalise the store's own Envelope-JSONL (self-heal / format bump).

    The CLI boundary cannot carry a Python transform, so this verb only
    normalises data-refinery's **own** format — re-validate each line, re-fill a
    missing hash, rewrite atomically per file. A consumer upgrading a *legacy*
    format imports ``data_refinery.store.migrate(transform)`` and supplies the
    transform there; over the CLI there is no write path for a consumer to
    construct. The store root comes from ``DR_DATA_DIR`` (never a flag), so the
    owner — not the caller — resolves where the rewrite lands.
    """
    json_mode = bool(getattr(args, "json", False))
    result = store_migrate(transform=None, backend=args.backend, dry_run=args.dry_run)
    if json_mode:
        emit_result(result, json_mode=True)
    else:
        verb = "would migrate" if args.dry_run else "migrated"
        emit_result(
            f"{verb} {result['migrated']}/{result['files']} file(s); "
            f"{result['skipped']} already canonical",
            json_mode=False,
        )
    return 0


def _store_overview(args: argparse.Namespace) -> int:
    """`data-refinery store` with no sub-verb prints the noun's overview."""
    from data_refinery.cli._commands.overview import emit_overview
@@ -136,6 +162,7 @@ def _store_overview(args: argparse.Namespace) -> int:
"store put — upsert an envelope (JSON on stdin, or --id/--content)",
"store get <id> — fetch an envelope visible to a scope",
"store list — list envelopes visible to a scope",
"store migrate — re-canonicalise the store's own Envelope-JSONL (self-heal)",
],
},
{
@@ -216,6 +243,19 @@ def register(sub: argparse._SubParsersAction) -> None:
    _add_json_flag(list_p)
    list_p.set_defaults(func=cmd_store_list)

    mig = verb.add_parser(
        "migrate",
        help="Re-canonicalise the store's own Envelope-JSONL (self-heal / format bump).",
    )
    _add_backend_flag(mig)
    mig.add_argument(
        "--dry-run",
        action="store_true",
        help="Report what would change without writing.",
    )
    _add_json_flag(mig)
    mig.set_defaults(func=cmd_store_migrate)

    ov = verb.add_parser("overview", help="Describe the store noun.")
    _add_json_flag(ov)
    ov.set_defaults(func=_store_overview)
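`cmd_store_migrate` has no error handling of its own. According to the changelog and claim c11, failures surface as a structured `CliError` (exit code 2, `error:`/`hint:` on stderr, no traceback), while anything unexpected becomes a generic code-1 error. A rough sketch of such a dispatcher follows. `SketchCliError` and `run` are hypothetical names, and the real `data_refinery.cli._errors.CliError` may have a different signature.

```python
import sys


class SketchCliError(Exception):
    """Hypothetical stand-in for data_refinery's CliError (illustration only)."""

    def __init__(self, message: str, hint: str, code: int = 2) -> None:
        super().__init__(message)
        self.message = message
        self.hint = hint
        self.code = code


def run(func, stderr=None) -> int:
    """Call a command function and map failures to exit codes without a traceback."""
    stream = sys.stderr if stderr is None else stderr
    try:
        return func()
    except SketchCliError as exc:
        # Structured failure: two labelled lines, then the declared exit code.
        print(f"error: {exc.message}", file=stream)
        print(f"hint: {exc.hint}", file=stream)
        return exc.code
    except Exception as exc:
        # Anything else is reported as the generic "unexpected" code-1 case.
        print(f"error: unexpected: {exc}", file=stream)
        return 1
```

Keeping the mapping in one place is what lets a verb like `store migrate` simply return `0` on success and raise otherwise.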