feat: store-migration endpoint (files granularity) — closes #8#9
Conversation
Expose a files/migration endpoint so a consumer upgrades a populated on-disk store to the current Envelope format WITHOUT constructing any filesystem write path itself — moving the path-construction concern (and eidetic's S2083 BLOCKER sink) to the component that owns the storage layout. - data_refinery/store/migrate.py: public migrate(transform=None, *, backend, base_dir, dry_run). Consumer supplies only a transform (each decoded legacy line -> Envelope, or None to drop); data-refinery resolves the root and owns the atomic per-file rewrite. transform=None self-canonicalises DR's own Envelope-JSONL. Files-only today; mongo/neo4j raise a structured CliError. - FilesBackend.migrate + shared _atomic_write (temp sibling + os.replace), which also hardens the day-to-day upsert/delete against truncate-on-crash. Symlink escape from the canonical root is refused (code 2). Idempotent: a file whose canonical re-serialisation equals its bytes is left untouched (byte-identical 2nd run). Already-canonical lines are kept verbatim so a re-run never re-applies the consumer transform. - CLI `data-refinery store migrate` (--backend/--dry-run/--json, no --data-dir): self-canonicalise only (no callable crosses argv); resolves root from DR_DATA_DIR, keeping the write sink off DR's own Sonar taint path. - Docs: contract v3 (+ importable migrate + invariants), explain catalog, store overview, README, CLAUDE.md / AGENTS.colleague.md Wave-3 status. - Tests: idempotency, atomic/abort-safety, transform path, self-canonicalise, validate-before-write, symlink refusal, dry-run, files-first seam, CLI verb. - devague spec under docs/specs/ (the converged /think frame). Bump 0.5.2 -> 0.6.0 (new verb). Local gate green: 169 tests, rubric 26/26, black/isort/flake8/bandit/markdownlint clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AyLu1fvaV2Ys1ZQmUUeGAc
…able commonpath guard Lift FilesBackend coverage 95% -> 97% on the migration path: - blank/whitespace lines in a scope file are skipped (not preserved) - an orphan *.jsonl.tmp from a prior interrupted run is reaped on the next migrate - mark the commonpath ValueError branch (Windows different-drives / mixed roots) # pragma: no cover — two absolute resolved POSIX paths never raise it Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AyLu1fvaV2Ys1ZQmUUeGAc
…old-in) Folds the substantive findings from a colleague review + hands-on live-test of the issue-#8 store-migration endpoint. The review confirmed atomicity, path-safety, scope-no-leak, and the files-first seam are sound, and its one "idempotency bug" was a false positive (it misread which dict is on disk on the 2nd run); the hands-on live-test passed all six guarantees with no bugs. Real improvements taken: - migrate() is now two-pass: transform + validate EVERY scope file before writing one byte, so a corrupt line / invalid transform output / symlink escape in any file aborts the whole migration before it touches disk (whole-store abort-safety, strictly stronger than the prior per-file abort). - reap orphan *.jsonl.tmp at the START of a run (clear a prior crash's debris before planning), not after. - sharpen the idempotency docstrings (migrate.py / _to_envelope): the consumer's transform need not be idempotent — already-canonical lines are kept verbatim, resting on the Envelope round-trip being a stable fixpoint (the thing the review misread). Tests (+3, 22 total in the migrate suite; 157 pass overall): - test_envelope_round_trip_is_a_fixpoint — guards the foundation of the contract - test_non_idempotent_transform_is_applied_exactly_once — a marker-stamping transform runs once; the 2nd run keeps the canonical line verbatim - test_whole_store_validation_aborts_before_any_write — a corrupt 2nd file leaves the (sorted-first) first file untouched Gate: rubric 26/26, black/isort/flake8/bandit/markdownlint clean, files.py 98% / migrate.py 100%. Version stays 0.6.0 (unreleased); CHANGELOG entry amended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AyLu1fvaV2Ys1ZQmUUeGAc
|
/agentic_review |
Code Review by Qodo
Context used✅ Compliance rules (platform):
15 rules 1.
|
PR Summary by QodoAdd store migration endpoint with atomic, idempotent JSONL rewrites (files backend) Description
Diagram
High-Level Assessment
Files changed (15)
|
…odo PR #9) Triages the two findings from the Qodo agentic review of PR #9; both were real gaps against the repo's own error contract (CLAUDE.md: code 2 for environment/setup faults, structured CliError for every failure, no generic code-1 "unexpected" wrap for known failure modes) — FIX, not pushback. 1. Raw OSError -> code-2 CliError (Qodo "_atomic_write re-raises OSError", Rule 1077531). A scope file that can't be read (permissions) or written (full disk, denied permission, failed os.replace) was re-raised as a bare OSError and caught by the dispatcher's last-resort wrapper as EXIT_USER_ERROR (1) "unexpected: OSError ..." with a "file a bug" remediation. Now the shared _atomic_write converts the fault (after temp cleanup) to CliError(code=2) with a space/permissions remediation, and migrate() wraps the read the same way — so upsert/delete/migrate all obey the contract. 2. Non-object / missing-key line -> code-2 "corrupt line" (Qodo "Non-dict JSON breaks migrate"). A valid-JSON-but-non-object line ([], "x", 1) or a record missing its id raised AttributeError/KeyError that _migrate_lines didn't catch (only JSONDecodeError), again becoming a generic code-1 wrap. Now both _migrate_lines and the day-to-day _load enforce dict-ness and map shape errors (KeyError/TypeError/AttributeError/ValueError) to a structured code-2 "corrupt line" via a shared _corrupt_line helper; a structured CliError (e.g. unknown visibility) is passed through with its own code. Dropped the now-unnecessary # type: ignore[arg-type] on _to_envelope. Same fix applied to the read path (_load), which shared the latent bug. Tests (+6, 31 in the migrate suite; 166 pass overall): non-object line, missing-id line, unreadable scope file (read OSError), write OSError surfaces code 2, self-canonicalise bad-visibility passthrough, and read-path parity for _load (malformed lines, bad visibility, blank-line skip). Gate: rubric 26/26, black/isort/flake8/bandit/markdownlint clean, files.py 100% / migrate.py 100%. Version stays 0.6.0 (unreleased); CHANGELOG amended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AyLu1fvaV2Ys1ZQmUUeGAc
|
Triage of the Qodo review — both findings: FIX (pushed in Both were real gaps against this repo's own error contract (CLAUDE.md: exit code 1.
2. Non-dict JSON breaks migrate (Bug). Correct. A valid-JSON-but-non-object line (
Verification. +6 tests (non-object line, missing- Thanks for the pass — both were worth fixing.
|
|



What this does
Closes #8 — the store-migration endpoint, the first slice of Wave 3. A consumer (eidetic-cli first) upgrades a populated on-disk store to the current
Envelopeformat by supplying only a transform — and never constructing a filesystem write path itself. The rewrite, and its path-construction concern, now live behind data-refinery's boundary (the component that owns the store layout). This is what lets eidetic delete its path-constructingmigrate_store.pyand clears its SonarCloudpythonsecurity:S2083BLOCKER without rule suppression.Surface
data_refinery.store.migrate(transform=None, *, backend="files", base_dir=None, dry_run=False)— the consumer passes a callabledict -> Envelope | None(returnNoneto drop a record) and optionally a store root it already owns.data-refinery store migrate [--backend files|mongo|neo4j] [--dry-run] [--json]— self-canonicalises data-refinery's own JSONL (no Python callable crosses argv), resolving the store dir fromDR_DATA_DIR.mongo(vectors) /neo4j(graph) raise a structuredCliError— the files-first seam, in the order requested.Load-bearing guarantees
_atomic_write(temp sibling +os.replace); a crash leaves each file fully-old or fully-new, never partial. (This also hardened the day-to-dayupsert/deletepath.)migrated: 0); already-canonical lines are kept verbatim, so even a non-idempotent transform is applied exactly once. Rests on theEnveloperound-trip being a stable fixpoint (test-guarded).visibilityaborts before any write (fails closed).CliErrorwith ahint:and a documented exit code.Verification
Three independent passes, all green:
os.replace, validation-abort).Gate: 157 tests pass (+4 skipped live-stack), agent-first rubric 26/26,
black/isort/flake8/bandit/markdownlintclean, coveragemigrate.py100% /files.py98%. Version 0.6.0.Not in this PR (rest of Wave 3)
Freezing the full pinnable verb-JSON contract + eidetic consuming the surface over the subprocess boundary (eidetic drops/thins
neo4j+pymongoand replaceseidetic migrate storewith a thin call intostore.migrate). Tracked in #1 / #3 / #8.🤖 Generated with Claude Code
https://claude.ai/code/session_01AyLu1fvaV2Ys1ZQmUUeGAc