Skip to content

fix(store): apply repairable cloud-upgrade mutations even when a blocker is queued#388

Draft
Will2406 wants to merge 1 commit into
Gentleman-Programming:mainfrom
Will2406:fix/cloud-upgrade-repair-applies-repairable-first
Draft

fix(store): apply repairable cloud-upgrade mutations even when a blocker is queued#388
Will2406 wants to merge 1 commit into
Gentleman-Programming:mainfrom
Will2406:fix/cloud-upgrade-repair-applies-repairable-first

Conversation

@Will2406
Copy link
Copy Markdown

Context

While investigating slow Engram responses against a long-lived project, the doctor reported:

sync_mutation_required_fields: blocked (180 finding(s))

and engram cloud upgrade repair --project X --dry-run returned:

class: blocked
reason_code: upgrade_blocked_legacy_mutation_manual
message: manual-action-required: session payload is missing required fields (seq=9 entity=session op=upsert)
applied: false

The reported seq=9 was misleading — that row was actually repairable (its local sessions row had a valid directory). The real blocker was a much later seq whose local sessions row had an empty directory, which the per-mutation evaluator at `evaluateCloudUpgradeLegacyMutationTx` correctly refused to infer.

The reproduction tree was:

seq order entity_key local `sessions.directory` evaluator verdict
9 (lowest) manual-save-... `/Users/.../worktrees/...` repairable
524 distracted-... `""` blocked
729..1305 many sessions `/Users/.../proveido` repairable

`DiagnoseCloudUpgradeLegacyMutations` correctly surfaced this as `RepairableCount=179, BlockedCount=1`. But the gating in `RepairCloudUpgrade` aborted the whole pass on `BlockedCount > 0`, never giving the repairable 179 a chance to apply.

Bugs fixed

  1. Repair gating (`store.go` ~line 1227 before this patch): even with `apply=true`, the function returned `Class=Blocked, Applied=false` whenever a single blocked finding existed, leaving all repairable mutations queued. Operators had to perform manual SQLite surgery on the blocker before any of the recoverable mutations could be unblocked.
  2. Misleading error message: `first := legacyReport.Findings[0]` picks the lowest-seq finding regardless of repairability, so users saw guidance like "seq=9 entity=session" pointing at a row that was actually fine. The genuine blocker (seq=524, entity_key="distracted-...") was never named.
  3. No entity_key in the manual-action message: even when the seq was right, mapping seq → entity_key required SQLite access, which is a friction step for anyone trying to do the manual repair the message asks for.

Changes

  • `RepairCloudUpgrade` now applies the repairable subset first when `apply=true` and reports residual blockers afterwards. `Applied=true` on partial successes. The message describes both buckets:
    • `apply=false` mixed: `"N repairable payload(s) would apply; M would remain blocked: manual-action-required: ..."`
    • `apply=true` mixed: `"applied N repairable payload(s); M remain blocked: manual-action-required: ..."`
  • The manual-action finding is now selected by repairability, not by seq order — `for _, f := range Findings { if !f.Repairable { ... } }` — so the reported `seq/entity/op` always identifies an actual blocker.
  • The blocker description gains `entity_key=%q` so users can locate the offending row directly:
    `(seq=524 entity=session entity_key="distracted-antonelli-7bad05" op=upsert)`

Tests

Added two subtests to `TestUpgradeRepairDryRunAndApply` in `internal/store/store_test.go`:

  • `repair applies repairable mutations even when a blocker is queued ahead of them`: verifies dry-run reports both buckets, `apply=true` writes the repaired payload to disk, the orphan stays untouched, and the diagnosis after partial repair has `RepairableCount=0, BlockedCount=1`.
  • `blocker error message names the unrecoverable seq even when a lower-seq repairable exists`: inserts the repairable mutation first (so it gets a lower seq), then the orphan; asserts the error message references the orphan's seq, not the repairable's.

Existing tests in the same function (including `legacy mutation required fields are detected and repaired from authoritative local state`) continue to pass unchanged. Full `./internal/store/`, `./internal/diagnostic/` and `./cmd/engram/` suites pass locally on Go 1.26.3.

Risk

  • Behavior change: a project with both repairable and blocked mutations will now have its repairable subset mutated on `--apply` instead of bailing. This matches what the existing `applyCloudUpgradeLegacyMutationRepairs` already does internally — it iterates and skips non-repairables — and prevents the previous "all or nothing" deadlock. The blocked finding is still surfaced (the report class remains `Blocked`), so downstream tooling that gates on `Class == Blocked` keeps working.
  • Message format: the manual-action guidance now includes `entity_key=%q`. Any tooling that regex-matches the previous format `(seq=X entity=Y op=Z)` will need to accept the extended form.

Repro / verification

To reproduce the original deadlock on `main`:

```bash

In a fresh sqlite db backing engram:

INSERT INTO sessions (id, project, directory, started_at) VALUES ('orphan', 'p', '', datetime('now'));

Plus the matching malformed sync_mutations + one repairable row.

engram cloud upgrade repair --project p --apply

main: applied=false, blocked. queue stuck.

this PR: applied=true, blocked (with N remaining), repairables patched.

```

A local DB patch (`UPDATE sessions SET directory='...' WHERE id='orphan'`) is still required for the orphan itself — this PR does not invent directory values out of thin air, it only stops penalizing the recoverable rows.


Draft because I'd like a maintainer pass on the message format change before flipping to ready.

…ker is queued

Previously RepairCloudUpgrade short-circuited the entire repair pass as soon as
DiagnoseCloudUpgradeLegacyMutations reported any blocked finding, including in
--apply mode. This forced operators with a single unrecoverable legacy mutation
(e.g. a session row whose `directory` was stored as the empty string by an
older engram version) to perform manual SQLite surgery before any of the other
queued mutations could be repaired. The error message also reported
`Findings[0]`, which is ordered by `seq` and is often a *repairable* finding,
not the actual blocker — so users saw the wrong seq/entity in the manual-action
guidance.

This change:

- Applies the repairable subset first in `RepairCloudUpgrade` and only then
  surfaces residual blockers. The returned report now carries `Applied=true`
  on partial successes and explains both buckets in the message
  ("applied N repairable payload(s); M remain blocked: ...").
- Selects the first non-repairable finding for the manual-action message so
  the seq/entity/op shown to the user identifies the actual blocker.
- Adds `entity_key=%q` to the blocker description so operators can locate the
  offending row directly without cross-referencing seq → entity_key.

Tests cover the new mixed-bucket dry-run and apply paths plus the seq-ordering
guarantee.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant