fix(store): apply repairable cloud-upgrade mutations even when a blocker is queued #388
Draft
Will2406 wants to merge 1 commit into
…ker is queued
Previously RepairCloudUpgrade short-circuited the entire repair pass as soon as
DiagnoseCloudUpgradeLegacyMutations reported any blocked finding, including in
--apply mode. This forced operators with a single unrecoverable legacy mutation
(e.g. a session row whose `directory` was stored as the empty string by an
older engram version) to perform manual SQLite surgery before any of the other
queued mutations could be repaired. The error message also reported
`Findings[0]`, which is ordered by `seq` and is often a *repairable* finding,
not the actual blocker — so users saw the wrong seq/entity in the manual-action
guidance.
This change:
- Applies the repairable subset first in `RepairCloudUpgrade` and only then
surfaces residual blockers. The returned report now carries `Applied=true`
on partial successes and explains both buckets in the message
("applied N repairable payload(s); M remain blocked: ...").
- Selects the first non-repairable finding for the manual-action message so
the seq/entity/op shown to the user identifies the actual blocker.
- Adds `entity_key=%q` to the blocker description so operators can locate the
offending row directly without cross-referencing seq → entity_key.
Tests cover the new mixed-bucket dry-run and apply paths plus the seq-ordering
guarantee.
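The flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `Finding` and `Report` types, the `describe` and `repair` helpers, and the `apply` callback are all hypothetical stand-ins, and only the message shape is taken from the description above.

```go
package main

import "fmt"

// Finding is a hypothetical, simplified stand-in for one diagnosed legacy mutation.
type Finding struct {
	Seq        int
	Entity     string
	EntityKey  string
	Op         string
	Repairable bool
}

// Report is a hypothetical, simplified repair report.
type Report struct {
	Applied bool
	Message string
}

// describe formats a finding; %q quotes the entity key so it can be searched for directly.
func describe(f Finding) string {
	return fmt.Sprintf("seq=%d entity=%s entity_key=%q op=%s", f.Seq, f.Entity, f.EntityKey, f.Op)
}

// repair applies every repairable finding first and only then reports blockers.
// The apply callback is an assumption standing in for the real per-row repair.
func repair(findings []Finding, apply func(Finding) error) (Report, error) {
	applied := 0
	var blocked []Finding
	for _, f := range findings {
		if !f.Repairable {
			blocked = append(blocked, f)
			continue
		}
		if err := apply(f); err != nil {
			return Report{}, fmt.Errorf("repair seq=%d: %w", f.Seq, err)
		}
		applied++
	}
	msg := fmt.Sprintf("applied %d repairable payload(s)", applied)
	if len(blocked) > 0 {
		// blocked holds only non-repairable findings, so blocked[0] is a real blocker.
		msg += fmt.Sprintf("; %d remain blocked: %s", len(blocked), describe(blocked[0]))
	}
	return Report{Applied: applied > 0, Message: msg}, nil
}

func main() {
	// Illustrative values only.
	findings := []Finding{
		{Seq: 9, Entity: "session", EntityKey: "ok-session", Op: "upsert", Repairable: true},
		{Seq: 524, Entity: "session", EntityKey: "distracted-antonelli-7bad05", Op: "upsert", Repairable: false},
	}
	report, err := repair(findings, func(Finding) error { return nil })
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(report.Applied, report.Message)
}
```

Partitioning before reporting is what lets a single unrecoverable row coexist with a successful partial repair in this sketch.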
Context
While investigating slow Engram responses against a long-lived project, the doctor reported:
and `engram cloud upgrade repair --project X --dry-run` returned:
The reported `seq=9` was misleading: that row was actually repairable (its local `sessions` row had a valid `directory`). The real blocker was a much later seq whose local `sessions` row had an empty `directory`, which the per-mutation evaluator at `evaluateCloudUpgradeLegacyMutationTx` correctly refused to infer.
The reproduction tree was:
`DiagnoseCloudUpgradeLegacyMutations` correctly surfaced this as `RepairableCount=179, BlockedCount=1`. But the gating in `RepairCloudUpgrade` aborted the whole pass on `BlockedCount > 0`, never giving the repairable 179 a chance to apply.
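To illustrate why the reported seq was wrong, here is a minimal sketch, assuming findings are held in a slice ordered by `seq`. The `Finding` type and the `firstBlocked` helper are hypothetical names for this example, and the seq values are illustrative:

```go
package main

import "fmt"

// Finding is a hypothetical, simplified diagnosis entry; the slice is ordered by Seq.
type Finding struct {
	Seq        int
	Repairable bool
}

// firstBlocked returns the earliest non-repairable finding, if there is one.
func firstBlocked(findings []Finding) (Finding, bool) {
	for _, f := range findings {
		if !f.Repairable {
			return f, true
		}
	}
	return Finding{}, false
}

func main() {
	findings := []Finding{{Seq: 9, Repairable: true}, {Seq: 524, Repairable: false}}
	// Indexing the slice directly picks the lowest seq, which here is repairable.
	fmt.Println("first in seq order:", findings[0].Seq)
	if f, ok := firstBlocked(findings); ok {
		fmt.Println("actual blocker:", f.Seq)
	}
}
```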
Bugs fixed
Changes
`(seq=524 entity=session entity_key="distracted-antonelli-7bad05" op=upsert)`
Tests
Added two subtests to `TestUpgradeRepairDryRunAndApply` in `internal/store/store_test.go`:
Existing tests in the same function (including `legacy mutation required fields are detected and repaired from authoritative local state`) continue to pass unchanged. Full `./internal/store/`, `./internal/diagnostic/` and `./cmd/engram/` suites pass locally on Go 1.26.3.
Risk
Repro / verification
To reproduce the original deadlock on `main`:
```bash
# In a fresh SQLite DB backing engram (ENGRAM_DB is a placeholder for its path):
sqlite3 "$ENGRAM_DB" "INSERT INTO sessions (id, project, directory, started_at) VALUES ('orphan', 'p', '', datetime('now'));"
# Plus the matching malformed sync_mutations + one repairable row.
engram cloud upgrade repair --project p --apply
# main:    applied=false, blocked. queue stuck.
# this PR: applied=true, blocked (with N remaining), repairables patched.
```
A local DB patch (`UPDATE sessions SET directory='...' WHERE id='orphan'`) is still required for the orphan itself — this PR does not invent directory values out of thin air, it only stops penalizing the recoverable rows.
Draft because I'd like a maintainer pass on the message format change before flipping to ready.