feat(sst): Lightsail durability — auto-snapshot + Pulumi protect #7
Merged
Layers stacked, in order of intervention:

- Daily Lightsail auto-snapshot at 03:00 UTC (23:00 ET), 7-day rolling retention. Captures the full boot disk, including ad-hoc apt installs and /etc edits made over SSH, things IaC doesn't see.
- Pulumi `protect: true` on the Instance refuses any operation that would destroy or replace the VM. Closes the door that caused two unwanted nukes this week (key-drift, then the migration replace).
- Pulumi `retainOnDelete: true` orphans the AWS resource if SST ever does decide to delete (stage rename, etc.) rather than destroying.

These complement the existing app-level `removal: "retain"`: that one only fires on `sst remove`, these fire on every operation. SST state file overlap with `protect` is intentional defense in depth.

New RECOVERY.md at repo root documents the three restore scenarios (VM alive, VM gone but snapshot fresh, VM gone and snapshot aged out), the intentional-replace flow (unprotect → deploy → re-protect), the SST state reconciliation paths, and the auth implications after any restore.

Cost impact: ~$0.05/mo for snapshot storage at typical usage (~1 GB used, daily incremental deltas).

Out of scope (deliberately): OAuth DB backup, pre-deploy CI snapshot step, attached block storage. Re-auth across MCP clients is accepted as a minor inconvenience.

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
aliasunder pushed a commit that referenced this pull request on May 11, 2026
… entries

ARCHITECTURE.md:
- Auth method table: "no-expiry refresh" → "60-day sliding refresh"
- Token storage section: add "Refresh token expiry" paragraph explaining sliding window, the schema column, and the self-cleanup behavior
- Key Decisions table: add row for the 60-day sliding choice

README.md:
- "Connecting via OAuth" step 5/6: refresh token is no longer "no expiry"; re-auth happens after a wipe OR >60 days dormant; each use resets

CHANGELOG.md:
- Add [Unreleased] section with entries for both PR #7 (durability) and PR #8 (sliding expiry, Luxon refactor, JWT tests)

The auth-flow mermaid diagram was already accurate (the "24h cycle" note refers to the access token cycle, not the refresh token).

Out of scope: ARCHITECTURE.md doesn't yet mention auto-snapshots / protect / retainOnDelete / RECOVERY.md from PR #7. That's a separate cleanup PR since PR #7 is already merged and these are durability-domain edits.

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
aliasunder pushed a commit that referenced this pull request on May 11, 2026
PR #7 added Lightsail auto-snapshot, Pulumi protect + retainOnDelete, and RECOVERY.md but didn't update ARCHITECTURE.md. Bundled here at user's request.

- New "Durability" subsection under Infrastructure: four-layer table (removal:retain / protect / retainOnDelete / auto-snapshot) with the what-it-does and where-it-lives columns. Explicit note that the auto-snapshot is the only layer that protects against AWS-side and in-VM failure modes; the IaC seatbelts only cover Pulumi-driven replacement. Points at RECOVERY.md for the actual restore steps.
- Cost table: new row for snapshot storage (~$0.50–1.50/mo at typical used-disk size), Phase 1 total 12 → 13/mo, Phase 2 total 26 → 27/mo.
- Key Decisions table: rows for the auto-snapshot choice (native primitive over hand-rolled cron+S3) and protect+retainOnDelete (seatbelt over replaceOnChanges gymnastics).

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
Summary
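Three layers of defense against the next VM nuke, each catching a different failure class:

- Daily Lightsail auto-snapshot at 03:00 UTC (23:00 ET), 7-day rolling retention: captures the full boot disk, including ad-hoc apt installs and `/etc` edits made over SSH that IaC doesn't see.
- `protect: true` on the Instance: refuses any operation that would destroy or replace the VM. Closes the door that caused both of this week's nukes (key drift, then migration replace).
- `retainOnDelete: true`: orphans the AWS resource if SST ever decides to delete (stage rename) rather than destroying.

These complement the existing app-level `removal: "retain"`: that one only fires on `sst remove`; these fire on every operation.

New `RECOVERY.md` documents:

- the three restore scenarios (VM alive, VM gone but snapshot fresh, VM gone and snapshot aged out)
- the intentional-replace flow (unprotect → deploy → re-protect)
- the SST state reconciliation paths
- the auth implications after any restore

Cost impact: ~$0.05/mo for snapshot storage at typical usage (~1 GB used, daily incremental deltas).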
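As a rough illustration of how the two Pulumi options interact, the sketch below models what happens to the existing VM for a given operation. It is a toy model, not Pulumi's engine: only the option names `protect` and `retainOnDelete` come from this PR, and the `Op`, `Opts`, `Outcome`, and `outcome` names are hypothetical.

```typescript
// Toy model of the IaC layers described above; not Pulumi's actual engine.
// `protect` and `retainOnDelete` are the option names used in this PR;
// Op, Opts, Outcome, and outcome() are hypothetical, for illustration only.
type Op = "update" | "replace" | "delete";
type Outcome = "applied" | "refused" | "orphaned" | "destroyed";

interface Opts {
  protect?: boolean;
  retainOnDelete?: boolean;
}

// Returns what happens to the existing VM when `op` is attempted.
function outcome(op: Op, opts: Opts): Outcome {
  // In-place updates never remove the VM, so neither option is involved.
  if (op === "update") return "applied";
  // protect comes first: the operation fails before anything is touched.
  if (opts.protect) return "refused";
  // Without protect, retainOnDelete leaves the AWS resource running but unmanaged.
  if (opts.retainOnDelete) return "orphaned";
  return "destroyed";
}

console.log(outcome("replace", { protect: true, retainOnDelete: true })); // "refused"
console.log(outcome("delete", { retainOnDelete: true })); // "orphaned"
console.log(outcome("delete", {})); // "destroyed"
```

In this model `retainOnDelete` only matters once `protect` has been lifted, which matches the unprotect → deploy → re-protect flow: while the Instance is unprotected, a delete would orphan it instead of destroying it.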
Out of scope (deliberately)
OAuth refresh-token sliding expiry (60-day window) will land as a separate PR.
Test plan

- `prettier --check .` clean
- `eslint .` clean
- `vitest run`: 151/151 tests pass
- `tsc` compiles
- `aws lightsail get-auto-snapshots --resource-name vault-cortex-<stage>` shows a schedule (within 24h, shows a snapshot)
- `sst deploy` after a userData tweak fails with a protected-resource error (revert tweak after verification)
- `RECOVERY.md` → record RTO

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
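The "within 24h, shows a snapshot" check could be scripted along these lines. This is a minimal sketch that assumes the snapshot creation times have already been parsed from the CLI output into `Date` values; the `hasRecentSnapshot` helper and the example timestamps are hypothetical, and only the 24-hour threshold comes from the test plan.

```typescript
// Hypothetical helper for the snapshot item in the test plan. It assumes the
// creation times were already extracted from the CLI output; the 24-hour
// default mirrors the "within 24h" expectation in the test plan.
const DAY_MS = 24 * 60 * 60 * 1000;

function hasRecentSnapshot(
  createdAt: Date[],
  now: Date,
  maxAgeMs: number = DAY_MS,
): boolean {
  // True if at least one snapshot was taken no more than maxAgeMs before `now`.
  return createdAt.some((t) => {
    const age = now.getTime() - t.getTime();
    return age >= 0 && age <= maxAgeMs;
  });
}

// Illustrative timestamps only; not taken from a real instance.
const exampleNow = new Date("2026-05-12T04:00:00Z");
console.log(hasRecentSnapshot([new Date("2026-05-12T03:00:00Z")], exampleNow)); // true
console.log(hasRecentSnapshot([new Date("2026-05-10T03:00:00Z")], exampleNow)); // false
```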
Generated by Claude Code