feat(sst): Lightsail durability — auto-snapshot + Pulumi protect#7

Merged
aliasunder merged 1 commit into main from claude/lightsail-durability-options-pTSgQ
May 11, 2026
Conversation

@aliasunder

Owner

Summary

Three layers of defense against the next VM nuke, each catching a different failure class:

  • Daily Lightsail auto-snapshot at 03:00 UTC, 7-day rolling retention. Full disk image — captures ad-hoc apt installs and /etc edits made over SSH that IaC doesn't see.
  • Pulumi protect: true on the Instance — refuses any operation that would destroy or replace the VM. Closes the door that caused both of this week's nukes (key drift, then migration replace).
  • Pulumi retainOnDelete: true — orphans the AWS resource if SST ever decides to delete (stage rename) rather than destroying.

These complement the existing app-level removal: "retain" — that one only fires on sst remove; these fire on every operation.
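A minimal sketch of how the three layers might be expressed together, assuming the instance is declared through Pulumi's `aws.lightsail.Instance` with its `addOn` argument and the standard `protect` / `retainOnDelete` resource options. The interfaces and the `buildDurabilityConfig` helper are local stand-ins invented for illustration so the snippet runs without the Pulumi packages; they are not taken from this PR's diff.

```typescript
// Local stand-ins for the Pulumi shapes (assumption: the real types come
// from @pulumi/aws and @pulumi/pulumi; only the fields used here are modelled).
interface AutoSnapshotAddOn {
  type: "AutoSnapshot";
  snapshotTime: string; // "HH:00", interpreted as UTC
  status: "Enabled" | "Disabled";
}

interface DurabilityOptions {
  protect: boolean; // refuse operations that would destroy or replace the VM
  retainOnDelete: boolean; // on delete, drop from state but leave the AWS resource
}

interface InstanceDurabilityConfig {
  args: { addOn: AutoSnapshotAddOn };
  opts: DurabilityOptions;
}

// Hypothetical helper: builds the snapshot add-on plus both resource options.
function buildDurabilityConfig(snapshotHourUtc: number): InstanceDurabilityConfig {
  if (!Number.isInteger(snapshotHourUtc) || snapshotHourUtc < 0 || snapshotHourUtc > 23) {
    throw new RangeError("snapshot hour must be an integer 0-23, got " + snapshotHourUtc);
  }
  const snapshotTime = (snapshotHourUtc < 10 ? "0" : "") + snapshotHourUtc + ":00";
  return {
    args: { addOn: { type: "AutoSnapshot", snapshotTime: snapshotTime, status: "Enabled" } },
    opts: { protect: true, retainOnDelete: true },
  };
}

// 03:00 UTC, matching the schedule described in the summary.
const durability = buildDurabilityConfig(3);
console.log(JSON.stringify(durability, null, 2));
```

In a real SST config the `args` portion would be merged into the instance arguments and `opts` passed as the third constructor argument; how this repository wires that up is not shown here.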

New RECOVERY.md documents:

  • Three restore scenarios (VM alive, snapshot fresh, snapshot aged out)
  • The intentional-replace flow (unprotect → deploy → re-protect, for Phase 2 bundle upgrade etc.)
  • SST state reconciliation paths after a restore
  • Auth implications after any restore (JWTs survive 24h; refresh tokens carry over if snapshot fresh)

Cost impact: ~$0.05/mo for snapshot storage at typical usage (~1 GB used, daily incremental deltas).
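The estimate can be reproduced with a rough model, sketched below. It assumes a snapshot storage rate of 5 US cents per GB-month and treats stored data as one full copy of the used disk plus one delta per additional retained snapshot. Both the rate constant and the storage model are assumptions for illustration, not figures taken from this PR.

```typescript
// Assumed rate: 5 US cents per GB-month of stored snapshot data.
const ASSUMED_CENTS_PER_GB_MONTH = 5;

// Hypothetical estimator: one full copy of the used disk, plus one
// incremental delta for each additional retained snapshot.
function estimateMonthlySnapshotCents(
  usedGb: number,
  dailyDeltaGb: number,
  retainedSnapshots: number
): number {
  if (usedGb < 0 || dailyDeltaGb < 0 || retainedSnapshots < 0) {
    throw new RangeError("inputs must be non-negative");
  }
  const storedGb = usedGb + dailyDeltaGb * Math.max(retainedSnapshots - 1, 0);
  return Math.round(storedGb * ASSUMED_CENTS_PER_GB_MONTH);
}

// ~1 GB used, small daily deltas, 7 retained snapshots.
console.log(estimateMonthlySnapshotCents(1, 0.01, 7)); // 5 (cents)
```

Under this model the bill scales linearly with used disk size, so a larger boot disk footprint would raise the figure proportionally.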

Out of scope (deliberately)

  • OAuth DB backup. Re-auth across MCP clients is accepted as a minor inconvenience.
  • Pre-deploy CI snapshot step. Daily auto-snapshots cover the RPO window adequately.
  • Attached block storage. Snapshots already capture the whole boot disk.

OAuth refresh-token sliding expiry (60-day window) will land as a separate PR.

Test plan

  • prettier --check . clean
  • eslint . clean
  • vitest run — 151/151 tests pass
  • tsc compiles
  • Husky pre-commit hook passed on staged files
  • After merge + deploy: aws lightsail get-auto-snapshots --resource-name vault-cortex-<stage> shows a schedule (within 24h, shows a snapshot)
  • Confirm sst deploy after a userData tweak fails with a protected-resource error (revert tweak after verification)
  • One-time restore drill from RECOVERY.md → record RTO
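The post-deploy snapshot check could be automated along these lines. This is a sketch that assumes the JSON printed by `aws lightsail get-auto-snapshots` has already been parsed, and that each entry carries a `date` (YYYY-MM-DD) and a `status` field. Only that subset is modelled; the `hasFreshSnapshot` helper and the sample resource name are hypothetical.

```typescript
// Assumed subset of the parsed `aws lightsail get-auto-snapshots` output.
interface AutoSnapshotEntry {
  date: string; // e.g. "2026-05-12"
  status: string; // e.g. "Success", "Failed", "InProgress"
}

interface AutoSnapshotsOutput {
  resourceName: string;
  autoSnapshots: AutoSnapshotEntry[];
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical check: true if at least one successful snapshot is dated
// within `maxAgeDays` of the current UTC date.
function hasFreshSnapshot(
  output: AutoSnapshotsOutput,
  now: Date,
  maxAgeDays: number = 1
): boolean {
  const todayUtc = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  return output.autoSnapshots.some((snap) => {
    if (snap.status !== "Success") return false;
    const snapUtc = Date.parse(snap.date + "T00:00:00Z");
    if (isNaN(snapUtc)) return false;
    const ageDays = (todayUtc - snapUtc) / MS_PER_DAY;
    return ageDays >= 0 && ageDays <= maxAgeDays;
  });
}

// Sample data for illustration only; the resource name is a placeholder.
const sampleOutput: AutoSnapshotsOutput = {
  resourceName: "vault-cortex-example",
  autoSnapshots: [{ date: "2026-05-12", status: "Success" }],
};
console.log(hasFreshSnapshot(sampleOutput, new Date("2026-05-12T10:00:00Z"))); // true
```

Comparing whole UTC dates rather than exact timestamps keeps the check tolerant of the snapshot landing at any point within its scheduled hour.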

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C


Generated by Claude Code

Layers stacked, in order of intervention:

- Daily Lightsail auto-snapshot at 03:00 UTC (23:00 ET), 7-day rolling
  retention. Captures the full boot disk, including ad-hoc apt installs
  and /etc edits made over SSH — things IaC doesn't see.
- Pulumi `protect: true` on the Instance refuses any operation that
  would destroy or replace the VM. Closes the door that caused two
  unwanted nukes this week (key-drift, then the migration replace).
- Pulumi `retainOnDelete: true` orphans the AWS resource if SST ever
  does decide to delete (stage rename, etc.) rather than destroying.

These complement the existing app-level `removal: "retain"`: that one
only fires on `sst remove`, whereas these fire on every operation. The
overlap between the SST state file and `protect` is intentional defense
in depth.

New RECOVERY.md at repo root documents the three restore scenarios
(VM alive, VM gone but snapshot fresh, VM gone and snapshot aged
out), the intentional-replace flow (unprotect → deploy → re-protect),
the SST state reconciliation paths, and the auth implications after
any restore.

Cost impact: ~$0.05/mo for snapshot storage at typical usage (~1 GB
used, daily incremental deltas).

Out of scope (deliberately): OAuth DB backup, pre-deploy CI snapshot
step, attached block storage. Re-auth across MCP clients is accepted
as a minor inconvenience.

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
aliasunder merged commit 5fa030c into main May 11, 2026
1 check passed
aliasunder pushed a commit that referenced this pull request May 11, 2026
… entries

ARCHITECTURE.md:
- Auth method table: "no-expiry refresh" → "60-day sliding refresh"
- Token storage section: add "Refresh token expiry" paragraph explaining
  sliding window, the schema column, and the self-cleanup behavior
- Key Decisions table: add row for the 60-day sliding choice

README.md:
- "Connecting via OAuth" step 5/6: refresh token is no longer "no expiry";
  re-auth happens after a wipe OR >60 days dormant; each use resets

CHANGELOG.md:
- Add [Unreleased] section with entries for both PR #7 (durability) and
  PR #8 (sliding expiry, Luxon refactor, JWT tests)

The auth-flow mermaid diagram was already accurate (the "24h cycle" note
refers to the access token cycle, not the refresh token).

Out of scope: ARCHITECTURE.md doesn't yet mention auto-snapshots / protect
/ retainOnDelete / RECOVERY.md from PR #7. That's a separate cleanup PR
since PR #7 is already merged and these are durability-domain edits.
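
The 60-day sliding refresh window described in the ARCHITECTURE.md and README.md entries above can be sketched as follows. This is an illustration under stated assumptions: the `RefreshTokenRow` shape, the `expiresAtMs` field, and the `useRefreshToken` helper are hypothetical names, and plain millisecond arithmetic stands in for the Luxon-based code the commit mentions.

```typescript
// 60-day sliding window, in milliseconds.
const REFRESH_WINDOW_MS = 60 * 24 * 60 * 60 * 1000;

// Hypothetical row shape; the real schema column name is not shown in this PR.
interface RefreshTokenRow {
  token: string;
  expiresAtMs: number; // epoch milliseconds
}

// Returns the row with its expiry pushed out by a full window on each
// successful use, or null once the token has been dormant past its expiry
// (at which point the client has to re-authenticate).
function useRefreshToken(row: RefreshTokenRow, nowMs: number): RefreshTokenRow | null {
  if (nowMs >= row.expiresAtMs) {
    return null;
  }
  return { token: row.token, expiresAtMs: nowMs + REFRESH_WINDOW_MS };
}

const issuedAtMs = Date.UTC(2026, 4, 11); // months are 0-based, so 4 = May
const tokenRow: RefreshTokenRow = { token: "example", expiresAtMs: issuedAtMs + REFRESH_WINDOW_MS };

// Used after 30 days: still valid, and the window resets from the time of use.
const afterThirtyDays = useRefreshToken(tokenRow, issuedAtMs + 30 * 24 * 60 * 60 * 1000);
console.log(afterThirtyDays !== null);

// Untouched for 61 days: expired.
console.log(useRefreshToken(tokenRow, issuedAtMs + 61 * 24 * 60 * 60 * 1000) === null);
```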

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
aliasunder pushed a commit that referenced this pull request May 11, 2026
PR #7 added Lightsail auto-snapshot, Pulumi protect + retainOnDelete,
and RECOVERY.md but didn't update ARCHITECTURE.md. Bundled here at
user's request.

- New "Durability" subsection under Infrastructure: four-layer table
  (removal:retain / protect / retainOnDelete / auto-snapshot) with the
  what-it-does and where-it-lives columns. Explicit note that the
  auto-snapshot is the only layer that protects against AWS-side and
  in-VM failure modes; the IaC seatbelts only cover Pulumi-driven
  replacement. Points at RECOVERY.md for the actual restore steps.
- Cost table: new row for snapshot storage (~$0.50–1.50/mo at typical
  used-disk size), Phase 1 total $12 → $13/mo, Phase 2 total $26 → $27/mo.
- Key Decisions table: rows for the auto-snapshot choice (native
  primitive over hand-rolled cron+S3) and protect+retainOnDelete
  (seatbelt over replaceOnChanges gymnastics).

https://claude.ai/code/session_012VFVHWsKEQLJFE3SNguq9C
aliasunder deleted the claude/lightsail-durability-options-pTSgQ branch May 19, 2026 04:33
2 participants