feat(control-plane): runtime-node attach + heartbeat + version-floor #145
Draft
Sawmonabo wants to merge 7 commits into
Narrow the capabilityupdate request `healthChanges.state` field from the 5-value `NodeState` to the 2-value `RuntimeNodeHealthState` (online|degraded), the same self-reported-health enum `attach`/`heartbeat` already carry, so all three daemon-self-report surfaces are consistent. The illegal `offline`/`revoked`/`registering` self-report is now unconstructable at the schema boundary rather than runtime-rejected: `offline` is server-derived liveness-death (the staleness sweep, T3.6), `revoked` is an authority-issued trust decision (detach/admin, T3.7), and `registering -> online` is daemon-declaration-driven (T3.9); none is daemon-self-reportable (I-003-2 least-privilege). Swap the schema member NodeStateSchema -> RuntimeNodeHealthStateSchema; the single-T input-inference cast stays (mechanism now identical to the heartbeat request schema). Invert the shipped 5-value conformance test to reject, reconcile every comment that called healthChanges.state the 5-value NodeState, and flip the api-payload-contracts.md wire-shape mirror in lockstep so the doc never leads the code. The response `state: NodeState` and the response schema are unchanged: the request-narrow/response-broad asymmetry is intentional (daemon asserts narrow, server reports broad). attach/heartbeat untouched. Refs: Spec-003, Plan-003, ADR-014, ADR-018 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
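A minimal sketch of the request-narrow/response-broad split described in this commit, assuming a hand-rolled guard instead of the real schema library. The names `RUNTIME_NODE_HEALTH_STATES`, `isRuntimeNodeHealthState`, and `parseHealthChange` are hypothetical; the actual `RuntimeNodeHealthStateSchema` in the contracts package may look quite different.

```typescript
// Hypothetical illustration only; not the contracts package's real schema.
// The two values a daemon may self-report, per the commit message.
const RUNTIME_NODE_HEALTH_STATES = ["online", "degraded"] as const;
type RuntimeNodeHealthState = (typeof RUNTIME_NODE_HEALTH_STATES)[number];

// The broader 5-value state the server reports in responses.
type NodeState = RuntimeNodeHealthState | "registering" | "offline" | "revoked";

// Type guard: true only for the two self-reportable values.
function isRuntimeNodeHealthState(value: unknown): value is RuntimeNodeHealthState {
  return RUNTIME_NODE_HEALTH_STATES.some((state) => state === value);
}

// Request-side parse: anything outside online|degraded is rejected at the
// boundary, so downstream code never sees a self-reported offline/revoked.
function parseHealthChange(input: { state: unknown }): { state: RuntimeNodeHealthState } {
  if (!isRuntimeNodeHealthState(input.state)) {
    throw new TypeError(
      `healthChanges.state must be online|degraded, got ${String(input.state)}`,
    );
  }
  return { state: input.state };
}

// Response side stays broad: a narrow value always widens to NodeState.
const reported: NodeState = parseHealthChange({ state: "online" }).state;
```

Because `RuntimeNodeHealthState` is a subset of `NodeState`, the narrow request value is assignable to the broad response type without a cast, which is one way to read the "daemon asserts narrow, server reports broad" asymmetry.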
CREATE runtime_node_attachments + runtime_node_presence (Plan-003-owned) as control-plane Postgres migration version 3, reproduced verbatim from shared-postgres-schema.md (state CHECK, composite (node_id, session_id) UNIQUE, partial-active UNIQUE for I-003-5 single-active-session, presence PK). Register v3 in the migration runner's MIGRATIONS array after v2. Co-located 0003-runtime-nodes.test.ts pins column set, both CHECK enums, both unique indexes, presence PK, FK enforcement, and runner idempotency. Forced cross-plan amendment: reconcile 5 applyMigrations-dependent tests (runtime-node-upstream-anchors, migration-shape, 0002-session-invites, presence-register-service, session-directory-service) for migration v3. Refs: Plan-003 Phase 3 T3.1, Spec-003 line 91 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AttachService (Querier-injected) admits a runtime node into a session via an atomic upsert. P1: a NULL min_client_version floor admits every daemon version with readOnly=false, landing the row at registering. P9: a cross-session second-active attach trips the idx_node_attachments_active partial-unique (23505, from either the INSERT or the offline-reactivating DO UPDATE) and is refused with the typed RuntimeNodeAttachConflictException (I-003-5). P10: a re-attach against a terminal revoked row updates zero rows and is refused with RuntimeNodeAttachRevokedException; an offline row is reactivated instead. Adds two registry-only wire codes (runtimenode.attach_conflict, runtimenode.attach_revoked) to @ai-sidekicks/contracts and error-contracts.md (HTTP 409, code+message only, no details, avoiding cross-session info-leak). The tRPC catch-arm + errorFormatter that project these to the 409 envelope are deferred to T3.4/T3.8 (the service throws at the service boundary). Preserves I-003-3 (attach never mutates session_memberships, asserted by both byte-identity and row-count) and I-003-5 (single active attachment, enforced by the partial-unique index, not application TOCTOU). Refs: Plan-003, Spec-003 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
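The two refusal paths above (P9 and P10) can be sketched as a pair of small helpers. This is an illustration under stated assumptions, not the shipped `AttachService`: the helper names `isUniqueViolation`, `toAttachError`, and `assertRowAdmitted` are hypothetical, the exception classes are stand-ins carrying only code and message, and the real service may additionally check which constraint raised the violation.

```typescript
// Hypothetical stand-ins for the typed exceptions named in the commit message.
// They carry code + message only, with no session identifiers in the payload.
class RuntimeNodeAttachConflictException extends Error {
  readonly code = "runtimenode.attach_conflict";
  constructor() {
    super("runtime node already has an active attachment");
  }
}

class RuntimeNodeAttachRevokedException extends Error {
  readonly code = "runtimenode.attach_revoked";
  constructor() {
    super("runtime node attachment is revoked");
  }
}

// SQLSTATE 23505 is Postgres's unique_violation class.
const PG_UNIQUE_VIOLATION = "23505";

// Assumption: the database driver surfaces the SQLSTATE on an error `code` field.
function isUniqueViolation(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === PG_UNIQUE_VIOLATION
  );
}

// P9: a unique violation from the upsert becomes the typed conflict exception;
// every other error passes through unchanged for the caller to rethrow.
function toAttachError(err: unknown): unknown {
  return isUniqueViolation(err) ? new RuntimeNodeAttachConflictException() : err;
}

// P10: an upsert that touched zero rows means the existing row was revoked.
function assertRowAdmitted(rowCount: number): void {
  if (rowCount === 0) {
    throw new RuntimeNodeAttachRevokedException();
  }
}
```

In a service method these would wrap the single upsert call: `toAttachError` in the `catch` arm and `assertRowAdmitted` on the returned row count, so the index rather than an application-level read-then-write decides the conflict.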
HeartbeatService (Querier-injected, mirroring AttachService) owns the runtime-node liveness axis on runtime_node_presence, distinct from the attachment-slot axis that attach/detach own. ingest upserts the presence row on the server clock with the daemon's 2-value self-report (online|degraded). A sweep-demoted node that resumes heartbeating is restored to online without passing through offline (P6 hysteresis recovery). sweepStaleness is the server-derived demotion: one idempotent, transition-only UPDATE...RETURNING that drives rows stale past 30s to degraded and past 60s to offline (Spec-003 lines 59-61). It writes ONLY the coordination record and emits no durable runtime_node.* event; those are V1.1-gated (ADR-017). STALENESS_SWEEP_INTERVAL_MS is 5s, finer than the 15s cadence, bounding detection lag to one sweep interval. The periodic scheduler and tRPC router are deferred to T3.8. 14 tests cover ingest create/update, offline self-report rejection, the degraded/offline demotions including the degraded->offline progression, hysteresis and offline->online recovery, multi-node multiplicity, sweep idempotency, and the presence-only write boundary. Refs: ADR-017, Plan-003, Spec-003 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
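The demotion rule can be modelled as a pure function, which may help when reading the sweep tests. This is only a sketch: the shipped sweep is a single SQL `UPDATE...RETURNING`, and the function `sweepTarget`, the constant names other than `STALENESS_SWEEP_INTERVAL_MS`, and the strict "older than" boundary are assumptions made for illustration.

```typescript
// Hypothetical in-memory model of the sweep's transition rule.
type PresenceState = "online" | "degraded" | "offline";

const DEGRADED_AFTER_MS = 30_000; // stale past 30s: degraded
const OFFLINE_AFTER_MS = 60_000; // stale past 60s: offline
const STALENESS_SWEEP_INTERVAL_MS = 5_000; // sweep period, finer than the 15s heartbeat cadence

// Returns the state the sweep would write, or null when no row change is due.
function sweepTarget(
  current: PresenceState,
  lastHeartbeatAtMs: number,
  nowMs: number,
): PresenceState | null {
  const ageMs = nowMs - lastHeartbeatAtMs;
  const derived =
    ageMs > OFFLINE_AFTER_MS ? "offline" : ageMs > DEGRADED_AFTER_MS ? "degraded" : null;

  if (derived === null) return null; // fresh row: the sweep never promotes, only ingest does
  if (derived === current) return null; // transition-only, so a repeat sweep is a no-op
  if (current === "offline") return null; // never walk an offline row back to degraded
  return derived;
}
```

Under this model a node that stops heartbeating moves online → degraded → offline across successive sweeps, and running the sweep twice at the same instant changes nothing the second time.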
Wire AttachService.#deriveReadOnly to compare the daemon's clientVersion against the session's min_client_version floor. A daemon at or above the floor is admitted read-write (P2); a daemon below the floor is admitted read-only (P3) — it remains joined and reads succeed, never ejected (I-003-1 / ADR-018 Decision #4). The VERSION_FLOOR_EXCEEDED write refusal on a read-only daemon's subsequent write is T3.4's. Author compareEventEnvelopeVersion in packages/contracts/src/event.ts: a hand-rolled numeric MAJOR.MINOR tuple compare (not semver), the canonical total ordering of the EventEnvelopeVersion value type. It lives in contracts because both the control-plane floor gate and the daemon's envelope negotiation (ADR-018 Decision #1) compare these values, and contracts is their only shared ancestor — a consumer-local helper would re-introduce the lexical "10" < "9" ordering bug. The raw DB floor is parsed+branded at the service boundary (never an as-cast), so a malformed floor throws at parse time instead of reaching the comparator as NaN. Refs: ADR-018, Plan-003, Spec-003 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
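A rough sketch of the comparator and the floor gate described above, assuming a regex-validated branded string. The pattern shown, `parseEventEnvelopeVersion`, `toTuple`, and the free function `deriveReadOnly` are hypothetical; the real code uses `EventEnvelopeVersionSchema.parse` and a private service method, and its exact pattern is not shown in this PR text.

```typescript
// Assumed pattern: strictly two numeric segments, MAJOR.MINOR.
const EVENT_ENVELOPE_VERSION_PATTERN = /^(\d+)\.(\d+)$/;

type EventEnvelopeVersion = string & { readonly __brand: "EventEnvelopeVersion" };

// Hypothetical parser: the only place a raw string gains the brand.
function parseEventEnvelopeVersion(raw: string): EventEnvelopeVersion {
  if (!EVENT_ENVELOPE_VERSION_PATTERN.test(raw)) {
    throw new TypeError(`malformed event envelope version: ${JSON.stringify(raw)}`);
  }
  return raw as EventEnvelopeVersion; // cast is safe only because the pattern matched
}

function toTuple(version: EventEnvelopeVersion): [number, number] {
  const match = EVENT_ENVELOPE_VERSION_PATTERN.exec(version);
  if (match === null) {
    throw new TypeError("unreachable for a correctly branded value");
  }
  return [Number(match[1]), Number(match[2])];
}

// Numeric tuple compare: MAJOR first, then MINOR. Not semver, not lexical.
function compareEventEnvelopeVersion(
  a: EventEnvelopeVersion,
  b: EventEnvelopeVersion,
): -1 | 0 | 1 {
  const [aMajor, aMinor] = toTuple(a);
  const [bMajor, bMinor] = toTuple(b);
  if (aMajor !== bMajor) return aMajor < bMajor ? -1 : 1;
  if (aMinor !== bMinor) return aMinor < bMinor ? -1 : 1;
  return 0;
}

// Floor gate: a NULL floor admits read-write (P1); below the floor is
// read-only (P3); at or above is read-write (P2). The raw DB floor is parsed,
// so a malformed value throws before any comparison happens.
function deriveReadOnly(clientVersion: EventEnvelopeVersion, rawFloor: string | null): boolean {
  if (rawFloor === null) return false;
  const floor = parseEventEnvelopeVersion(rawFloor);
  return compareEventEnvelopeVersion(clientVersion, floor) < 0;
}
```

The case that motivates the numeric compare is a two-digit segment: as plain strings `"1.10" < "1.9"` is true, whereas the tuple compare orders `1.10` above `1.9`.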
Add a P5 characterization block to the AttachService suite: two distinct runtime nodes attach to one session as co-active rows, the sessions row stays byte-for-byte unchanged, and no new session is created — multi-node coexistence without changing session identity. The shipped T3.2 path already satisfies P5 (the (node_id, session_id) conflict arbiter + the per-node active index admit multi-node-per-session, and attach never writes sessions), so this is test-only; no production change. Includes the multi-node I-003-3 complement (session_memberships untouched). Refs: Plan-003, Spec-003 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Plan-003 Phase 3 — Control-Plane Attach + Heartbeat Services + Version-Floor Enforcement. Brings tests P1–P10 green:
- `capabilityupdate.healthChanges.state` narrowed to the 2-value `RuntimeNodeHealthState` (online | degraded) daemon self-report enum; `offline`/`revoked` become unconstructable (least-privilege, I-003-2).
- `runtime_node_attachments` + `runtime_node_presence`.
- `VERSION_FLOOR_EXCEEDED` on below-floor writes (never ejected); single-active-session + reconnect reactivation; revocation terminal; multi-node coexistence without changing session identity.
- `degraded` (30s) / `offline` (60s) staleness sweep recorded as coordination-record updates, not durable events (ADR-017 §Server-Derived Runtime-Node Lifecycle Events).
- `runtimeNodeRouter` (attach/heartbeat/capabilityupdate/detach) composed onto the Plan-008-bootstrap tRPC host.
- `session_memberships` (T3.7).

Task DAG
Test plan
- `VERSION_FLOOR_EXCEEDED`; node not detached
- `degraded` (30s) → `offline` (60s) as coordination-record transitions (no durable event in V1)
- `session_memberships`
- `session_memberships` unchanged
- `offline` row
- `revoked` row is refused (revocation terminal)

Review Notes
- `CONFLICT` 409) at L1. This is structurally forced: T3.2 cannot depend on the catch-arm without cycling (T3.4 → T3.3 → T3.2). T3.2 therefore verifies at the service boundary: `AttachService.attach` throws the typed exception on cross-session-active (Postgres 23505) / revoked re-attach, and reactivates the `offline` row on reconnect. The "→ 409 envelope" assertion is deferred to T3.4 (where the catch-arm lands). T3.2 MUST NOT create `runtime-node-router.factory.ts` (T3.8 owns it). T3.3 P2/P3 and T3.4 P4 are already service-level / end-to-end respectively, so this only applies to T3.2.
- ADR-017, ADR-018, Plan-003, Spec-003.
- `applyMigrations` forces amendments to 5 test files beyond the new migration (blast radius empirically confirmed CONTAINED to control-plane). All 5 are folded into T3.1 as one atomic commit (Plan-002 Amendment 2 fix-in-place precedent). Why a Plan-003 task touches Plan-001/002 files:
  - (a) `migration-runner.test.ts` / `session-directory-service.test.ts`: version-count + anchor-array bumps (new registered version).
  - (b) `0002-session-invites.test.ts` T7: idempotency test brought fully to v1+v2+v3 so the "re-call is a no-op" assertion stays true (not a hollow `2→3` bump).
  - (c) `presence-register-service.test.ts` (tests 1&2) + `migration-shape.test.ts` (test 3 only): these I-002-3 guards broke only because they used `applyMigrations` as a v2-shortcut that now also applies v3; the fix restores their documented `v1→v2` scoping (migration-shape.test.ts:27-29; presence test-2 comment) rather than mutating a Plan-002 invariant assertion from a Plan-003 task.

  The full-schema I-002-3 carve-out is homed in the Plan-003-owned `runtime-node-upstream-anchors.test.ts` (its header note (d) prescribes exactly this), where new assertion (4) pins that the only durable presence-named table is `runtime_node_presence`: runtime-node liveness, a distinct domain from the collaborative Yjs-Awareness presence I-002-3 governs (cites Spec-003 + ADR-017 + shared-postgres-schema.md). I-002-3's teeth are preserved at full-schema scope, in-lane. Each edited Plan-001/002 file carries an in-file amendment note naming this PR.
- `runtimenode.attach_conflict` (P9, transient) + `runtimenode.attach_revoked` (P10, terminal), in `contracts/src/error.ts` + `error-contracts.md` (new §Runtime Node table, HTTP 409). No `details` shape by design: the registry-only convention (8 of the existing 409 codes are code+message), no AC needs structured details, and a conflicting-session-id detail would leak a session the caller may not access. The current `SessionRouterAisError` envelope (trpc.ts:49-53) types `details` as required (`ResourceLimitExceededDetails`) only because resource-limit is the sole projected error today. T3.4/T3.8 obligation (carry forward, do not lose): when wiring the runtime-node router catch-arm + reusing the shared `t` `errorFormatter`, (a) evolve the envelope to accept code+message-only errors (make `details` optional, or give the runtime-node router its own envelope), and (b) the formatter then matches 4 typed exceptions (resource-limit + version-floor + the 2 attach-conflict), at/above the `trpc.ts:27-29` + `sessions/errors.ts:14-16` documented "3+ branches → `AisWireException` base class" refactor trigger, so do the base-class refactor at that point. T3.2 itself stays at the service boundary (throwables only; no formatter wiring).
- `compareEventEnvelopeVersion(a, b): -1 | 0 | 1` in `contracts/src/event.ts`, co-located with the `EventEnvelopeVersion` value type it orders (completing the type's API, not speculative generalization). Placement = dependency direction: the comparator is consumed in-PR by T3.3 (below-floor read-only verdict) and re-derived by T3.4 (below-floor write-refusal; there is no persisted `read_only` column, so the verdict is recomputed at write time from the attachment `client_version` + the session floor), and per ADR-018 §6/§7/§10 the daemon also orders `EventEnvelopeVersion` values for envelope negotiation + upcaster keying. Contracts is the only shared ancestor of both packages; a control-plane-local helper would force the daemon to depend upward into control-plane or re-implement the compare (the lexical `"10" < "9"` bug, twice). Hand-rolled, not the `semver` lib: the type is strictly 2-segment MAJOR.MINOR (`EVENT_ENVELOPE_VERSION_PATTERN`), so a numeric tuple compare is trivially correct while `semver` needs `.coerce()` padding, a new control-plane dependency, and irrelevant patch/prerelease/range semantics. Floor-bypass guard: the comparator takes brand-validated inputs (the brand is the proof of well-formedness); the only unbranded input is the DB floor (`min_client_version`), read through `EventEnvelopeVersionSchema.parse(floor)`, never an `as` cast, so a malformed floor throws a loud data-integrity error instead of `split(".").map(Number)` yielding `NaN` (every `NaN` comparison is false → silent admit to read-write). A malformed-floor test pins that `.parse` throws. Tests: the comparator's unit tests co-locate in `session-event.test.ts` (where `EventEnvelopeVersion` is already tested), and `attach-service.test.ts` line 524's T3.2 placeholder (below-floor → `readOnly=false`) is deliberately flipped to `readOnly=true` (the T3.2 author staged it as a visible behavior change, not a silent regression).

Refs: ADR-017, Plan-003, Spec-003
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>