feat(gastown): speed up mayor session startup #3122

Merged
kilo-code-bot[bot] merged 7 commits into gastown-staging from gt/toast/a8b8e704
May 7, 2026

@jrf0110 jrf0110 commented May 7, 2026

Summary

Three independent optimizations to speed up mayor session startup and eliminate the "connection timeout on first nav" class of failures:

  1. Prewarm mayor SDK server in bootHydration — After the registry-based agent resume loop, eagerly hydrate the mayor's kilo.db and start its SDK server even if the mayor wasn't in the registry (e.g. it was idle-stopped or torn down after a stream error). This collapses the sdk_ready phase from 2–6 s to <50 ms on warm-restart paths. New GET /api/towns/:townId/mayor-id endpoint and POST /api/towns/:townId/container-events proxy support the container→worker communication.

  2. Seed getMayorStatus cache from ensureMayor response — Instead of just invalidating the query cache (which forces a 3 s poll wait), directly populate the cache with the agentId and sessionStatus from the mutation result. The useXtermPty hook starts attempting PTY connection immediately after the mutation resolves instead of waiting for the next poll tick.

  3. Detect torn-down SDK in _ensureMayor short-circuit — When the container reports the mayor as "running"/"starting" but the SDK has no serverPort or sessionId (torn down after stream errors or drain), fall through to a fresh dispatch instead of returning early. This eliminates the "refresh fixes it" class of failures.
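
Change 1 can be sketched roughly as follows. This is a minimal illustration, not the code from process-manager.ts: the `BootDeps` shape and every function name in it are hypothetical stand-ins for the registry resume loop, the mayor-id endpoint, the kilo.db hydration, and the SDK server start, and it assumes that starting the SDK server is safe to call for a mayor that was already resumed.

```typescript
// Hypothetical dependency shape; the real process-manager.ts is not shown here.
type BootDeps = {
  registry: string[]; // agent IDs recorded as running/starting at shutdown
  resumeAgent: (agentId: string) => Promise<void>;
  fetchMayorAgentId: () => Promise<string | null>; // stands in for GET /api/towns/:townId/mayor-id
  hydrateDb: (agentId: string) => Promise<void>; // stands in for restoring kilo.db from the KV snapshot
  startSdkServer: (agentId: string) => Promise<void>; // assumed to be idempotent
  log: (message: string) => void;
};

// Best-effort prewarm: any failure is logged and reported as `false`,
// so it can never block or fail the boot sequence.
async function prewarmMayor(deps: BootDeps): Promise<boolean> {
  try {
    const mayorId = await deps.fetchMayorAgentId();
    if (!mayorId) return false;
    await deps.hydrateDb(mayorId);
    await deps.startSdkServer(mayorId);
    return true;
  } catch (err) {
    deps.log(`mayor prewarm failed: ${String(err)}`);
    return false;
  }
}

async function bootHydration(
  deps: BootDeps,
): Promise<{ resumed: number; prewarmed: boolean }> {
  let resumed = 0;
  for (const agentId of deps.registry) {
    await deps.resumeAgent(agentId);
    resumed++;
  }
  // No early return on an empty registry: the prewarm runs after the loop either way.
  const prewarmed = await prewarmMayor(deps);
  return { resumed, prewarmed };
}
```

With an empty registry, `bootHydration` still reaches the prewarm step, and a throwing `startSdkServer` only produces a log line and `prewarmed: false`.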

Also includes Analytics Engine (AE) telemetry events for mayor.prewarm_complete, agent.startup_phase, and mayor.ensure_decision (with outcomes: short_circuit_warm, short_circuit_idle, sdk_dead_redispatch, fresh_dispatch).

Verification

  • Manual: Navigate to a town after container restart — mayor terminal connects without timeout
  • Manual: After SDK teardown (stream error), re-navigate to town — page recovers without manual refresh
  • AE events mayor.prewarm_complete, agent.startup_phase, mayor.ensure_decision visible in analytics
  • Integration test for sdkAlive validation logic passes

Visual Changes

N/A

Reviewer Notes

  • Change 1 is the highest-impact but also lowest-risk: prewarmMayorSDK is best-effort (failures logged, never block boot) and runs after the existing registry resume loop completes
  • Change 2 uses trpc.gastown.getMayorStatus.queryKey({ townId }) for the cache key shape, matching the tRPC React Query convention — still invalidates afterward so the next poll catches up to authoritative state
  • Change 3 is the most important for UX: the sdkAlive check validates both serverPort > 0 and truthy sessionId before short-circuiting. checkAgentContainerStatus now surfaces these fields from the container's /agents/:agentId/status response
  • sessionId in getMayorStatus is set to mayor.id (same as agentId), so the cache seeding sessionId: data.agentId is consistent with the authoritative response
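
The sdkAlive check described for Change 3 might look something like the sketch below. The `AgentContainerStatus` type and both function names are assumptions made for illustration; only the rule itself (serverPort > 0 and a truthy sessionId, checked before short-circuiting on "running"/"starting") comes from the notes above.

```typescript
// Hypothetical shape of the container's /agents/:agentId/status response.
type AgentContainerStatus = {
  status: string; // e.g. "running" or "starting", as reported by the container
  serverPort?: number | null;
  sessionId?: string | null;
};

// The SDK counts as alive only with a positive port AND a non-empty session ID.
function isSdkAlive(s: AgentContainerStatus): boolean {
  return (
    typeof s.serverPort === "number" &&
    s.serverPort > 0 &&
    Boolean(s.sessionId)
  );
}

// Short-circuit only when the container says the mayor is active and the SDK
// is actually reachable; otherwise the caller falls through to a fresh dispatch.
function canShortCircuit(s: AgentContainerStatus): boolean {
  const active = s.status === "running" || s.status === "starting";
  return active && isSdkAlive(s);
}
```

A status of "running" with `serverPort: 0` or an empty `sessionId` therefore does not short-circuit, which is the torn-down case the change targets.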

John Fawcett added 5 commits May 7, 2026 21:33
Add mayor SDK server prewarming to bootHydration so the mayor's kilo
serve instance is already running when the user's first /agents/start
arrives after a container restart. Previously, the mayor was only
resumed if it was in the registry (running/starting at shutdown), but
idle-stop and stream-error teardowns leave the mayor unregistered.

- Export mayorWorkdirForTown() from agent-runner.ts
- Add prewarmMayorSDK() to process-manager.ts that fetches the mayor
  agent ID from a new worker endpoint, hydrates kilo.db from KV
  snapshot, and starts the SDK server
- Add GET /api/towns/:townId/mayor-id endpoint to gastown.worker.ts
  (uses authMiddleware like container-registry/db-snapshot)
- Add getMayorAgentId() RPC method to Town.do.ts
- Add warm-cache detection in startAgentImpl: log phaseMs: 0 and
  prewarmed: true when the SDK instance was already cached
- bootHydration no longer returns early on empty registry so the
  mayor prewarm always runs

Instead of only invalidating the getMayorStatus query after ensureMayor
succeeds (which forces a 3s polling wait before useXtermPty can start
connecting), seed the React Query cache directly from the mutation
result. The agentId and sessionStatus are already available in the
ensureMayor response, so the terminal can begin connecting within
~50ms instead of waiting for the next poll tick.
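
The seed-then-invalidate pattern can be illustrated without React Query. In the sketch below, `TinyQueryCache`, `mayorStatusKey`, and `onEnsureMayorSuccess` are hypothetical stand-ins, not the tRPC or React Query API; the PR itself works against the real React Query cache.

```typescript
type MayorStatus = { agentId: string; sessionId: string; sessionStatus: string };

// Minimal stand-in for a query cache (hypothetical, for illustration only).
class TinyQueryCache {
  private data = new Map<string, MayorStatus>();
  private stale = new Set<string>();

  setQueryData(key: string, value: MayorStatus): void {
    this.data.set(key, value);
    this.stale.delete(key);
  }
  invalidate(key: string): void {
    this.stale.add(key); // data stays readable; the next poll refetches it
  }
  get(key: string): MayorStatus | undefined {
    return this.data.get(key);
  }
  isStale(key: string): boolean {
    return this.stale.has(key);
  }
}

// Hypothetical key helper; the PR derives the key from the tRPC query instead.
function mayorStatusKey(townId: string): string {
  return `getMayorStatus:${townId}`;
}

function onEnsureMayorSuccess(
  cache: TinyQueryCache,
  townId: string,
  result: { agentId: string; sessionStatus: string },
): void {
  const key = mayorStatusKey(townId);
  // Seed first, so readers see the agent immediately instead of waiting for a poll.
  cache.setQueryData(key, {
    agentId: result.agentId,
    sessionId: result.agentId, // per the PR notes, sessionId equals the agent ID
    sessionStatus: result.sessionStatus,
  });
  // Then mark the entry stale so the next poll replaces it with authoritative state.
  cache.invalidate(key);
}
```

The ordering matters: seeding makes the value available at once, and invalidating afterwards keeps the seeded value from being treated as final.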

Still invalidate after seeding so the next poll catches up to
authoritative state.

When the container reports the mayor as 'running'/'starting' but the
SDK instance has no serverPort or sessionId (torn down after stream
errors or drain), _ensureMayor now falls through to a fresh dispatch
instead of returning early. This eliminates the 'refresh fixes it'
class of failures where the PTY gets a 404 because there's no SDK
port to attach to.

Also extend checkAgentContainerStatus to surface serverPort and
sessionId from the container's agent status response.

Add three Analytics Engine event streams to measure the impact of the
mayor startup optimizations:

1. agent.startup_phase — emitted for db_hydrated, sdk_ready, and
   session_created phases. Includes elapsedMs and phaseMs so we can
   compute P50/P95 per town. The sdk_ready event includes phaseMs: 0 when
   the SDK was prewarmed (warm-cache hit).

2. mayor.prewarm_complete — emitted when the mayor SDK server is
   prewarmed during bootHydration, with durationMs.

3. mayor.ensure_decision — tracks the _ensureMayor decision tree:
   short_circuit_warm, short_circuit_idle, sdk_dead_redispatch, or
   fresh_dispatch. Measures the rate of the SDK-dead case that Change
   3 fixes.
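
As a rough illustration of the per-town P50/P95 analysis that agent.startup_phase is meant to support, the sketch below groups phaseMs samples by town and applies a nearest-rank percentile. The `StartupPhaseEvent` shape and the function names are assumptions, and in practice this aggregation would more likely be a query against Analytics Engine than application code.

```typescript
type Phase = "db_hydrated" | "sdk_ready" | "session_created";

// Hypothetical event shape based on the fields named in the commit message.
type StartupPhaseEvent = {
  townId: string;
  phase: Phase;
  elapsedMs: number;
  phaseMs: number;
};

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function phasePercentiles(
  events: StartupPhaseEvent[],
  phase: Phase,
): Map<string, { p50: number; p95: number }> {
  // Collect phaseMs samples per town for the requested phase only.
  const byTown = new Map<string, number[]>();
  for (const e of events) {
    if (e.phase !== phase) continue;
    const samples = byTown.get(e.townId) ?? [];
    samples.push(e.phaseMs);
    byTown.set(e.townId, samples);
  }
  const out = new Map<string, { p50: number; p95: number }>();
  byTown.forEach((samples, townId) => {
    out.set(townId, { p50: percentile(samples, 50), p95: percentile(samples, 95) });
  });
  return out;
}
```

Because prewarmed starts report phaseMs: 0 for sdk_ready, a rising share of warm-cache hits pulls the P50 toward zero while the P95 keeps reflecting the remaining cold starts.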

Container-side events are proxied to AE via a new
POST /api/towns/:townId/container-events worker endpoint, since the
container can't call writeEvent directly.

Test that _ensureMayor falls through when the container status doesn't
indicate a live SDK (no serverPort or sessionId). Covers:

1. Container not available in test env (baseline behavior)
2. sdkAlive validation logic: zero port, empty session, valid values
3. checkAgentContainerStatus returns 404 for unknown agents

Comment thread services/gastown/container/src/process-manager.ts Outdated

kilo-code-bot Bot commented May 7, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (1 file)
  • services/gastown/container/src/process-manager.ts

Reviewed by gpt-5.5-2026-04-23 · 983,694 tokens

The prewarm function was copying KILO_TEST_HOME and XDG_DATA_HOME from
process.env, but those are typically absent at the container level. Normal
agent startup sets them per-agent via buildAgentEnv(). Without these, the
prewarmed SDK server boots against the default data directory and bypasses
the hydrated kilo.db snapshot.

Now buildPrewarmEnv() sets KILO_TEST_HOME and XDG_DATA_HOME based on the
mayorAgentId, matching what buildAgentEnv() does for regular agents.
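
A minimal sketch of the idea behind this fix, assuming a per-agent directory layout that is purely hypothetical (the actual paths used by buildAgentEnv() are not shown in this PR page):

```typescript
// Builds the environment for the prewarmed mayor SDK server.
// `dataRoot` and the directory layout below are assumptions for illustration.
function buildPrewarmEnv(
  baseEnv: Record<string, string | undefined>,
  mayorAgentId: string,
  dataRoot: string = "/tmp/agents",
): Record<string, string> {
  const env: Record<string, string> = {};
  // Carry over whatever the container-level environment defines.
  for (const key of Object.keys(baseEnv)) {
    const value = baseEnv[key];
    if (value !== undefined) env[key] = value;
  }
  // Always derive the data directories from the agent ID instead of copying
  // them from the container environment, where they are typically absent.
  const home = `${dataRoot}/${mayorAgentId}`;
  env.KILO_TEST_HOME = home;
  env.XDG_DATA_HOME = `${home}/.local/share`;
  return env;
}
```

Deriving both variables from the agent ID means the prewarmed server and a later regular start for the same agent resolve to the same data directory, so the hydrated kilo.db is the one the SDK opens.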
Comment thread services/gastown/container/src/process-manager.ts
Comment thread services/gastown/container/src/process-manager.ts
…config mismatch on cache hit

Prewarm now generates KILO_CONFIG_CONTENT/OPENCODE_CONFIG_CONTENT using
buildKiloConfigContent() with the kilocode token and default models
instead of copying them from process.env (where they're absent on cold
start). When ensureSDKServer() finds a cached instance whose config
differs from the incoming env, it evicts the old server and creates a
new one so the SDK picks up the correct model/provider config. Also
extracts PERSIST_ENV_KEYS to module-level and updates process.env for
those keys on cache hit when configs match.
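
The evict-on-mismatch behaviour can be sketched as below. `SdkServerCache` and `SdkInstance` are hypothetical names; the real ensureSDKServer() manages actual server processes, and the process.env update for PERSIST_ENV_KEYS on a matching cache hit is not modelled here.

```typescript
// Hypothetical stand-in for a running SDK server.
type SdkInstance = { id: number; config: string; closed: boolean };

class SdkServerCache {
  private instances = new Map<string, SdkInstance>();
  private nextId = 1;

  // Returns the cached instance for `workdir` when its config matches the
  // incoming one; otherwise evicts the stale instance and creates a new one.
  ensure(
    workdir: string,
    env: { KILO_CONFIG_CONTENT: string },
  ): { instance: SdkInstance; reused: boolean } {
    const cached = this.instances.get(workdir);
    if (cached && cached.config === env.KILO_CONFIG_CONTENT) {
      return { instance: cached, reused: true };
    }
    if (cached) {
      cached.closed = true; // stands in for shutting down the old server
    }
    const fresh: SdkInstance = {
      id: this.nextId++,
      config: env.KILO_CONFIG_CONTENT,
      closed: false,
    };
    this.instances.set(workdir, fresh);
    return { instance: fresh, reused: false };
  }
}
```

A prewarmed instance is reused only while the incoming config matches it; once a start request arrives with a different model or provider config, the cached server is replaced rather than served with stale settings.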
@jrf0110 jrf0110 force-pushed the gt/toast/a8b8e704 branch from d3c8c32 to f839607 Compare May 7, 2026 22:28
@kilo-code-bot kilo-code-bot Bot merged commit 9ca3034 into gastown-staging May 7, 2026
2 checks passed
@kilo-code-bot kilo-code-bot Bot deleted the gt/toast/a8b8e704 branch May 7, 2026 23:15