feat(gastown): speed up mayor session startup #3122
Merged
kilo-code-bot[bot] merged 7 commits into gastown-staging on May 7, 2026
added 5 commits
May 7, 2026 21:33
Add mayor SDK server prewarming to bootHydration so the mayor's kilo serve instance is already running when the user's first /agents/start arrives after a container restart. Previously, the mayor was only resumed if it was in the registry (running/starting at shutdown), but idle-stop and stream-error teardowns leave the mayor unregistered.
- Export mayorWorkdirForTown() from agent-runner.ts
- Add prewarmMayorSDK() to process-manager.ts that fetches the mayor agent ID from a new worker endpoint, hydrates kilo.db from KV snapshot, and starts the SDK server
- Add GET /api/towns/:townId/mayor-id endpoint to gastown.worker.ts (uses authMiddleware like container-registry/db-snapshot)
- Add getMayorAgentId() RPC method to Town.do.ts
- Add warm-cache detection in startAgentImpl: log phaseMs: 0 and prewarmed: true when the SDK instance was already cached
- bootHydration no longer returns early on empty registry so the mayor prewarm always runs
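A minimal sketch of the boot flow this commit describes, under stated assumptions: the real signatures in process-manager.ts are not shown in this PR text, so `PrewarmDeps`, `prewarmMayorSketch`, and `bootHydrationSketch` are hypothetical names used only for illustration. It shows two points from the message: the prewarm runs even when the registry is empty, and a prewarm failure is logged rather than propagated.

```typescript
// Hypothetical dependency bundle; the real code calls a worker endpoint,
// a KV snapshot, and the SDK server directly.
type PrewarmDeps = {
  fetchMayorAgentId: (townId: string) => Promise<string | null>;
  hydrateDb: (agentId: string) => Promise<void>;
  startSdkServer: (agentId: string) => Promise<void>;
  log: (msg: string, fields?: Record<string, unknown>) => void;
};

type PrewarmResult =
  | { status: "prewarmed"; agentId: string; durationMs: number }
  | { status: "skipped"; reason: string }
  | { status: "failed"; error: string };

// Best-effort: every failure is caught and reported, never thrown to the caller.
async function prewarmMayorSketch(townId: string, deps: PrewarmDeps): Promise<PrewarmResult> {
  const startedAt = Date.now();
  try {
    const agentId = await deps.fetchMayorAgentId(townId);
    if (!agentId) return { status: "skipped", reason: "no mayor agent for town" };
    await deps.hydrateDb(agentId); // restore kilo.db before the server boots
    await deps.startSdkServer(agentId); // server is cached for the first /agents/start
    return { status: "prewarmed", agentId, durationMs: Date.now() - startedAt };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    deps.log("mayor prewarm failed", { townId, error });
    return { status: "failed", error };
  }
}

async function bootHydrationSketch(
  townId: string,
  registry: string[],
  resume: (agentId: string) => Promise<void>,
  deps: PrewarmDeps,
): Promise<PrewarmResult> {
  // Existing behaviour: resume whatever was registered at shutdown.
  for (const agentId of registry) await resume(agentId);
  // No early return on an empty registry, so the prewarm always runs.
  return prewarmMayorSketch(townId, deps);
}
```

Returning a result object instead of throwing keeps the "never block boot" property visible in the type: callers can log the outcome but have nothing to catch.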
Instead of only invalidating the getMayorStatus query after ensureMayor succeeds (which forces a 3s polling wait before useXtermPty can start connecting), seed the React Query cache directly from the mutation result. The agentId and sessionStatus are already available in the ensureMayor response, so the terminal can begin connecting within ~50ms instead of waiting for the next poll tick. Still invalidate after seeding so the next poll catches up to authoritative state.
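The seed-then-invalidate order can be sketched as below. This is an illustration, not the component code: `QueryCacheLike` is a hypothetical stand-in that mirrors the `setQueryData` and `invalidateQueries` methods of a TanStack Query client so the example runs without the library, and the `MayorStatus` shape is assumed. Setting `sessionId` to the agent ID follows the reviewer note that getMayorStatus reports `mayor.id` for both.

```typescript
// Assumed shape of the getMayorStatus payload; the real type may carry more fields.
type MayorStatus = { agentId: string; sessionId: string; sessionStatus: string };
type EnsureMayorResult = { agentId: string; sessionStatus: string };

// Hypothetical minimal cache interface, modelled on a React Query client.
interface QueryCacheLike {
  setQueryData(queryKey: readonly unknown[], data: unknown): void;
  invalidateQueries(filters: { queryKey: readonly unknown[] }): void | Promise<void>;
}

function seedMayorStatus(
  cache: QueryCacheLike,
  queryKey: readonly unknown[],
  data: EnsureMayorResult,
): MayorStatus {
  const seeded: MayorStatus = {
    agentId: data.agentId,
    sessionId: data.agentId, // same value the authoritative query returns
    sessionStatus: data.sessionStatus,
  };
  // 1. Seed first so subscribers see the agent ID right away.
  cache.setQueryData(queryKey, seeded);
  // 2. Then invalidate so the next fetch replaces the seed with server state.
  void cache.invalidateQueries({ queryKey });
  return seeded;
}
```

In the mutation's success handler this would be called with the key produced by trpc.gastown.getMayorStatus.queryKey({ townId }), so the seeded entry and the polled query share one cache slot.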
When the container reports the mayor as 'running'/'starting' but the SDK instance has no serverPort or sessionId (torn down after stream errors or drain), _ensureMayor now falls through to a fresh dispatch instead of returning early. This eliminates the 'refresh fixes it' class of failures where the PTY gets a 404 because there's no SDK port to attach to. Also extend checkAgentContainerStatus to surface serverPort and sessionId from the container's agent status response.
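A sketch of the liveness check and the resulting decision, assuming a simplified status shape. `ContainerAgentStatus`, `isSdkAlive`, and `decideEnsureMayor` are illustrative names; only the rule itself (serverPort > 0 and a truthy sessionId) and the outcome labels come from this PR. The short_circuit_idle outcome is left out because its condition is not described here.

```typescript
// Assumed subset of what checkAgentContainerStatus surfaces.
type ContainerAgentStatus = {
  status: string;
  serverPort?: number | null;
  sessionId?: string | null;
};

type EnsureDecision = "short_circuit_warm" | "sdk_dead_redispatch" | "fresh_dispatch";

// The SDK counts as alive only with a positive port and a non-empty session ID.
function isSdkAlive(s: ContainerAgentStatus): boolean {
  return typeof s.serverPort === "number" && s.serverPort > 0 && Boolean(s.sessionId);
}

function decideEnsureMayor(s: ContainerAgentStatus | null): EnsureDecision {
  // Unknown agent or container unavailable: nothing to short-circuit on.
  if (!s) return "fresh_dispatch";
  const active = s.status === "running" || s.status === "starting";
  if (!active) return "fresh_dispatch";
  // "running" without a usable SDK is the case that used to return early.
  return isSdkAlive(s) ? "short_circuit_warm" : "sdk_dead_redispatch";
}
```

For example, `decideEnsureMayor({ status: "running", serverPort: 0, sessionId: "s1" })` yields "sdk_dead_redispatch", which is the path that previously left the PTY with nothing to attach to.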
Add three Analytics Engine event streams to measure the impact of the mayor startup optimizations:
1. agent.startup_phase: emitted for the db_hydrated, sdk_ready, and session_created phases. Includes elapsedMs and phaseMs so we can compute P50/P95 per town. The sdk_ready event includes phaseMs: 0 when the SDK was prewarmed (warm-cache hit).
2. mayor.prewarm_complete: emitted when the mayor SDK server is prewarmed during bootHydration, with durationMs.
3. mayor.ensure_decision: tracks the _ensureMayor decision tree: short_circuit_warm, short_circuit_idle, sdk_dead_redispatch, or fresh_dispatch. Measures the rate of the SDK-dead case that Change 3 fixes.
Container-side events are proxied to AE via a new POST /api/towns/:townId/container-events worker endpoint, since the container can't call writeEvent directly.
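To make the elapsedMs/phaseMs distinction concrete, here is a hedged sketch of building an agent.startup_phase event and summarizing samples. The event field layout beyond elapsedMs, phaseMs, and prewarmed is assumed, and `percentile` uses the nearest-rank definition as one reasonable choice; the PR does not say how P50/P95 are computed in the analytics layer.

```typescript
type StartupPhase = "db_hydrated" | "sdk_ready" | "session_created";

// Assumed event layout; townId is included here only to allow per-town grouping.
type StartupPhaseEvent = {
  event: "agent.startup_phase";
  townId: string;
  phase: StartupPhase;
  elapsedMs: number; // time since startup began
  phaseMs: number; // time spent in this phase alone
  prewarmed?: boolean;
};

function startupPhaseEvent(
  townId: string,
  phase: StartupPhase,
  t: { startedAtMs: number; phaseStartedAtMs: number; nowMs: number },
  prewarmed = false,
): StartupPhaseEvent {
  const evt: StartupPhaseEvent = {
    event: "agent.startup_phase",
    townId,
    phase,
    elapsedMs: t.nowMs - t.startedAtMs,
    // A warm-cache hit did no work in this phase, so it reports zero.
    phaseMs: prewarmed ? 0 : t.nowMs - t.phaseStartedAtMs,
  };
  if (prewarmed) evt.prewarmed = true;
  return evt;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

With sdk_ready samples of 0, 40, 3000, and 6000 ms, `percentile(samples, 50)` returns 40 and `percentile(samples, 95)` returns 6000, which shows how a few cold starts dominate the tail even when most starts are warm.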
Test that _ensureMayor falls through when the container status doesn't indicate a live SDK (no serverPort or sessionId). Covers:
1. Container not available in test env (baseline behavior)
2. sdkAlive validation logic: zero port, empty session, valid values
3. checkAgentContainerStatus returns 404 for unknown agents
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Files Reviewed (1 file)
Reviewed by gpt-5.5-2026-04-23 · 983,694 tokens
The prewarm function was copying KILO_TEST_HOME and XDG_DATA_HOME from process.env, but those are typically absent at the container level. Normal agent startup sets them per-agent via buildAgentEnv(). Without these, the prewarmed SDK server boots against the default data directory and bypasses the hydrated kilo.db snapshot. Now buildPrewarmEnv() sets KILO_TEST_HOME and XDG_DATA_HOME based on the mayorAgentId, matching what buildAgentEnv() does for regular agents.
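The fix can be pictured as deriving the two variables from the agent ID instead of inheriting them. This is a sketch only: the directory layout that buildAgentEnv() really uses is not shown in this PR, so `agentRoot` and the paths below are hypothetical, and `buildPrewarmEnvSketch` is an illustrative name.

```typescript
function buildPrewarmEnvSketch(
  baseEnv: Record<string, string | undefined>,
  mayorAgentId: string,
  agentRoot: string, // hypothetical per-agent parent directory
): Record<string, string> {
  const env: Record<string, string> = {};
  // Carry over whatever the container already defines.
  for (const [key, value] of Object.entries(baseEnv)) {
    if (value !== undefined) env[key] = value;
  }
  // Always set the per-agent values; at container level they are typically absent,
  // and an inherited value would point at the wrong data directory.
  const home = `${agentRoot}/${mayorAgentId}`; // assumed layout
  env["KILO_TEST_HOME"] = home;
  env["XDG_DATA_HOME"] = `${home}/.local/share`; // assumed layout
  return env;
}
```

The point is that both the prewarm path and the normal start path compute the same directories for the same agent, so the prewarmed server opens the hydrated kilo.db rather than a default one.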
…config mismatch on cache hit
Prewarm now generates KILO_CONFIG_CONTENT/OPENCODE_CONFIG_CONTENT using buildKiloConfigContent() with the kilocode token and default models instead of copying them from process.env (where they're absent on cold start). When ensureSDKServer() finds a cached instance whose config differs from the incoming env, it evicts the old server and creates a new one so the SDK picks up the correct model/provider config. Also extracts PERSIST_ENV_KEYS to module-level and updates process.env for those keys on cache hit when configs match.
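The evict-on-mismatch rule can be sketched with a small in-memory cache. This is not the ensureSDKServer() implementation: `SdkServerCacheSketch`, `SdkInstance`, and the synchronous `start` callback are hypothetical, and the comparison is reduced to a single KILO_CONFIG_CONTENT string for brevity.

```typescript
type SdkInstance = { port: number; configContent: string; close: () => void };
type StartFn = (agentId: string, env: Record<string, string>) => SdkInstance;
type EnsureOutcome = "created" | "cache_hit" | "evicted_and_recreated";

class SdkServerCacheSketch {
  private readonly instances = new Map<string, SdkInstance>();
  private readonly start: StartFn;

  constructor(start: StartFn) {
    this.start = start;
  }

  ensure(agentId: string, env: Record<string, string>): { instance: SdkInstance; outcome: EnsureOutcome } {
    const incoming = env["KILO_CONFIG_CONTENT"] ?? "";
    const cached = this.instances.get(agentId);
    // Same config: reuse the prewarmed server as-is.
    if (cached && cached.configContent === incoming) {
      return { instance: cached, outcome: "cache_hit" };
    }
    // Different config: the cached server would run with stale model/provider
    // settings, so shut it down before starting a replacement.
    if (cached) {
      cached.close();
      this.instances.delete(agentId);
    }
    const instance = this.start(agentId, env);
    this.instances.set(agentId, instance);
    return { instance, outcome: cached ? "evicted_and_recreated" : "created" };
  }
}
```

Eviction trades the prewarm benefit for correctness in the mismatch case, which is why the prewarm generates its config the same way a normal start does: matching configs keep the cache hit rate high.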
Summary
Three independent optimizations to speed up mayor session startup and eliminate the "connection timeout on first nav" class of failures:
1. Prewarm mayor SDK server in bootHydration: After the registry-based agent resume loop, eagerly hydrate the mayor's kilo.db and start its SDK server even if the mayor wasn't in the registry (e.g. it was idle-stopped or torn down after a stream error). This collapses the sdk_ready phase from 2–6 s to <50 ms on warm-restart paths. New GET /api/towns/:townId/mayor-id endpoint and POST /api/towns/:townId/container-events proxy support the container→worker communication.
2. Seed getMayorStatus cache from ensureMayor response: Instead of just invalidating the query cache (which forces a 3 s poll wait), directly populate the cache with the agentId and sessionStatus from the mutation result. The useXtermPty hook starts attempting PTY connection immediately after the mutation resolves instead of waiting for the next poll tick.
3. Detect torn-down SDK in _ensureMayor short-circuit: When the container reports the mayor as "running"/"starting" but the SDK has no serverPort or sessionId (torn down after stream errors or drain), fall through to a fresh dispatch instead of returning early. This eliminates the "refresh fixes it" class of failures.
Also includes AE telemetry events for mayor.prewarm_complete, agent.startup_phase, and mayor.ensure_decision (with outcomes: short_circuit_warm, short_circuit_idle, sdk_dead_redispatch, fresh_dispatch).
Verification
- mayor.prewarm_complete, agent.startup_phase, mayor.ensure_decision visible in analytics
Visual Changes
N/A
Reviewer Notes
- prewarmMayorSDK is best-effort (failures logged, never block boot) and runs after the existing registry resume loop completes
- Cache seeding uses trpc.gastown.getMayorStatus.queryKey({ townId }) for the cache key shape, matching the tRPC React Query convention; it still invalidates afterward so the next poll catches up to authoritative state
- sdkAlive check validates both serverPort > 0 and truthy sessionId before short-circuiting. checkAgentContainerStatus now surfaces these fields from the container's /agents/:agentId/status response
- sessionId in getMayorStatus is set to mayor.id (same as agentId), so the cache seeding sessionId: data.agentId is consistent with the authoritative response