feat: Gateway heartbeat endpoint, reduced stale TTL, and optional announce webhook #147

@Jing-yilin

Summary

The AgentWorlds platform is moving to a Gateway-as-single-source-of-truth architecture for world runtime liveness. This requires three protocol-level changes in the Gateway and SDK (sections 1-3), plus a hand-written OpenAPI spec (section 4).

Spec: https://gist.github.com/Jing-yilin/c2777c4b46fe0d52692ec159ba6e5d93 (Phase 2)


1. POST /peer/heartbeat — Lightweight liveness signal

Problem: The only liveness signal today is a full POST /peer/announce (Ed25519-signed, carrying the full identity/endpoints/capabilities payload). Running it every 30s is overkill: it uses an expensive registration path as a lease-renewal path.

Solution: Add a lightweight heartbeat endpoint that only refreshes lastSeen:

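// gateway/server.mjs
peer.post("/peer/heartbeat", async (req, reply) => {
  // Tolerate a missing body so a malformed request ends in 404 instead of a thrown TypeError
  const { agentId, ts, signature } = req.body ?? {};

  // Only agents that have already announced (and so have a stored public key) may heartbeat
  const agent = registry.get(agentId);
  if (!agent?.publicKey) return reply.code(404).send({ error: "Unknown agent" });

  // The signature covers { agentId, ts } under the heartbeat domain separator
  if (!verifyWithDomainSeparator(DOMAIN_SEPARATORS.HEARTBEAT, agent.publicKey, { agentId, ts }, signature)) {
    return reply.code(403).send({ error: "Invalid signature" });
  }

  agent.lastSeen = Date.now();
  // Do NOT trigger saveRegistry(): heartbeats update memory only
  return { ok: true };
});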
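The handler above relies on verifyWithDomainSeparator(), which this issue does not define. As a rough sketch of the idea only, the following uses Node's built-in Ed25519 support to sign and verify a payload prefixed with a domain separator. The message encoding (separator plus sorted-key JSON), the base64 signature format, the use of KeyObject keys, and the ANNOUNCE separator value are all assumptions made for illustration; the real SDK helpers may differ.

```javascript
// Sketch only: encoding and key formats here are assumptions, not the SDK's actual scheme.
const { generateKeyPairSync, sign, verify } = require("node:crypto");

const DOMAIN_SEPARATORS = {
  HEARTBEAT: "aw:hb:",       // value proposed in this issue
  ANNOUNCE: "aw:announce:",  // hypothetical value, used only to show cross-domain rejection
};

// Assumption: the signed bytes are separator + JSON of the payload with keys sorted.
function encodeMessage(separator, payload) {
  const sorted = Object.fromEntries(
    Object.entries(payload).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
  );
  return Buffer.from(separator + JSON.stringify(sorted), "utf8");
}

function signWithDomainSeparator(separator, privateKey, payload) {
  // Ed25519 takes no separate digest algorithm, so the first argument is null.
  return sign(null, encodeMessage(separator, payload), privateKey).toString("base64");
}

function verifyWithDomainSeparator(separator, publicKey, payload, signature) {
  return verify(null, encodeMessage(separator, payload), publicKey, Buffer.from(signature, "base64"));
}

// Usage: a heartbeat signature verifies only under the heartbeat separator.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const payload = { agentId: "agent-1", ts: Date.now() };
const signature = signWithDomainSeparator(DOMAIN_SEPARATORS.HEARTBEAT, privateKey, payload);
const accepted = verifyWithDomainSeparator(DOMAIN_SEPARATORS.HEARTBEAT, publicKey, payload, signature);
const crossDomain = verifyWithDomainSeparator(DOMAIN_SEPARATORS.ANNOUNCE, publicKey, payload, signature);
```

The point of the separator is that a signature captured from a heartbeat cannot be replayed against another signed endpoint, because the signed bytes differ even when the payload fields are identical.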

SDK changes:

  • Add DOMAIN_SEPARATORS.HEARTBEAT = "aw:hb:" in crypto.ts
  • Add sendHeartbeat() in gateway-announce.ts
  • startGatewayAnnounce(): full announce every 10min (unchanged) + heartbeat every 30s (new)
  • createWorldServer(): automatically starts heartbeat alongside announce
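One way the two intervals listed above could fit together is sketched below. startLivenessLoop, its option names, and the injected announce/heartbeat functions are hypothetical stand-ins for startGatewayAnnounce(), the full announce, and sendHeartbeat(). The re-announce on a 404 response is an assumption about how the SDK might recover after the Gateway has pruned the agent; the issue does not specify that behavior.

```javascript
// Hypothetical helper: `announce` and `heartbeat` are injected async functions
// standing in for the SDK's full announce and sendHeartbeat().
function startLivenessLoop({ announce, heartbeat, announceMs = 10 * 60_000, heartbeatMs = 30_000 }) {
  // Swallow errors so a failed network call never crashes the timer callback.
  const safe = (fn) => Promise.resolve().then(fn).catch(() => {});

  const beat = async () => {
    const res = await heartbeat();
    // Assumption: a 404 means the Gateway no longer knows this agent,
    // so fall back to a full announce right away.
    if (res?.status === 404) await announce();
  };

  safe(announce); // register once at startup
  const announceTimer = setInterval(() => safe(announce), announceMs);
  const heartbeatTimer = setInterval(() => safe(beat), heartbeatMs);

  // Returns a stop function that cancels both intervals.
  return function stop() {
    clearInterval(announceTimer);
    clearInterval(heartbeatTimer);
  };
}
```

Keeping the two timers independent means a slow or failing announce does not delay heartbeats, and the returned stop function gives createWorldServer() a single handle to clean up on shutdown.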

2. Reduce Gateway stale TTL: 15min → 90s

Problem: With Gateway as the sole liveness source, a crashed world stays visible for up to 15 minutes. This is unacceptable for a live directory.

Solution:

const DEFAULT_STALE_TTL_MS = 90 * 1000;  // was: 15 * 60 * 1000

Persistence adjustment: With a 90s TTL and 30s heartbeats, lastSeen is updated frequently. Heartbeats should update in-memory state only, with a periodic disk snapshot every 30-60s for crash recovery (30s in the snippet below):

// Heartbeat: memory only (no saveRegistry)
// Announce: triggers saveRegistry (existing behavior)
// New: periodic snapshot every 30s for crash recovery
let _snapshotTimer = setInterval(() => {
  if (registryModified) {
    writeRegistry();
    registryModified = false; // clear the dirty flag here unless writeRegistry() already does
  }
}, 30_000);
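To make the TTL arithmetic concrete: with 30s heartbeats and a 90s TTL, a world can miss about two heartbeats before it is dropped. The sketch below shows one possible pruning pass. pruneStale is a hypothetical helper, and the registry is assumed to be a Map of agentId to a record with a lastSeen timestamp, as registry.get() in the handler suggests.

```javascript
const DEFAULT_STALE_TTL_MS = 90 * 1000;

// Hypothetical helper: removes agents whose lastSeen is older than the TTL
// and returns the ids that were removed.
function pruneStale(registry, now = Date.now(), ttlMs = DEFAULT_STALE_TTL_MS) {
  const removed = [];
  for (const [agentId, agent] of registry) {
    if (now - agent.lastSeen > ttlMs) {
      registry.delete(agentId); // deleting during Map iteration is safe in JavaScript
      removed.push(agentId);
    }
  }
  return removed;
}

// Usage with a fixed clock: world-a last reported 60s ago and survives,
// world-b last reported 91s ago and is pruned.
const now = 1_000_000;
const registry = new Map([
  ["world-a", { lastSeen: now - 60_000 }],
  ["world-b", { lastSeen: now - 91_000 }],
]);
const removed = pruneStale(registry, now); // removed is ["world-b"]
```

Passing the clock in as a parameter keeps the 90s boundary easy to cover in the "stale TTL pruning" tests from the checklist without waiting on real timers.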

3. Optional announce webhook

Problem: When AgentWorlds deploys a world via SSM, the platform needs to know when the world has successfully registered with Gateway. Currently there is no callback mechanism.

Solution: Fire a webhook on first-seen announce (edge-triggered, idempotent):

const WEBHOOK_URL = process.env.WEBHOOK_URL || null;

// In upsertAgent():
if (isFirstSeen && WEBHOOK_URL) {
  fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ event: "world.announced", agentId, worldId, ts: Date.now() }),
    signal: AbortSignal.timeout(5000),
  }).catch(() => {});  // best-effort, not blocking
}
  • No WEBHOOK_URL → no webhook fired (local dev friendly)
  • Idempotent: only on first-seen after boot or after TTL expiry
  • Best-effort: fire-and-forget, not a critical path
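The edge-triggered behavior described above depends on how isFirstSeen is computed, which the snippet leaves out. A minimal sketch follows, assuming the registry is a Map and that an entry past the stale TTL counts as first-seen even if it has not been pruned yet. The upsertAgent signature shown here and the injected notify callback (standing in for the webhook POST) are hypothetical.

```javascript
// Hypothetical shape for upsertAgent(): `notify` stands in for the webhook POST.
function upsertAgent(registry, agentId, fields, { now = Date.now(), ttlMs = 90_000, notify = () => {} } = {}) {
  const existing = registry.get(agentId);
  // Assumption: first-seen means absent from the registry, or present but past the stale TTL.
  const isFirstSeen = !existing || now - existing.lastSeen > ttlMs;
  registry.set(agentId, { ...existing, ...fields, lastSeen: now });
  if (isFirstSeen) {
    notify({ event: "world.announced", agentId, worldId: fields.worldId, ts: now });
  }
  return isFirstSeen;
}

// Usage with explicit timestamps (milliseconds):
const registry = new Map();
const events = [];
const at = (now) => ({ now, notify: (e) => events.push(e) });
upsertAgent(registry, "agent-1", { worldId: "w1" }, at(0));       // first announce: fires
upsertAgent(registry, "agent-1", { worldId: "w1" }, at(30_000));  // within TTL: silent
upsertAgent(registry, "agent-1", { worldId: "w1" }, at(200_000)); // after TTL expiry: fires again
```

Because the webhook can fire again after a Gateway restart or a TTL expiry, a receiver should treat world.announced as "seen at least once" and deduplicate on agentId or worldId if it needs exactly-once handling.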

4. Hand-written gateway/openapi.yaml

Add an OpenAPI 3.1 spec covering the 7 public Gateway endpoints:

  • GET /health
  • GET /worlds
  • GET /world/{worldId}
  • GET /agents
  • POST /peer/announce
  • POST /peer/heartbeat (new)
  • WS /ws (documented in the spec's info description, since OpenAPI does not model WebSocket endpoints)

This allows AgentWorlds (and other consumers) to generate TypeScript types from the spec instead of hand-writing interfaces that drift out of sync.


Checklist

  • Add DOMAIN_SEPARATORS.HEARTBEAT to SDK crypto.ts
  • Add sendHeartbeat() to SDK gateway-announce.ts
  • Integrate heartbeat into startGatewayAnnounce() (30s interval)
  • Integrate heartbeat into createWorldServer()
  • Add POST /peer/heartbeat to gateway/server.mjs
  • Reduce DEFAULT_STALE_TTL_MS to 90s in gateway/server.mjs
  • Add periodic disk snapshot (30s) in gateway/server.mjs
  • Heartbeat updates memory only, not disk
  • Add optional WEBHOOK_URL env + first-seen webhook in gateway/server.mjs
  • Create gateway/openapi.yaml
  • Tests for heartbeat endpoint
  • Tests for stale TTL pruning at 90s
  • Update gateway/Dockerfile with WEBHOOK_URL env documentation

Labels: enhancement