feat: add resilient agent shim with exponential backoff retry #81

Open
hippoley wants to merge 1 commit into Yuan-lab-LLM:main from hippoley:feat/resilient-agent-shim

@hippoley
Contributor

Adds a lightweight Node.js shim that runs inside each openclaw pod and maintains the agent control plane connection. Key features:

  • Automatic registration with ClawManager on pod startup
  • 30-second heartbeat interval with runtime status reporting
  • Error detection: throws on non-2xx HTTP responses and on responses reporting success=false
  • Persistent re-registration with exponential backoff (5s → 120s cap)
  • Graceful shutdown on SIGTERM/SIGINT
  • Waits for openclaw gateway health before registering
  • Zero external dependencies (uses built-in fetch + child_process)
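
The backoff and error-detection bullets above can be sketched as follows. This is a minimal illustration, not the shim's actual code: the names `backoffDelay` and `assertOk` are hypothetical, and the doubling factor is an assumption, since the description only states the 5s starting delay and the 120s cap.

```javascript
// Hypothetical constants mirroring the figures quoted in the description.
const BASE_DELAY_MS = 5_000;
const MAX_DELAY_MS = 120_000;

// Delay before retry number `attempt` (0-based).
// Assumption: the delay doubles on each attempt until it reaches the cap.
function backoffDelay(attempt) {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Treat both HTTP-level and application-level failures as errors.
// `status` is the HTTP status code; `body` is the parsed response payload.
function assertOk(status, body) {
  if (status < 200 || status >= 300) {
    throw new Error(`unexpected HTTP status ${status}`);
  }
  if (body && body.success === false) {
    throw new Error('response reported success=false');
  }
  return body;
}
```

Under these assumptions the retry delays run 5s, 10s, 20s, 40s, 80s and then stay at 120s.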

The shim is designed to be COPY'd into the openclaw container image and launched by the entrypoint wrapper (clawmanager-agent-entrypoint.sh) when CLAWMANAGER_AGENT_ENABLED=true.

This solves the problem where agents permanently die after transient API errors or session expiry, requiring manual pod restarts to recover.
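
To illustrate the recovery behaviour, here is a hedged sketch of a re-registration loop that keeps retrying instead of exiting on the first failure. `registerWithRetry`, its option names, and the injected `register` and `sleep` callbacks are all hypothetical; the real shim may structure this differently.

```javascript
// Default sleep helper built on setTimeout; no external dependencies.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `register` until it succeeds or `signal` is aborted.
// Assumption: the delay starts at 5s, doubles per failure, and caps at 120s.
async function registerWithRetry(
  register,
  { baseMs = 5_000, maxMs = 120_000, sleep = defaultSleep, signal } = {},
) {
  let attempt = 0;
  while (!signal?.aborted) {
    try {
      return await register();
    } catch (err) {
      const delay = Math.min(baseMs * 2 ** attempt, maxMs);
      attempt += 1;
      console.error(`registration failed (${err.message}); retrying in ${delay} ms`);
      await sleep(delay);
    }
  }
  throw new Error('aborted before registration succeeded');
}
```

A caller could pass `new AbortController().signal` and call `abort()` from a SIGTERM or SIGINT handler so the loop stops retrying during shutdown. The abort is only observed between attempts, so an in-flight sleep still runs to completion in this sketch.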
