fix(gateway): wait for force-kill to complete + post-kill OS settle + bump timeouts #966
Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT (~120 seconds). The `waitForPortFree` probe binds with the default `exclusive: true`, which on Windows triggers `SO_EXCLUSIVEADDRUSE`. That option refuses to bind during TIME_WAIT, so the probe falsely reports the port as occupied for the full `timeoutMs` and `startup` aborts.

Trigger path observed in customer logs (Windows portable, OpenClaw 4.15+/4.23+, default `gateway.ports = [18789]`):

1. User saves a channel config (QQ Bot, WeChat, Feishu, WhatsApp).
2. Gateway full-restart kicks in and force-kills the existing gateway pid.
3. Old gateway's listen socket lingers in TIME_WAIT for ~120 seconds.
4. New `waitForPortFree(18789)` probe sits in the `EADDRINUSE` retry loop for the full 30 s timeout and aborts.
5. Channel runtime gets stuck in the `degraded`/`未连接` ("not connected") state until the user reboots Windows or waits ~2 minutes.

Customer log signature:

```
Port 18789 still occupied after 30000ms; aborting startup
```

Switching the probe to `exclusive: false`:

- On Windows, bypasses `SO_EXCLUSIVEADDRUSE` and lets us see the port as free during TIME_WAIT.
- On Linux/macOS, sets `SO_REUSEADDR` (equivalent semantics).
- OpenClaw's actual gateway `listen()` is unchanged. This probe's role is only to confirm the port is bindable; the real bind that follows succeeds in practice once the gateway socket is fully released by the kernel after force-kill.

Verified end-to-end on Windows portable USB (Clash TUN proxy enabled, which triggers more aggressive OS socket churn): post-fix, the port becomes available in <500 ms after force-kill instead of timing out at 30 000 ms.
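For readers who want to see the shape of such a probe, here is a minimal sketch. It is not the project's `waitForPortFree`: the names `tryBindOnce` and `waitForPortFreeSketch`, the retry interval, and the host default are assumptions made for illustration. It only shows a bind-and-release probe that passes `exclusive: false` to Node's `server.listen()`; how that option maps onto OS socket flags is taken from the description above and is not verified here.

```typescript
import * as net from 'node:net';

// Try to bind the port once. Resolves true if the bind succeeded
// (the probe socket is closed again immediately), false on any bind error.
function tryBindOnce(port: number, host: string): Promise<boolean> {
  return new Promise((resolve) => {
    const probe = net.createServer();
    probe.once('error', () => resolve(false));
    probe.once('listening', () => probe.close(() => resolve(true)));
    // `exclusive: false` is the option this change switches the probe to.
    probe.listen({ port, host, exclusive: false });
  });
}

// Hypothetical stand-in for waitForPortFree: retry the bind probe until it
// succeeds or the deadline passes. Returns true if the port became bindable.
async function waitForPortFreeSketch(
  port: number,
  timeoutMs: number,
  host = '127.0.0.1',
  retryMs = 100,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  do {
    if (await tryBindOnce(port, host)) return true;
    await new Promise((r) => setTimeout(r, retryMs));
  } while (Date.now() < deadline);
  return false;
}
```

A caller would treat a `false` result as the "still occupied" case and abort startup, as the log signature above does.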
… bump timeouts

Builds on the previous TIME_WAIT probe fix (ValueCell-ai#964). Customer logs from Windows portable installs (pc-6545 reproducer) showed the gateway still deadlocking even with `exclusive: false` on the probe socket:

    05:58:33 Gateway did not exit in time, force-killing (pid=13092)
    05:59:05 Port 18789 still occupied after 30000ms; aborting startup

Investigation revealed three root causes the previous PR did not address:

1. Force-kill was fire-and-forget. `terminateOwnedGatewayProcess` kicked off `taskkill /F /PID` then resolved its outer promise immediately on the 5 s timeout, even though the kill itself could still be queued by Windows (security software, ProcessProtection). The next gateway start would race the kill and find the port held by a still-running PID, which is exactly what we observed (PID 13092 was still LISTENING when we remoted in 5 minutes later).
2. Graceful shutdown window of 5 s was too short. OpenClaw with the dingtalk + feishu + wecom + openclaw-weixin extensions routinely takes 6-10 s to close all websocket connections and the http server cleanly. Almost every restart hit the timeout and triggered the force-kill path, which then ran into (1).
3. No post-kill settle. Even a successful taskkill takes a beat before the kernel actually reclaims the socket, so the immediate next probe in `waitForPortFree` saw it as occupied.

Changes:

- `terminateWindowsProcessTree` now awaits taskkill's callback (was fire-and-forget) and bumps its inner timeout from 5 s to 10 s.
- Graceful shutdown timeout: 5 s to 15 s.
- After both graceful exit and forced kill, sleep `postKillSettleMs` (2 s on Windows, 500 ms elsewhere) before resolving so the parent doesn't race the kernel's socket cleanup.
- `waitForPortFree` default 30 s to 60 s, and at the half-way mark it now attempts a one-shot rescue: identify the LISTENING PID via netstat and force-kill it. Catches the case where an earlier shutdown's force-kill failed entirely (blocked by AV) and an orphaned process is keeping the socket alive.
- The `findExistingGatewayProcess` call site bumped its `waitForPortFree` argument to match (30 s to 60 s).

This is a targeted patch on top of the existing `exclusive: false` probe fix in ValueCell-ai#964, not a replacement. Both are needed: the probe-side fix allows binding through TIME_WAIT, this PR ensures the listener is actually gone first.

Verified: `pnpm typecheck` clean. End-to-end on Windows portable USB build with Clash TUN proxy: previously the gateway would deadlock 1-2 times per session after channel-config saves; post-fix, no deadlock observed in 6 hours of testing.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9f1a54dcf
    const forkEnv: Record<string, string | undefined> = {
      ...baseEnvPatched,
      ...uvEnv,
      OPENCLAW_EMBEDDED_IN: 'U-Claw',
Keep embedded app marker consistent for doctor subprocess
Set OPENCLAW_EMBEDDED_IN to the same app id used by the rest of ClawX launches. This change introduces U-Claw only in runOpenClawDoctorRepair, while other ClawX-managed OpenClaw entry points use ClawX (for example electron/utils/openclaw-cli.ts and resources/cli/*). Running doctor with a different embedded marker can make OpenClaw treat the subprocess as a different host context, causing doctor-repair behavior to diverge from normal ClawX runtime behavior.
This PR builds on #964 (the `exclusive: false` probe socket fix). It assumes #964 is merged first; otherwise the diff will overlap on the comment block in waitForPortFree. Easy to rebase if you'd prefer to land them together or in a different order — just let me know.
Summary
Even with the `exclusive: false` probe fix from #964, customer logs from Windows portable installs still showed the gateway deadlocking after channel-config saves:
```
05:58:33 Gateway did not exit in time, force-killing (pid=13092)
05:59:05 Port 18789 still occupied after 30000ms; aborting startup
```
When we remoted into the customer machine 5 minutes later, PID 13092 was still LISTENING on 18789. So the probe-side fix wasn't enough — the listener was actually still alive.
Three root causes addressed
1. Force-kill was fire-and-forget
`terminateOwnedGatewayProcess` kicked off `taskkill /F /PID` then resolved its outer promise immediately on the 5 s timeout, even though the kill itself could still be queued by Windows (security software, ProcessProtection). The next gateway start would race the kill and find the port held by a still-running PID.
Fix: `terminateWindowsProcessTree` now awaits taskkill's callback (was fire-and-forget) and bumps its inner timeout from 5 s to 10 s.
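As an illustration of what "awaiting taskkill's callback" can look like, here is a minimal sketch. It is not the code from this PR: `forceKillAndWait`, `KillRunner`, and `KillOutcome` are hypothetical names, the `/T` flag is an assumption (the text above only shows `/F /PID`), and the runner is injectable so the logic can be exercised without Windows. The promise resolves only when the kill command reports back or the inner timeout (10 s by default, matching the fix) expires.

```typescript
import { execFile } from 'node:child_process';

type KillOutcome = 'killed' | 'failed' | 'timeout';
type KillRunner = (pid: number, done: (err: Error | null) => void) => void;

// Default runner: taskkill /F (force) /T (child tree, an assumption) /PID <pid>.
const taskkillRunner: KillRunner = (pid, done) => {
  execFile('taskkill', ['/F', '/T', '/PID', String(pid)], (err) => done(err));
};

// Resolve only once the kill command has called back, or after timeoutMs.
function forceKillAndWait(
  pid: number,
  timeoutMs = 10_000,
  runner: KillRunner = taskkillRunner,
): Promise<KillOutcome> {
  return new Promise((resolve) => {
    let settled = false;
    const finish = (outcome: KillOutcome) => {
      if (settled) return; // ignore a late callback after the timeout fired
      settled = true;
      clearTimeout(timer);
      resolve(outcome);
    };
    const timer = setTimeout(() => finish('timeout'), timeoutMs);
    runner(pid, (err) => finish(err ? 'failed' : 'killed'));
  });
}
```

Returning a distinct `'timeout'` outcome, rather than resolving silently, lets the caller decide whether to proceed or fall back to a later rescue attempt.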
2. Graceful shutdown window of 5 s was too short
OpenClaw with the dingtalk + feishu + wecom + openclaw-weixin extensions routinely takes 6–10 s to close all websocket connections and the http server cleanly. Almost every restart hit the 5 s timeout and triggered the force-kill path, which then ran into (1).
Fix: The graceful shutdown timeout is raised from 5 s to 15 s.
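The escalation order described here (ask for a clean exit, wait out the grace period, only then force-kill) can be sketched as follows. The `StopHooks` interface and both function names are hypothetical and exist only for this example; the 15 s default mirrors the new timeout.

```typescript
interface StopHooks {
  requestShutdown: () => void;      // ask the gateway to exit cleanly
  exited: Promise<void>;            // resolves when the process has exited
  forceKill: () => Promise<void>;   // escalation path
}

// Resolve true if `exited` settles within timeoutMs, false otherwise.
function exitedWithin(exited: Promise<void>, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    exited.then(() => {
      clearTimeout(timer);
      resolve(true);
    });
  });
}

// Graceful first; force-kill only if the grace period elapses.
async function stopWithGracePeriod(
  hooks: StopHooks,
  gracefulTimeoutMs = 15_000,
): Promise<'graceful' | 'forced'> {
  hooks.requestShutdown();
  if (await exitedWithin(hooks.exited, gracefulTimeoutMs)) return 'graceful';
  await hooks.forceKill();
  return 'forced';
}
```

With a 6-10 s shutdown, a 15 s window means the `'forced'` branch should become the exception rather than the normal path.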
3. No post-kill settle
Even a successful taskkill takes a beat before the kernel actually reclaims the socket, so the immediate next probe in `waitForPortFree` saw it as occupied.
Fix: After both graceful exit and forced kill, sleep `postKillSettleMs` (2 s on Windows, 500 ms elsewhere) before resolving so the parent doesn't race the kernel's socket cleanup.
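A minimal sketch of the settle step, assuming the delay is chosen from `process.platform`. The helper names `postKillSettleMsFor` and `terminateThenSettle` are made up for this example; only the 2 s / 500 ms values come from the description above.

```typescript
// Settle delay per platform: 2 s on Windows, 500 ms elsewhere.
function postKillSettleMsFor(platform: string): number {
  return platform === 'win32' ? 2_000 : 500;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run the termination step (graceful or forced), then pause before
// resolving so the caller's next port probe does not race socket cleanup.
async function terminateThenSettle(
  terminate: () => Promise<void>,
  settleMs: number = postKillSettleMsFor(process.platform),
): Promise<void> {
  await terminate();
  await sleep(settleMs);
}
```

Because the sleep happens inside the terminate promise, every caller gets the settle delay without having to remember it.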
Bonus: rescue path for AV-blocked kills
The `waitForPortFree` default timeout is raised from 30 s to 60 s, and at the half-way mark it now attempts a one-shot rescue: identify the LISTENING PID via netstat and force-kill it. This catches the case where an earlier shutdown's force-kill failed entirely (blocked by AV) and an orphaned process is keeping the socket alive.
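The netstat lookup in the rescue path could be sketched roughly like this. It assumes the column layout printed by Windows `netstat -ano` (Proto, Local Address, Foreign Address, State, PID); `findListeningPid` and `shouldAttemptRescue` are hypothetical helpers for illustration, not the functions in this PR.

```typescript
// Return the PID of the process LISTENING on `port`, or null if none is found.
// Input is the raw text output of `netstat -ano`.
function findListeningPid(netstatOutput: string, port: number): number | null {
  for (const line of netstatOutput.split(/\r?\n/)) {
    const [proto = '', local = '', , state = '', pidText = ''] = line.trim().split(/\s+/);
    if (proto !== 'TCP' || state !== 'LISTENING') continue;
    // Matches both "0.0.0.0:18789" and "[::]:18789", but not ":118789".
    if (!local.endsWith(`:${port}`)) continue;
    const pid = Number(pidText);
    if (Number.isInteger(pid) && pid > 0) return pid;
  }
  return null;
}

// One-shot trigger: fire once, when at least half of the timeout has elapsed.
function shouldAttemptRescue(elapsedMs: number, timeoutMs: number, alreadyTried: boolean): boolean {
  return !alreadyTried && elapsedMs >= timeoutMs / 2;
}
```

Keeping the parser a pure function of the netstat text makes it easy to check against captured customer output without a Windows machine.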