fix(gateway): bypass Windows TIME_WAIT in waitForPortFree probe #964
dongsheng123132 wants to merge 1 commit into
Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT (~120 seconds). The `waitForPortFree` probe binds with the default `exclusive: true`, which on Windows triggers `SO_EXCLUSIVEADDRUSE`. That option refuses to bind during TIME_WAIT, so the probe falsely reports the port as occupied for the full `timeoutMs` and `startup` aborts.

Trigger path observed in customer logs (Windows portable, OpenClaw 4.15+/4.23+, default `gateway.ports = [18789]`):

1. User saves a channel config (QQ Bot, WeChat, Feishu, WhatsApp).
2. Gateway full-restart kicks in and force-kills the existing gateway pid.
3. Old gateway's listen socket lingers in TIME_WAIT for ~120 seconds.
4. New `waitForPortFree(18789)` probe sits in the `EADDRINUSE` retry loop for the full 30 s timeout and aborts.
5. Channel runtime gets stuck in the `degraded`/`未连接` ("not connected") state until the user reboots Windows or waits ~2 minutes.

Customer log signature:

```
Port 18789 still occupied after 30000ms; aborting startup
```

Switching the probe to `exclusive: false`:

- On Windows, bypasses `SO_EXCLUSIVEADDRUSE` and lets us see the port as free during TIME_WAIT.
- On Linux/macOS, sets `SO_REUSEADDR` (equivalent semantics).
- OpenClaw's actual gateway `listen()` is unchanged; this probe's role is only to confirm the port is bindable. The real bind that follows succeeds in practice once the gateway socket is fully released by the kernel after force-kill.

Verified end-to-end on Windows portable USB (Clash TUN proxy enabled, which triggers more aggressive OS socket churn): post-fix, the port becomes available in <500 ms after force-kill instead of timing out at 30 000 ms.
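For readers without `electron/gateway/supervisor.ts` open, here is a minimal sketch of the kind of bind-probe retry loop described above. It is an illustration under assumptions, not the project's actual code: `probeOnce`, the `intervalMs` parameter, and the default values are hypothetical; only the `listen()` options and the log message are taken from this PR.

```typescript
import * as net from "node:net";

// Hypothetical helper: attempt a single bind and report whether it succeeded.
function probeOnce(port: number, host = "127.0.0.1"): Promise<boolean> {
  return new Promise((resolve) => {
    const server = net.createServer();
    // A failed bind (for example EADDRINUSE) is reported as "not free".
    server.once("error", () => resolve(false));
    // A successful bind is released immediately before reporting "free".
    server.once("listening", () => server.close(() => resolve(true)));
    // Mirrors the listen() options set in this PR.
    server.listen({ port, host, exclusive: false });
  });
}

// Hypothetical reconstruction of the retry loop; defaults are assumptions.
async function waitForPortFree(
  port: number,
  timeoutMs = 30000,
  intervalMs = 200,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await probeOnce(port)) return;
    if (Date.now() >= deadline) {
      throw new Error(
        `Port ${port} still occupied after ${timeoutMs}ms; aborting startup`,
      );
    }
    // Back off briefly before the next bind attempt.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With this shape, a port that stays unbindable for the whole window surfaces as the error string seen in the customer logs, which is why a false "occupied" result from the probe is enough to abort startup.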
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf66f37252
```ts
// OpenClaw's actual gateway listen() is unaffected — this probe's
// role is only to confirm the port is bindable; the real bind that
// follows succeeds whether or not the OS is still in TIME_WAIT.
server.listen({ port, host: '127.0.0.1', exclusive: false });
```
Use a real TIME_WAIT workaround in port-free probe
This change is very likely a no-op for the reported Windows restart failure: in Node's `net.Server.listen`, `exclusive` already defaults to `false` and only affects cluster handle sharing, so explicitly passing `exclusive: false` does not change bind behavior for this probe. Because of that, the `EADDRINUSE` loop during the TIME_WAIT window can still occur exactly as before, so the commit does not actually fix the startup deadlock it targets.
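One possible direction for a probe that does not depend on bind semantics, sketched under assumptions: rather than binding, try to connect to the port and treat a refused connection as "no listener". `isPortListening` and its `timeoutMs` default are hypothetical names, not part of this PR or of OpenClaw, and this approach has not been validated against the Windows TIME_WAIT scenario reported here.

```typescript
import * as net from "node:net";

// Hypothetical helper (not part of this PR): resolves true if something
// accepts a TCP connection on host:port, and false if the connection is
// refused, fails with another error, or does not complete within timeoutMs.
function isPortListening(
  port: number,
  host = "127.0.0.1",
  timeoutMs = 1000,
): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    let settled = false;
    const finish = (listening: boolean) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve(listening);
    };
    // Assumption: on loopback, an attempt that never completes is treated
    // as "no listener" instead of hanging forever.
    const timer = setTimeout(() => finish(false), timeoutMs);
    socket.once("connect", () => finish(true));
    // Keep an error handler attached so late socket errors are not unhandled.
    socket.on("error", () => finish(false));
  });
}
```

A connect-based check only answers "is a process still listening?"; the caller would still need to handle a bind failure when the real gateway `listen()` runs.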
Summary
Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT (~120 seconds). The probe in `waitForPortFree` (`electron/gateway/supervisor.ts`) binds with the default `exclusive: true`, which on Windows triggers `SO_EXCLUSIVEADDRUSE`. That option refuses to bind during TIME_WAIT, so the probe falsely reports the port as occupied for the full `timeoutMs` and aborts startup.
Trigger / repro
Observed on Windows portable installs (also reproducible on a stock dev install) running with the default `gateway.ports = [18789]`:
Customer log signature:
```
Port 18789 still occupied after 30000ms; aborting startup
```
The trigger is widespread because every channel config save goes through the full-restart path on the current main, and Clash/V2Ray-TUN setups (very common in mainland China) make the OS socket churn even more aggressive, increasing the chance of hitting this bug on every restart.
Fix
Single-line change: switch the probe socket to `exclusive: false`.
Verification
End-to-end test on Windows portable build (also with Clash TUN mode enabled, which exposes the bug more often):
Type of Change
Related
Reproduced and tracked in our downstream portable fork (dongsheng123132/clawx-portable@experimental); the gateway-port-deadlock has been blocking customers for several days. This is the minimal change that resolves it; happy to follow up with separate PRs for the deeper related issues (force-kill grace timeout, fallback_restart cause=windows path) if you would prefer them split out.