fix(gateway): bypass Windows TIME_WAIT in waitForPortFree probe#964

Open
dongsheng123132 wants to merge 1 commit into ValueCell-ai:main from dongsheng123132:upstream-pr/gateway-port-deadlock
@dongsheng123132

Summary

Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT (~120 seconds). The probe in waitForPortFree (electron/gateway/supervisor.ts) binds with the default exclusive: true, which on Windows triggers SO_EXCLUSIVEADDRUSE. That option refuses to bind during TIME_WAIT, so the probe falsely reports the port as occupied for the full timeoutMs and aborts startup.

Trigger / repro

Observed on Windows portable installs (also reproducible on stock dev install) running on default gateway.ports = [18789]:

  1. User saves any channel config (QQ Bot, WeChat, Feishu, WhatsApp, etc.).
  2. Gateway full-restart kicks in, force-kills the existing gateway pid.
  3. Old gateway listen socket lingers in TIME_WAIT for ~120 seconds.
  4. The new waitForPortFree(18789) probe sits in an EADDRINUSE retry loop for the full 30 s timeout and aborts.
  5. Channel runtime gets stuck in degraded / not connected (未连接) until the user either reboots Windows or waits ~2 minutes.

Customer log signature:

```
Port 18789 still occupied after 30000ms; aborting startup
```
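
For readers without the source at hand, the retry loop behind that log line can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual code in electron/gateway/supervisor.ts: the names `probeOnce` and `waitForPortFreeSketch`, the 200 ms retry interval, and the loop shape are assumptions. Only the port, the 30 s timeout, and the log message come from the report above.

```typescript
import * as net from 'node:net';

// Hypothetical single-shot probe: resolves true if a throwaway server can
// bind to host:port, false if the bind fails (for example with EADDRINUSE).
function probeOnce(port: number, host = '127.0.0.1'): Promise<boolean> {
  return new Promise((resolve) => {
    const server = net.createServer();
    server.once('error', () => resolve(false));
    server.once('listening', () => server.close(() => resolve(true)));
    server.listen(port, host);
  });
}

// Hypothetical retry loop: keeps probing until the port binds or the
// deadline passes, then throws with the message seen in customer logs.
async function waitForPortFreeSketch(
  port: number,
  timeoutMs = 30_000,
  intervalMs = 200, // assumed retry interval, not taken from the source
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probeOnce(port)) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Port ${port} still occupied after ${timeoutMs}ms; aborting startup`);
}
```

If every probe in the window is refused, a loop of this shape cannot finish earlier than `timeoutMs`, which matches the 30 s stall described in step 4.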

The trigger is widespread because every channel config save goes through the full-restart path on the current main, and Clash/V2Ray-TUN setups (very common in mainland China) make the OS socket churn even more aggressive, increasing the chance of hitting this bug on every restart.

Fix

Single-line change: switch the probe socket to exclusive: false.

```diff
- server.listen(port, '127.0.0.1');
+ server.listen({ port, host: '127.0.0.1', exclusive: false });
```

  • On Windows, bypasses SO_EXCLUSIVEADDRUSE and lets the probe bind during TIME_WAIT.
  • On Linux/macOS, sets SO_REUSEADDR (equivalent semantics, no behavior change in practice since the trigger is Windows-specific).
  • OpenClaw's actual gateway listen() is unaffected. The probe's only role is to confirm that the port is bindable; the real bind that follows succeeds in practice once the kernel has fully released the gateway socket after the force-kill.
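
A minimal sketch of a one-shot probe that uses the options form of `listen()` and reports why a bind was refused. `probeBind` and `ProbeResult` are hypothetical names for illustration and are not part of this PR; only `server.listen({ port, host, exclusive })` mirrors the changed line.

```typescript
import * as net from 'node:net';

interface ProbeResult {
  bindable: boolean;
  code?: string; // error code when the bind is refused, e.g. 'EADDRINUSE'
}

// Hypothetical helper: binds a throwaway server with the given `exclusive`
// flag, closes it again on success, and reports the error code on failure.
function probeBind(
  port: number,
  exclusive: boolean,
  host = '127.0.0.1',
): Promise<ProbeResult> {
  return new Promise((resolve) => {
    const server = net.createServer();
    server.once('error', (err: Error) => {
      resolve({ bindable: false, code: (err as NodeJS.ErrnoException).code });
    });
    server.once('listening', () => {
      server.close(() => resolve({ bindable: true }));
    });
    server.listen({ port, host, exclusive });
  });
}
```

Returning the error code lets a caller tell an occupied port (`EADDRINUSE`) apart from other bind failures, such as permission errors, instead of treating every failure as "still occupied".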

Verification

End-to-end test on Windows portable build (also with Clash TUN mode enabled, which exposes the bug more often):

  • Pre-fix: probe times out at 30 s after every channel-config save; channel UI shows abnormal degradation (异常降级) / not connected (未连接) until reboot.
  • Post-fix: port becomes bindable within <500 ms after force-kill; channel UI returns to connected (已连接) within ~30 s of saving config.
  • pnpm typecheck clean.
  • pnpm exec vitest run — no new failures.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Related

Reproduced and tracked in our downstream portable fork (dongsheng123132/clawx-portable@experimental); the gateway-port-deadlock has been blocking customers for several days. This is the minimal change that resolves it; happy to follow up with separate PRs for the deeper related issues (force-kill grace timeout, fallback_restart cause=windows path) if you would prefer them split out.

Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT
(~120 seconds). The `waitForPortFree` probe binds with the default
`exclusive: true`, which on Windows triggers `SO_EXCLUSIVEADDRUSE`.
That option refuses to bind during TIME_WAIT, so the probe falsely
reports the port as occupied for the full `timeoutMs` and `startup`
aborts.

Trigger path observed in customer logs (Windows portable, OpenClaw
4.15+/4.23+, default `gateway.ports = [18789]`):

1. User saves a channel config (QQ Bot, WeChat, Feishu, WhatsApp).
2. Gateway full-restart kicks in, force-kills the existing gateway pid.
3. Old gateway's listen socket lingers in TIME_WAIT for ~120 seconds.
4. New `waitForPortFree(18789)` probe sits in `EADDRINUSE` retry loop
   for the full 30 s timeout and aborts.
5. Channel runtime gets stuck in `degraded`/`未连接` (not connected) state
   until the user reboots Windows or waits ~2 minutes.

Customer log signature:
```
Port 18789 still occupied after 30000ms; aborting startup
```

Switching the probe to `exclusive: false`:
- On Windows, bypasses `SO_EXCLUSIVEADDRUSE` and lets us see the port
  as free during TIME_WAIT.
- On Linux/macOS, sets `SO_REUSEADDR` (equivalent semantics).
- OpenClaw's actual gateway `listen()` is unchanged — this probe's
  role is only to confirm the port is bindable; the real bind that
  follows succeeds in practice once the gateway socket is fully
  released by the kernel after force-kill.

Verified end-to-end on Windows portable USB (Clash TUN proxy enabled,
which triggers more aggressive OS socket churn): post-fix, port
becomes available in <500 ms after force-kill instead of timing out
at 30 000 ms.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf66f37252

// OpenClaw's actual gateway listen() is unaffected — this probe's
// role is only to confirm the port is bindable; the real bind that
// follows succeeds whether or not the OS is still in TIME_WAIT.
server.listen({ port, host: '127.0.0.1', exclusive: false });

[P1] Use a real TIME_WAIT workaround in port-free probe

This change is very likely a no-op for the reported Windows restart failure: in Node net.Server.listen, exclusive already defaults to false and only affects cluster handle sharing, so explicitly passing exclusive: false does not change bind behavior for this probe. Because of that, the EADDRINUSE loop during the TIME_WAIT window can still occur exactly as before, so the commit does not actually fix the startup deadlock it targets.
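
One way to check this claim empirically on a given machine is to probe the same port with both values of `exclusive` and compare the results. The sketch below is a hypothetical harness, and `tryBind` and `compareExclusive` are illustrative names. It reports what each flag value sees at the moment it is called; reproducing the TIME_WAIT state itself (force-killing a gateway on Windows just before the call) is left to the person running it.

```typescript
import * as net from 'node:net';

// Hypothetical helper: true if a throwaway server can bind with this flag.
function tryBind(port: number, exclusive: boolean): Promise<boolean> {
  return new Promise((resolve) => {
    const server = net.createServer();
    server.once('error', () => resolve(false));
    server.once('listening', () => server.close(() => resolve(true)));
    server.listen({ port, host: '127.0.0.1', exclusive });
  });
}

// Probes sequentially so the two throwaway servers never compete with
// each other for the port.
async function compareExclusive(
  port: number,
): Promise<{ exclusiveTrue: boolean; exclusiveFalse: boolean }> {
  const exclusiveTrue = await tryBind(port, true);
  const exclusiveFalse = await tryBind(port, false);
  return { exclusiveTrue, exclusiveFalse };
}
```

If `compareExclusive(18789)` returns the same value for both fields immediately after a force-kill, the flag is not changing bind behavior for that socket state in a non-cluster process, which is what the review predicts. Differing values would support the PR description instead.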
