
fix(gateway): wait for force-kill to complete + post-kill OS settle + bump timeouts #966

Open
dongsheng123132 wants to merge 2 commits into ValueCell-ai:main from dongsheng123132:upstream-pr/gateway-force-kill-grace

Conversation

@dongsheng123132


⚠️ Depends on #964

This PR builds on #964 (the `exclusive: false` probe socket fix). It assumes #964 is merged first; otherwise the diff will overlap on the comment block in waitForPortFree. Easy to rebase if you'd prefer to land them together or in a different order — just let me know.

Summary

Even with the `exclusive: false` probe fix from #964, customer logs from Windows portable installs still showed the gateway deadlocking after channel-config saves:

```
05:58:33 Gateway did not exit in time, force-killing (pid=13092)
05:59:05 Port 18789 still occupied after 30000ms; aborting startup
```

When we remoted into the customer machine 5 minutes later, PID 13092 was still LISTENING on 18789. So the probe-side fix wasn't enough — the listener was actually still alive.

Three root causes addressed

1. Force-kill was fire-and-forget

`terminateOwnedGatewayProcess` kicked off `taskkill /F /PID` then resolved its outer promise immediately on the 5 s timeout, even though the kill itself could still be queued by Windows (security software, ProcessProtection). The next gateway start would race the kill and find the port held by a still-running PID.

Fix: `terminateWindowsProcessTree` now awaits taskkill's callback (was fire-and-forget) and bumps its inner timeout from 5 s to 10 s.
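
A minimal sketch of what awaiting the kill can look like, assuming Node's `execFile`. The names `settleOnce` and `forceKillTree` and the `/T` (tree) flag are illustrative assumptions, not taken from the diff; only the 10 s inner timeout comes from this PR.

```typescript
import { execFile } from 'node:child_process';

type KillOutcome = 'killed' | 'failed' | 'timed-out';

// Resolve exactly once: when `start` reports completion or when the timer fires.
function settleOnce(
  start: (done: (err: Error | null) => void) => void,
  timeoutMs: number,
): Promise<KillOutcome> {
  return new Promise((resolve) => {
    let settled = false;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const finish = (outcome: KillOutcome): void => {
      if (settled) return;
      settled = true;
      if (timer !== undefined) clearTimeout(timer);
      resolve(outcome);
    };
    timer = setTimeout(() => finish('timed-out'), timeoutMs);
    start((err) => finish(err ? 'failed' : 'killed'));
  });
}

// Hypothetical wrapper: wait for taskkill itself to return instead of
// firing and forgetting. The /T flag is an assumption about the real code.
function forceKillTree(pid: number, timeoutMs = 10_000): Promise<KillOutcome> {
  return settleOnce(
    (done) => execFile('taskkill', ['/F', '/T', '/PID', String(pid)], (err) => done(err)),
    timeoutMs,
  );
}
```

The caller can then distinguish a kill that completed from one that was still pending when the timer fired, and only start the next gateway in the first case.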

2. Graceful shutdown window of 5 s was too short

OpenClaw with the dingtalk + feishu + wecom + openclaw-weixin extensions routinely takes 6–10 s to close all websocket connections and the http server cleanly. Almost every restart hit the 5 s timeout and triggered the force-kill path, which then ran into (1).

Fix: Raise the graceful shutdown timeout from 5 s to 15 s.
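
As a rough illustration of the graceful-then-forced sequence, the sketch below waits for an `exit` event up to a deadline before escalating. `exitedWithin`, `stopGateway`, `requestShutdown` and `forceKill` are hypothetical names; how the real gateway is asked to exit is not shown in this PR.

```typescript
import { EventEmitter } from 'node:events';

const GRACEFUL_TIMEOUT_MS = 15_000; // value quoted in this PR

// True if 'exit' arrives within timeoutMs, false if the deadline passes first.
function exitedWithin(child: EventEmitter, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const onExit = (): void => {
      clearTimeout(timer);
      resolve(true);
    };
    const timer = setTimeout(() => {
      child.off('exit', onExit);
      resolve(false);
    }, timeoutMs);
    child.once('exit', onExit);
  });
}

// Hypothetical caller: ask politely first, escalate only after the deadline.
async function stopGateway(
  child: EventEmitter,
  requestShutdown: () => void, // assumption: however the gateway is told to exit
  forceKill: () => Promise<unknown>,
  timeoutMs: number = GRACEFUL_TIMEOUT_MS,
): Promise<'graceful' | 'forced'> {
  // Register the exit listener before requesting shutdown so a fast exit is not missed.
  const exited = exitedWithin(child, timeoutMs);
  requestShutdown();
  if (await exited) return 'graceful';
  await forceKill();
  return 'forced';
}
```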

3. No post-kill settle

Even a successful taskkill takes a beat before the kernel actually reclaims the socket, so the immediate next probe in `waitForPortFree` saw it as occupied.

Fix: After both graceful exit and forced kill, sleep `postKillSettleMs` (2 s on Windows, 500 ms elsewhere) before resolving so the parent doesn't race the kernel's socket cleanup.
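
The settle step might look roughly like this. The delay values are the ones quoted above; `sleep` and `terminateThenSettle` are illustrative helpers, not names from the diff.

```typescript
// Settle delays quoted in this PR: 2 s on Windows, 500 ms elsewhere.
function postKillSettleMs(platform: string): number {
  return platform === 'win32' ? 2_000 : 500;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: run any termination step, then pause before
// reporting completion so the caller does not probe the port too early.
async function terminateThenSettle(
  terminate: () => Promise<void>,
  platform: string = process.platform,
): Promise<void> {
  await terminate();
  await sleep(postKillSettleMs(platform));
}
```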

Bonus: rescue path for AV-blocked kills

The `waitForPortFree` default timeout rises from 30 s to 60 s, and at the half-way mark it now attempts a one-shot rescue: identify the LISTENING PID via netstat and force-kill it. This catches the case where an earlier shutdown's force-kill failed entirely (blocked by AV) and an orphaned process is keeping the socket alive.
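
One way the rescue lookup could work, sketched under the assumption that the input is the English-locale column layout of `netstat -ano` (Proto, Local Address, Foreign Address, State, PID). `findListeningPid` and `shouldAttemptRescue` are hypothetical names, and localized Windows builds may print a different state string.

```typescript
// Return the PID that is LISTENING on `port`, or null if no row matches.
function findListeningPid(netstatOutput: string, port: number): number | null {
  for (const line of netstatOutput.split(/\r?\n/)) {
    // Expected columns: Proto, Local Address, Foreign Address, State, PID
    const [proto, local, , state, pidText] = line.trim().split(/\s+/);
    if (proto !== 'TCP' || state !== 'LISTENING') continue;
    if (!local?.endsWith(`:${port}`)) continue;
    const pid = Number(pidText);
    if (Number.isInteger(pid) && pid > 0) return pid;
  }
  return null;
}

// One-shot gate: true only while no rescue has been tried and at least
// half of the timeout has elapsed.
function shouldAttemptRescue(
  elapsedMs: number,
  timeoutMs: number,
  alreadyTried: boolean,
): boolean {
  return !alreadyTried && elapsedMs >= timeoutMs / 2;
}
```

Matching on the local address column only keeps client connections to the same port (where 18789 appears as the foreign address) from being mistaken for the listener.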

Verification

  • `pnpm typecheck` clean.
  • End-to-end on Windows portable USB build with Clash TUN proxy: previously the gateway would deadlock 1-2 times per session after channel-config saves; post-fix, no deadlock observed in 6 hours of testing.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Pre-Submission Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code with detailed inline explanations of the Windows process / kernel semantics
  • My changes generate no new warnings (typecheck passes)
  • I have manually tested the fix path under the trigger conditions described above

Related

Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT
(~120 seconds). The `waitForPortFree` probe binds with the default
`exclusive: true`, which on Windows triggers `SO_EXCLUSIVEADDRUSE`.
That option refuses to bind during TIME_WAIT, so the probe falsely
reports the port as occupied for the full `timeoutMs` and `startup`
aborts.

Trigger path observed in customer logs (Windows portable, OpenClaw
4.15+/4.23+, default `gateway.ports = [18789]`):

1. User saves a channel config (QQ Bot, WeChat, Feishu, WhatsApp).
2. Gateway full-restart kicks in, force-kills the existing gateway pid.
3. Old gateway's listen socket lingers in TIME_WAIT for ~120 seconds.
4. New `waitForPortFree(18789)` probe sits in `EADDRINUSE` retry loop
   for the full 30 s timeout and aborts.
5. Channel runtime gets stuck in `degraded`/`未连接` ("not connected")
   state until the user reboots Windows or waits ~2 minutes.

Customer log signature:
```
Port 18789 still occupied after 30000ms; aborting startup
```

Switching the probe to `exclusive: false`:
- On Windows, bypasses `SO_EXCLUSIVEADDRUSE` and lets us see the port
  as free during TIME_WAIT.
- On Linux/macOS, sets `SO_REUSEADDR` (equivalent semantics).
- OpenClaw's actual gateway `listen()` is unchanged — this probe's
  role is only to confirm the port is bindable; the real bind that
  follows succeeds in practice once the gateway socket is fully
  released by the kernel after force-kill.
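
A bind-and-close probe of this kind can be sketched as follows, using Node's `net` module. `isPortBindable` and the `127.0.0.1` default host are assumptions for illustration; the socket-level effect of `exclusive: false` is as described above and is not verified by this sketch.

```typescript
import { createServer } from 'node:net';

// Resolve true if a throwaway server can bind the port, false on any bind error.
function isPortBindable(port: number, host = '127.0.0.1'): Promise<boolean> {
  return new Promise((resolve) => {
    const probe = createServer();
    probe.once('error', () => resolve(false));
    // Release the port again before reporting it as bindable.
    probe.once('listening', () => probe.close(() => resolve(true)));
    probe.listen({ port, host, exclusive: false });
  });
}
```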

Verified end-to-end on Windows portable USB (Clash TUN proxy enabled,
which triggers more aggressive OS socket churn): post-fix, port
becomes available in <500 ms after force-kill instead of timing out
at 30 000 ms.
… bump timeouts

Builds on the previous TIME_WAIT probe fix (ValueCell-ai#964). Customer logs from
Windows portable installs (pc-6545 reproducer) showed the gateway
still deadlocking even with `exclusive: false` on the probe socket:

  05:58:33 Gateway did not exit in time, force-killing (pid=13092)
  05:59:05 Port 18789 still occupied after 30000ms; aborting startup

Investigation revealed three root causes the previous PR did not address:

1. Force-kill was fire-and-forget. `terminateOwnedGatewayProcess` kicked
   off `taskkill /F /PID` then resolved its outer promise immediately
   on the 5 s timeout, even though the kill itself could still be
   queued by Windows (security software, ProcessProtection). The next
   gateway start would race the kill and find the port held by a
   still-running PID — exactly what we observed (PID 13092 was still
   LISTENING when we remoted in 5 minutes later).

2. Graceful shutdown window of 5 s was too short. OpenClaw with the
   dingtalk + feishu + wecom + openclaw-weixin extensions routinely
   takes 6 - 10 s to close all websocket connections and the http
   server cleanly. Almost every restart hit the timeout and triggered
   the force-kill path, which then ran into (1).

3. No post-kill settle. Even a successful taskkill takes a beat
   before the kernel actually reclaims the socket, so the immediate
   next probe in `waitForPortFree` saw it as occupied.

Changes:

- `terminateWindowsProcessTree` now awaits taskkill's callback (was
  fire-and-forget) and bumps its inner timeout from 5 s to 10 s.
- Graceful shutdown timeout raised from 5 s to 15 s.
- After both graceful exit and forced kill, sleep `postKillSettleMs`
  (2 s on Windows, 500 ms elsewhere) before resolving so the parent
  doesn't race the kernel's socket cleanup.
- `waitForPortFree` default raised from 30 s to 60 s, and at the
  half-way mark it now attempts a one-shot rescue: identify the
  LISTENING PID via netstat and force-kill it. This catches the case
  where an earlier shutdown's force-kill failed entirely (blocked by
  AV) and an orphaned process is keeping the socket alive.
- The `findExistingGatewayProcess` call site bumped its
  `waitForPortFree` argument to match (30 s to 60 s).

This is a targeted patch on top of the existing `exclusive: false`
probe fix in ValueCell-ai#964, not a replacement. Both are needed: the probe-side
fix allows binding through TIME_WAIT, this PR ensures the listener is
actually gone first.

Verified: `pnpm typecheck` clean. End-to-end on Windows portable USB
build with Clash TUN proxy: previously the gateway would deadlock 1-2
times per session after channel-config saves; post-fix, no deadlock
observed in 6 hours of testing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Contributor


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9f1a54dcf


const forkEnv: Record<string, string | undefined> = {
  ...baseEnvPatched,
  ...uvEnv,
  OPENCLAW_EMBEDDED_IN: 'U-Claw',


[P2] Keep embedded app marker consistent for doctor subprocess

Set `OPENCLAW_EMBEDDED_IN` to the same app id used by the rest of ClawX launches. This change introduces `U-Claw` only in `runOpenClawDoctorRepair`, while other ClawX-managed OpenClaw entry points use `ClawX` (for example `electron/utils/openclaw-cli.ts` and `resources/cli/*`). Running doctor with a different embedded marker can make OpenClaw treat the subprocess as a different host context, causing doctor-repair behavior to diverge from normal ClawX runtime behavior.

