fix(gateway): wait for force-kill to complete + post-kill OS settle + bump timeouts by dongsheng123132 · Pull Request #966 · ValueCell-ai/ClawX

dongsheng123132 · 2026-05-02T06:30:02Z

⚠️ Depends on #964

This PR builds on #964 (the `exclusive: false` probe socket fix). It assumes #964 is merged first; otherwise the diff will overlap on the comment block in waitForPortFree. Easy to rebase if you'd prefer to land them together or in a different order — just let me know.

Summary

Even with the `exclusive: false` probe fix from #964, customer logs from Windows portable installs still showed the gateway deadlocking after channel-config saves:

```
05:58:33 Gateway did not exit in time, force-killing (pid=13092)
05:59:05 Port 18789 still occupied after 30000ms; aborting startup
```

When we remoted into the customer machine 5 minutes later, PID 13092 was still LISTENING on 18789. So the probe-side fix wasn't enough — the listener was actually still alive.

Three root causes addressed

1. Force-kill was fire-and-forget

`terminateOwnedGatewayProcess` kicked off `taskkill /F /PID` then resolved its outer promise immediately on the 5 s timeout, even though the kill itself could still be queued by Windows (security software, ProcessProtection). The next gateway start would race the kill and find the port held by a still-running PID.

Fix: `terminateWindowsProcessTree` now awaits taskkill's callback (was fire-and-forget) and bumps its inner timeout from 5 s to 10 s.
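As a rough illustration of "await the callback instead of fire-and-forget", here is a minimal sketch. The names `killTreeAndWait`, `Runner`, and `runCommand` are hypothetical and not taken from the PR's diff. The `/T` flag (kill the whole tree) is an assumption based on the function name, and the injectable runner exists only so the sketch can be exercised without a real `taskkill`.

```typescript
import { execFile } from 'node:child_process';

// Hypothetical runner type so the kill command can be swapped out in tests.
type Runner = (file: string, args: string[], timeoutMs: number) => Promise<void>;

// Resolves once the command has exited; rejects on non-zero exit or timeout.
const runCommand: Runner = (file, args, timeoutMs) =>
  new Promise<void>((resolve, reject) => {
    execFile(file, args, { timeout: timeoutMs, windowsHide: true }, (err) => {
      if (err) reject(err);
      else resolve();
    });
  });

// Sketch only: the name and the boolean return value are assumptions.
async function killTreeAndWait(
  pid: number,
  timeoutMs = 10_000, // inner timeout quoted in the PR description
  run: Runner = runCommand,
): Promise<boolean> {
  try {
    // /F forces termination; /T (assumed here) also terminates child processes.
    await run('taskkill', ['/F', '/T', '/PID', String(pid)], timeoutMs);
    return true;
  } catch {
    // taskkill failed or timed out, so the caller must not assume the port is free.
    return false;
  }
}
```

Returning a boolean rather than throwing lets the caller decide whether a failed kill should fall through to a later rescue attempt.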

2. Graceful shutdown window of 5 s was too short

OpenClaw with the dingtalk + feishu + wecom + openclaw-weixin extensions routinely takes 6–10 s to close all websocket connections and the http server cleanly. Almost every restart hit the 5 s timeout and triggered the force-kill path, which then ran into (1).

Fix: Raise the graceful shutdown timeout from 5 s to 15 s.
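One way to picture the graceful window is a race between the child's `exit` event and a timer. The helper below is a hypothetical sketch (`waitForExit` is not a name from the PR); it accepts any `EventEmitter`, which covers a Node `ChildProcess`.

```typescript
import { EventEmitter } from 'node:events';

// Returns true if `proc` emits 'exit' within timeoutMs, false if the timer wins.
function waitForExit(proc: EventEmitter, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const onExit = () => {
      clearTimeout(timer);
      resolve(true);
    };
    const timer = setTimeout(() => {
      // Remove the listener so a late exit does not leak a handler.
      proc.off('exit', onExit);
      resolve(false);
    }, timeoutMs);
    proc.once('exit', onExit);
  });
}
```

A caller would use something like `if (!(await waitForExit(child, 15_000))) { /* fall back to force-kill */ }`, so a gateway that needs 6–10 s to close its connections no longer lands on the force-kill path.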

3. No post-kill settle

Even a successful taskkill takes a beat before the kernel actually reclaims the socket, so the immediate next probe in `waitForPortFree` saw it as occupied.

Fix: After both graceful exit and forced kill, sleep `postKillSettleMs` (2 s on Windows, 500 ms elsewhere) before resolving so the parent doesn't race the kernel's socket cleanup.
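The settle step can be sketched as a platform-dependent pause that runs after termination and before the promise resolves. The 2 s and 500 ms values come from the description above; the function names `postKillSettleMs` (as a function), `sleep`, and `terminateThenSettle` are illustrative assumptions.

```typescript
// Values from the PR description; expressing them as a function is an assumption.
function postKillSettleMs(platform: string = process.platform): number {
  return platform === 'win32' ? 2_000 : 500;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs the terminate step (graceful or forced), then pauses before resolving
// so the caller's next port probe does not race the kernel's socket cleanup.
async function terminateThenSettle(
  terminate: () => Promise<void>,
  settleMs: number = postKillSettleMs(),
): Promise<void> {
  await terminate();
  await sleep(settleMs);
}
```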

Bonus: rescue path for AV-blocked kills

The `waitForPortFree` default timeout goes from 30 s to 60 s, and at the halfway mark it now attempts a one-shot rescue: identify the LISTENING PID via netstat and force-kill it. This catches the case where an earlier shutdown's force-kill failed entirely (blocked by AV) and an orphaned process is keeping the socket alive.
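For the "identify the LISTENING PID via netstat" step, a minimal parsing sketch might look like the following. It assumes the usual Windows `netstat -ano` column layout (Proto, Local Address, Foreign Address, State, PID); `findListeningPid` is a hypothetical name, and the PR's actual parsing may differ.

```typescript
// Scans netstat-style output for a TCP LISTENING row on the given local port.
// Returns the owning PID, or null when no such row exists.
function findListeningPid(netstatOutput: string, port: number): number | null {
  for (const line of netstatOutput.split(/\r?\n/)) {
    const cols = line.trim().split(/\s+/);
    // Expected columns: Proto, Local Address, Foreign Address, State, PID
    if (cols.length < 5 || cols[0] !== 'TCP' || cols[3] !== 'LISTENING') continue;
    // Matches both "0.0.0.0:18789" and "[::]:18789" style local addresses.
    if (!cols[1].endsWith(`:${port}`)) continue;
    const pid = Number.parseInt(cols[4], 10);
    if (Number.isInteger(pid) && pid > 0) return pid;
  }
  return null;
}
```

Keeping the parser pure (string in, PID out) means it can be checked against captured output without spawning `netstat`; the caller would feed it the command's stdout and pass the result to the force-kill routine.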

Verification

`pnpm typecheck` clean.
End-to-end on Windows portable USB build with Clash TUN proxy: previously the gateway would deadlock 1-2 times per session after channel-config saves; post-fix, no deadlock observed in 6 hours of testing.

Type of Change

Bug fix (non-breaking change which fixes an issue)

Pre-Submission Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code with detailed inline explanations of the Windows process / kernel semantics
My changes generate no new warnings (typecheck passes)
I have manually tested the fix path under the trigger conditions described above

Related

Builds on #964, "fix(gateway): bypass Windows TIME_WAIT in waitForPortFree probe" (probe-side TIME_WAIT fix)
Companion to #965, "fix(config): atomic JSON writes to prevent half-read races during startup" (atomic JSON writes)
All three were discovered while running our downstream portable fork (`dongsheng123132/clawx-portable@experimental`).

Force-killing the gateway leaves port 18789 in Windows TCP TIME_WAIT (~120 seconds). The `waitForPortFree` probe binds with the default `exclusive: true`, which on Windows triggers `SO_EXCLUSIVEADDRUSE`. That option refuses to bind during TIME_WAIT, so the probe falsely reports the port as occupied for the full `timeoutMs` and `startup` aborts.

Trigger path observed in customer logs (Windows portable, OpenClaw 4.15+/4.23+, default `gateway.ports = [18789]`):

1. User saves a channel config (QQ Bot, WeChat, Feishu, WhatsApp).
2. Gateway full-restart kicks in and force-kills the existing gateway pid.
3. Old gateway's listen socket lingers in TIME_WAIT for ~120 seconds.
4. New `waitForPortFree(18789)` probe sits in the `EADDRINUSE` retry loop for the full 30 s timeout and aborts.
5. Channel runtime gets stuck in the `degraded`/`未连接` ("not connected") state until the user reboots Windows or waits ~2 minutes.

Customer log signature:

```
Port 18789 still occupied after 30000ms; aborting startup
```

Switching the probe to `exclusive: false`:

- On Windows, bypasses `SO_EXCLUSIVEADDRUSE` and lets us see the port as free during TIME_WAIT.
- On Linux/macOS, sets `SO_REUSEADDR` (equivalent semantics).
- OpenClaw's actual gateway `listen()` is unchanged; this probe's role is only to confirm the port is bindable, and the real bind that follows succeeds in practice once the gateway socket is fully released by the kernel after force-kill.

Verified end-to-end on Windows portable USB (Clash TUN proxy enabled, which triggers more aggressive OS socket churn): post-fix, the port becomes available in <500 ms after force-kill instead of timing out at 30 000 ms.
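A bind probe of this kind can be sketched with Node's `net` module as below. This is an illustrative reconstruction, not the code from #964: `canBindPort` and its `host` default are assumptions, and the OS-level effect of `exclusive: false` is as described in the commit message above rather than something this sketch demonstrates.

```typescript
import * as net from 'node:net';

// Resolves true if a throwaway server can bind the port, false on any bind error.
function canBindPort(port: number, host = '127.0.0.1'): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const probe = net.createServer();
    probe.once('error', () => resolve(false));
    // Release the port again before reporting success.
    probe.once('listening', () => probe.close(() => resolve(true)));
    // exclusive: false is the option the probe fix switches to.
    probe.listen({ port, host, exclusive: false });
  });
}
```

A retry loop would call `canBindPort(18789)` until it returns `true` or the overall timeout expires. A live listener on the same address and port still makes the probe fail, which is the situation this PR goes on to address.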

… bump timeouts

Builds on the previous TIME_WAIT probe fix (ValueCell-ai#964). Customer logs from Windows portable installs (pc-6545 reproducer) showed the gateway still deadlocking even with `exclusive: false` on the probe socket:

    05:58:33 Gateway did not exit in time, force-killing (pid=13092)
    05:59:05 Port 18789 still occupied after 30000ms; aborting startup

Investigation revealed three root causes the previous PR did not address:

1. Force-kill was fire-and-forget. `terminateOwnedGatewayProcess` kicked off `taskkill /F /PID` then resolved its outer promise immediately on the 5 s timeout, even though the kill itself could still be queued by Windows (security software, ProcessProtection). The next gateway start would race the kill and find the port held by a still-running PID, exactly what we observed (PID 13092 was still LISTENING when we remoted in 5 minutes later).
2. Graceful shutdown window of 5 s was too short. OpenClaw with the dingtalk + feishu + wecom + openclaw-weixin extensions routinely takes 6–10 s to close all websocket connections and the http server cleanly. Almost every restart hit the timeout and triggered the force-kill path, which then ran into (1).
3. No post-kill settle. Even a successful taskkill takes a beat before the kernel actually reclaims the socket, so the immediate next probe in `waitForPortFree` saw it as occupied.

Changes:

- `terminateWindowsProcessTree` now awaits taskkill's callback (was fire-and-forget) and bumps its inner timeout from 5 s to 10 s.
- Graceful shutdown timeout: 5 s to 15 s.
- After both graceful exit and forced kill, sleep `postKillSettleMs` (2 s on Windows, 500 ms elsewhere) before resolving so the parent doesn't race the kernel's socket cleanup.
- `waitForPortFree` default 30 s to 60 s, and at the half-way mark it now attempts a one-shot rescue: identify the LISTENING PID via netstat and force-kill it. Catches the case where an earlier shutdown's force-kill failed entirely (blocked by AV) and an orphaned process is keeping the socket alive.
- The `findExistingGatewayProcess` call site bumped its `waitForPortFree` argument to match (30 s to 60 s).

This is a targeted patch on top of the existing `exclusive: false` probe fix in ValueCell-ai#964, not a replacement. Both are needed: the probe-side fix allows binding through TIME_WAIT, this PR ensures the listener is actually gone first.

Verified: `pnpm typecheck` clean. End-to-end on Windows portable USB build with Clash TUN proxy: previously the gateway would deadlock 1-2 times per session after channel-config saves; post-fix, no deadlock observed in 6 hours of testing.
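The polling loop with a one-shot rescue at the half-way mark, as listed in the changes above, could be structured roughly like this. It is a sketch under assumptions: `waitForPortFreeSketch`, the `isFree` and `rescue` callbacks, and the 250 ms polling interval are all hypothetical and not taken from the diff.

```typescript
type Probe = () => Promise<boolean>;
type Rescue = () => Promise<void>;

// Polls `isFree` until it succeeds or timeoutMs elapses. Once half the budget
// is spent, runs `rescue` a single time (e.g. find and kill the orphaned listener).
async function waitForPortFreeSketch(
  isFree: Probe,
  rescue: Rescue,
  timeoutMs = 60_000, // default quoted in the commit message
  intervalMs = 250, // assumed polling interval
): Promise<boolean> {
  const start = Date.now();
  let rescued = false;
  while (Date.now() - start < timeoutMs) {
    if (await isFree()) return true;
    if (!rescued && Date.now() - start >= timeoutMs / 2) {
      rescued = true; // one-shot: never retried even if it fails
      await rescue().catch(() => undefined);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

Swallowing the rescue error keeps the loop polling for the remaining half of the budget, so a blocked kill degrades to the old timeout behavior instead of throwing from inside the wait.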

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9f1a54dcf


chatgpt-codex-connector · 2026-05-02T06:33:16Z

    const forkEnv: Record<string, string | undefined> = {
      ...baseEnvPatched,
      ...uvEnv,
+      OPENCLAW_EMBEDDED_IN: 'U-Claw',


Keep embedded app marker consistent for doctor subprocess

Set `OPENCLAW_EMBEDDED_IN` to the same app id used by the rest of ClawX launches. This change introduces `U-Claw` only in `runOpenClawDoctorRepair`, while other ClawX-managed OpenClaw entry points use `ClawX` (for example `electron/utils/openclaw-cli.ts` and `resources/cli/*`). Running doctor with a different embedded marker can make OpenClaw treat the subprocess as a different host context, causing doctor-repair behavior to diverge from normal ClawX runtime behavior.


dongsheng123132 added 2 commits May 2, 2026 14:21

chatgpt-codex-connector Bot reviewed May 2, 2026

dongsheng123132 wants to merge 2 commits into ValueCell-ai:main from dongsheng123132:upstream-pr/gateway-force-kill-grace
