Skip to content

Fix connection snapshot truth#918

Merged
shanselman merged 4 commits into
openclaw:mainfrom
bkudiess:bkudiess-connection-snapshot-truth
Jul 2, 2026
Merged

Fix connection snapshot truth#918
shanselman merged 4 commits into
openclaw:mainfrom
bkudiess:bkudiess-connection-snapshot-truth

Conversation

@bkudiess

@bkudiess bkudiess commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: Connection truth was fragmented across manager snapshots, legacy AppState.Status, raw gateway status callbacks, MCP startup state, and UI-specific projections; this allowed false healthy/connected states.
  • Why it matters: Users could see contradictory status (for example Connection page connected while top pill stayed connecting) or a healthy-looking app when node/MCP startup was blocked.
  • What changed: Made GatewayConnectionSnapshot / OverallConnectionState the manager-owned lifecycle truth; added node intent/blockers; made MCP startup honest; updated Connection page, top pill, tray/menu/dashboard, Command Center, notifications, and MCP app status/settings behavior to consume derived snapshot semantics.
  • User impact: Connection surfaces now agree on connected/degraded/pairing/MCP-only/MCP-failed states, and node-mode toggles/reconnects recover reliably.
  • What did NOT change (scope boundary): Did not introduce a separate ConnectionTruth model; deeper removal of legacy ConnectionStatus read-side consumers is intentionally deferred to a follow-up cleanup PR.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs / instructions
  • Tests / validation
  • Security hardening
  • Chore / infra

Scope (select all touched areas)

  • Tray / WinUI UX
  • Windows node capability
  • Local MCP / winnode
  • Gateway / connection / pairing
  • Setup / onboarding
  • Permissions / privacy / security
  • Tests / CI / docs

Linked Issue/PR

Validation

  • ./build.ps1 — passed
  • dotnet test .\tests\OpenClaw.Shared.Tests\OpenClaw.Shared.Tests.csproj --no-restore — passed: 2686 passed, 31 skipped
  • dotnet test .\tests\OpenClaw.Tray.Tests\OpenClaw.Tray.Tests.csproj --no-restore — passed: 1488 passed
  • dotnet test .\tests\OpenClaw.Connection.Tests\OpenClaw.Connection.Tests.csproj --no-restore — passed: 395 passed
  • dotnet test .\tests\OpenClaw.WinNode.Cli.Tests\OpenClaw.WinNode.Cli.Tests.csproj --no-restore — passed: 120 passed

Notes:

  • OpenClaw.WinNode.Cli.Tests needed one initial restore/no-restore warmup in this fresh worktree; the final --no-restore run passed.
  • During earlier validation, known flaky timeout tests (AssistantBridgeServiceTests.StartListenServiceAsync_KillsTimedOutBackendCommand, an ExecApprovals runtime proof, and a token-recovery wait) failed once and passed on rerun. The final validation set above passed.

Real behavior proof

  • Environment tested: Windows ARM64, real %APPDATA%\OpenClawTray profile, local OpenClawGateway via ws://localhost:18789, PR branch bkudiess-connection-snapshot-truth.
  • PR head / commit tested: 1efcfa19
  • Exact steps or command run:
    • Launched C:\Projects\copilot-worktrees\openclaw-windows-node\bkudiess-solid-fishstick\src\OpenClaw.Tray.WinUI\bin\Debug\net10.0-windows10.0.22621.0\win-arm64\OpenClaw.Tray.WinUI.exe.
    • Used local MCP HTTP at http://127.0.0.1:8765/ with bearer token from %APPDATA%\OpenClawTray\mcp-token.txt.
    • Called JSON-RPC tools/list, tools/call app.status, tools/call app.menu, and tools/call app.nodes on the current PR head.
  • Evidence after fix:
    • tools/list returned 48 tools and included app.status, app.menu, app.connection.reconnect, app.connection.reconnectNode, and app.settings.set.
    • Current-head app.status output includes manager-owned state fields:
      {
        "connectionStatus": "Connected",
        "overallState": "Ready",
        "operatorState": "Connected",
        "nodeState": "Connected",
        "nodeConnected": true,
        "nodePaired": true,
        "nodePendingApproval": false,
        "nodeError": null,
        "gatewayVersion": "2026.6.10",
        "sessionCount": 0,
        "nodeCount": 1,
        "operatorScopes": ["operator.admin", "operator.pairing", "operator.read", "operator.write"],
        "operatorDeviceId": null
      }
    • Current-head app.menu status item includes manager-owned state fields:
      {
        "type": "status",
        "status": "Connected",
        "overallState": "Ready",
        "nodeState": "Connected",
        "nodeError": null
      }
    • Current-head app.nodes returned online Windows node Windows Node (PERSEID) with CapabilityCount: 8 and CommandCount: 28.
    • Earlier current-head real-profile proof also verified app.settings.set { "name": "EnableNodeMode", "value": "false" } applies the settings lifecycle and changes app.status to nodeConnected:false / nodePaired:false; re-enabling Node mode plus app.connection.reconnectNode restores nodeConnected:true / nodePaired:true.
  • Observed result: Gateway/operator and node are connected and paired on the real profile; MCP clients now receive explicit overallState, operatorState, nodeState, and nodeError fields instead of only legacy connectionStatus.
  • Screenshot/artifact links verified? (Yes/No/N/A): N/A.
  • Not verified / blocked: Computer-use UI inspection/screenshot is blocked in this environment because Copilot Computer Use is not approved for OpenClaw.Tray.WinUI and the host cannot show an approval prompt. Runtime proof uses raw MCP JSON-RPC plus real-profile gateway/node status. I did not intentionally induce a real degraded/blocked node state in the user profile to avoid corrupting paired credentials; blocked/degraded node paths are covered by focused manager tests (HandshakeSucceeded_NodeConnectorThrows_ReportsBlockedNode, HandshakeSucceeded_PreviousNodeDisconnectThrows_ReportsBlockedNode, missing credential/connector/record tests, and projection/source-contract tests).

Rubber-duck / review notes:

  • Ran multiple rubber-duck/ClawSweeper reviews and fixed all blocking/high-confidence findings, including connector-throw node startup blocking, previous connector retirement blocking, and explicit manager snapshot fields in MCP app.status/app.menu: stale node intent after disabling Node mode, stale node errors masking pairing, false Ready snapshot before node blocker, generation-guarded node blockers, Command Center status precedence, top pill refresh on every manager snapshot, MCP settings lifecycle, and MCP app.status snapshot source.
  • Investigated Redesign WSL gateway setup and OpenClaw onboard UX #792 overlap. Actual overlap is limited to src/OpenClaw.Tray.WinUI/App.xaml.cs and tests/OpenClaw.Tray.Tests/AppRefactorContractTests.cs; synthetic merge with Redesign WSL gateway setup and OpenClaw onboard UX #792 had 0 conflict markers.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No new capability permissions. Existing MCP app settings behavior now applies the same reconnect/reload side effects as UI settings save.
  • Secrets/tokens handling changed? (Yes/No): No.
  • New/changed network calls? (Yes/No): No new external network calls. Local MCP HTTP behavior/documentation updated.
  • Command/tool execution surface changed? (Yes/No): Yes, existing app.settings.set now applies settings lifecycle after persisting; this makes MCP behavior match UI settings save instead of silently writing config only. Existing app.status / app.menu outputs now include explicit manager-owned state fields.
  • Data access scope changed? (Yes/No): No.
  • If any Yes, explain risk + mitigation: app.settings.set already existed and was allowlisted to safe non-secret settings only. Applying the normal settings lifecycle is expected behavior and is covered by source contract tests plus real MCP proof. Added app.status / app.menu fields expose connection state already visible in the app; no secrets are included.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes.
  • Config/env changes? (Yes/No): No.
  • Migration needed? (Yes/No): No.
  • If yes, exact upgrade steps: N/A.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only conversations that still need reviewer or maintainer judgment.

No inline bot review conversations exist yet; ClawSweeper feedback has been addressed in branch commits and PR-body proof updates.

Make GatewayConnectionManager snapshots the lifecycle source of truth for connection surfaces. Capture node intent/blockers before startup can silently skip, make MCP startup failures visible, and keep legacy status consumers derived from manager state while preserving operator-live behavior.

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
@clawsweeper

clawsweeper Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codex review: needs real behavior proof before merge. Reviewed July 2, 2026, 3:11 PM ET / 19:11 UTC.

Summary
The PR makes GatewayConnectionSnapshot/OverallConnectionState the manager-owned lifecycle truth, exposes node intent/blockers through tray/WinUI and MCP app tools, and adds connection/MCP regression tests and docs.

Reproducibility: yes. Source inspection of current main shows operator-connected plus node-idle can still derive a healthy-looking connected state without node intent truth; I did not run the Windows app in this read-only review.

Review metrics: 3 noteworthy metrics.

  • Changed surface: 33 files, +1913/-181. The diff spans connection lifecycle, tray UI, MCP app tools, docs, and tests, so maintainers should review it as an availability-sensitive change.
  • MCP app tools affected: 3 changed: app.status, app.menu, app.settings.set. These are agent-visible tool contracts, and the latest app.settings.set behavior needs current-head runtime proof.
  • Proof head mismatch: body tested 1efcfa1; current head 3eaea54. Real behavior proof is strong for the earlier head but does not directly prove the latest MCP settings-error patch.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦐 gold shrimp
Patch quality: 🐚 platinum hermit
Result: blocked until stronger real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Proof guidance:

  • [P1] Needs stronger real behavior proof before merge: The PR body includes live MCP output, but it proves 1efcfa1 while current head is 3eaea54, so the latest MCP settings-error behavior needs current-head proof. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Mantis proof suggestion
Visible tray and Connection-page proof would materially help because the PR changes user-facing status surfaces beyond raw MCP output. A maintainer can ask Mantis to capture proof by posting this exact PR comment:

@openclaw-mantis visual task: verify tray, top pill, Connection page, Command Center, and MCP app.status show degraded or blocked node state instead of healthy connected when node startup is blocked.

Risk before merge

  • [P1] The latest head 3eaea54 changes app.settings.set MCP startup-failure behavior, but the PR body's real behavior proof still names 1efcfa1 as the tested commit.
  • [P1] The diff rewires node startup, reconnect, blocked-state publication, MCP startup reporting, and settings side effects, so a regression could affect live connection availability after merge.
  • [P1] The related open PR at Redesign WSL gateway setup and OpenClaw onboard UX #792 overlaps App.xaml.cs and AppRefactorContractTests.cs; GitHub reports this PR mergeable, but maintainers should still review the shared areas together.

Maintainer options:

  1. Refresh current-head proof and review availability paths (recommended)
    Before merge, require proof from current head 3eaea54 and a maintainer review of node startup, reconnect, blocked-state publication, and MCP settings lifecycle together.
  2. Accept MCP-first proof with explicit ownership
    Maintainers may intentionally accept the stale visual/UI proof gap if they are comfortable owning the latest MCP settings-error patch based on tests and code review.
  3. Pause behind overlapping connection UX work
    If the App.xaml.cs overlap with Redesign WSL gateway setup and OpenClaw onboard UX #792 is too hard to review together, pause this PR until the shared surface settles.

Next step before merge

  • [P1] Needs current-head real behavior proof and maintainer availability/overlap review; there is no narrow automated repair to queue.

Security
Cleared: No concrete security or supply-chain regression was found; the diff changes existing MCP app-tool outputs and safe settings lifecycle without broadening secrets, dependencies, workflows, or the settings allowlist.

Review details

Best possible solution:

Refresh the real behavior proof against current head 3eaea54, including the latest app.settings.set MCP startup-failure behavior, then land after maintainer review of the availability paths and App.xaml.cs overlap.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection of current main shows operator-connected plus node-idle can still derive a healthy-looking connected state without node intent truth; I did not run the Windows app in this read-only review.

Is this the best way to solve the issue?

Yes as an implementation direction. Making GatewayConnectionManager snapshots the lifecycle source of truth matches the repository architecture; the remaining blocker is current-head proof and maintainer review, not a different code approach.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against f89a88a6baf4.

Label changes

Label changes:

  • add rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦐 gold shrimp and patch quality is 🐚 platinum hermit.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body includes live MCP output, but it proves 1efcfa1 while current head is 3eaea54, so the latest MCP settings-error behavior needs current-head proof. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • remove rating: 🐚 platinum hermit: Current PR rating is rating: 🦐 gold shrimp, so this older rating label is no longer current.
  • remove status: 👀 ready for maintainer look: Current PR status label is status: 📣 needs proof.
  • remove proof: sufficient: Current real behavior proof status is insufficient, not sufficient.

Label justifications:

  • P2: This is a normal-priority connection correctness PR with user-visible tray and agent-visible MCP status impact.
  • merge-risk: 🚨 availability: Merging this PR changes node startup, reconnect, blocked-state publication, and MCP startup reporting paths that can affect connection availability behavior.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦐 gold shrimp and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body includes live MCP output, but it proves 1efcfa1 while current head is 3eaea54, so the latest MCP settings-error behavior needs current-head proof. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

What I checked:

Likely related people:

  • shanselman: Current-main blame for GatewayConnectionSnapshot/GatewayConnectionManager routes through 4166e0f, and the latest PR head commit adds the MCP settings-error policy patch. (role: recent area contributor and latest PR patch author; confidence: high; commits: 4166e0fd63f8, 3eaea54b7ca8; files: src/OpenClaw.Connection/GatewayConnectionSnapshot.cs, src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Tray.WinUI/App.CapabilityHandlers.cs)
  • bkudiess: Before this PR, the same contributor authored merged work on accurate node mode and MCP-only states in Surface accurate node mode and MCP-only states #827, which overlaps the status/proof surface. (role: prior merged related area contributor; confidence: high; commits: 9ac8cfb24f35; files: src/OpenClaw.Tray.WinUI/App.xaml.cs, src/OpenClaw.Tray.WinUI/Pages/ConnectionPagePlan.cs, tests/OpenClaw.Tray.Tests/NodeModeUiStateTests.cs)
  • vincentkoc: Merged PR fix(pairing): clarify reconnect and startup state #616 changed connection snapshots and startup-state observability around credential sources. (role: earlier connection snapshot contributor; confidence: medium; commits: 260fb90c6dc7; files: src/OpenClaw.Connection/ConnectionStateMachine.cs, src/OpenClaw.Connection/GatewayConnectionManager.cs, src/OpenClaw.Connection/GatewayConnectionSnapshot.cs)
  • calebeden: Recent merged PR Fix reset session dialog localization #919 touched App.xaml.cs, which this PR also changes, though that work was localization-adjacent rather than lifecycle-owned. (role: recent adjacent contributor; confidence: low; commits: bbd18d431766; files: src/OpenClaw.Tray.WinUI/App.xaml.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. labels Jul 1, 2026
Publish a blocked node snapshot if node connector startup throws before status events can transition the manager. Add a focused regression test for the throwing connector path.

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
@bkudiess

bkudiess commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jul 1, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Publish blocked node snapshots when previous node connector retirement fails. Include manager-owned overall/node fields in MCP app.status and app.menu so MCP clients can see degraded or blocked node truth instead of only legacy connected status.

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
@bkudiess

bkudiess commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jul 2, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed proof: sufficient Contributor real behavior proof is sufficient. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jul 2, 2026
@bkudiess

bkudiess commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

1 similar comment
@bkudiess

bkudiess commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jul 2, 2026
@bkudiess bkudiess marked this pull request as ready for review July 2, 2026 17:04
Keep PR 918 current with main and add a maintainer patch so app.settings.set surfaces MCP startup failures as tool errors while MCP startup notifications are shown or dismissed from a single runtime policy.

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels Jul 2, 2026
@clawsweeper clawsweeper Bot added status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jul 2, 2026
@shanselman shanselman merged commit 74604ae into openclaw:main Jul 2, 2026
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 availability 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages. P2 Normal priority bug or improvement with limited blast radius. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants