From 618b6f8093253ac8421198deb6aedb13f5259c0f Mon Sep 17 00:00:00 2001 From: Ashmit Biswas Date: Sat, 2 May 2026 22:28:36 +0530 Subject: [PATCH] fix(doctor,docs): F-3 mcp_orphans + F-4 OAuth-headless note MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wraps up the remaining backlog items from docs/test-findings.md. After this PR, every finding from the first end-to-end test run is either fixed or explicitly documented. F-3 — `mcp_orphans` doctor check ================================= Detects ``canopy-mcp`` processes whose parent died (PPID=1, reparented to launchd / init). Severity: info (orphans are idle), auto-fixable (SIGTERM → 2s grace → SIGKILL). - Skips current process + parent so doctor invoked from inside an MCP context can't report itself. - Skips processes whose parent is still alive — the original test- findings F-3 observation conflated "many running canopy-mcp" (normal: one per editor window) with "orphans" (rare: editor crashed). The check correctly fires only on the latter. - Falls back to empty list when ``ps`` is missing or fails — the check is non-fatal in any environment. The repair function is idempotent and silent on already-dead PIDs (``ProcessLookupError`` is the success path). Permission errors are surfaced as a partial-success result, not a crash. Doctor category count: 16 → 17. Updated CLAUDE.md, docs/mcp.md, docs/commands.md, docs/agents.md, and the using-canopy skill. F-4 — OAuth-headless documentation ==================================== Added a "Heads up — OAuth needs a TTY" callout in docs/mcp.md under the HTTP+OAuth section. Documents: - The first-call browser flow needs a TTY-attached process. - Headless invocations with missing/expired tokens hang waiting for a redirect that can't arrive. - For tests: exercise providers directly with mocked transports. Found while scripting the docs/test-plan.md run — Linear MCP from ``python -c "..."`` hung. Now in the docs as a known constraint. Tests: +7 (test_doctor: 5 for check_mcp_orphans, 2 for repair_mcp_orphans + helper unit). Suite: 644 → 651 passing. --- CLAUDE.md | 2 +- docs/agents.md | 2 +- docs/commands.md | 2 +- docs/mcp.md | 6 +- docs/test-findings.md | 13 +- src/canopy/actions/doctor.py | 115 +++++++++++++++- .../agent_setup/skills/using-canopy/SKILL.md | 2 +- tests/test_doctor.py | 125 ++++++++++++++++++ 8 files changed, 253 insertions(+), 14 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 1838e27..049ccc8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -105,7 +105,7 @@ For integration testing against real services, see `~/projects/canopy-test/` (me Grouped by topic. Run with `canopy-mcp` (entry point) or `python -m canopy.mcp.server`. ``` -Meta: version, doctor # version handshake + 16-category recovery primitive +Meta: version, doctor # version handshake + 17-category recovery primitive Workspace: workspace_status, workspace_context, workspace_config, workspace_reinit Feature: feature_create, feature_list, feature_status, feature_diff, feature_changes, feature_merge_readiness, feature_paths, feature_done, diff --git a/docs/agents.md b/docs/agents.md index 310ae53..a9f0705 100644 --- a/docs/agents.md +++ b/docs/agents.md @@ -160,7 +160,7 @@ When you see a `BlockerError`, read `fix_actions[0]` and decide whether to follo ### Recovery: when canopy itself looks broken -If a canopy call returns an unexpected error — `KeyError` from a state read, "feature not found" for a feature you just created, a path that should exist but doesn't — call `mcp__canopy__doctor` first. It diagnoses 16 categories of state-file drift and install-staleness, returning each issue with a `code`, `severity`, `expected`, `actual`, and `auto_fixable` flag. +If a canopy call returns an unexpected error — `KeyError` from a state read, "feature not found" for a feature you just created, a path that should exist but doesn't — call `mcp__canopy__doctor` first. It diagnoses 17 categories of state-file drift and install-staleness, returning each issue with a `code`, `severity`, `expected`, `actual`, and `auto_fixable` flag. Typical recovery flow: diff --git a/docs/commands.md b/docs/commands.md index 24a3e9f..4f64b8c 100644 --- a/docs/commands.md +++ b/docs/commands.md @@ -99,7 +99,7 @@ Write actions and execution. | Command | What it does | |---|---| -| `canopy doctor [-v] [--feature ]` | Diagnose 16 categories of state-file drift + install staleness. Reports `errors`/`warnings`/`info` with structured `code`, `expected`, `actual`, and per-issue `fix_action`. **Run this first** when any other canopy operation returns an unexpected error — most "something is off" cases trace to one of these categories. `--json` returns the full report shape `{issues, summary, fixed, skipped}`. | +| `canopy doctor [-v] [--feature ]` | Diagnose 17 categories of state-file drift + install staleness (incl. `mcp_orphans` from F-3). Reports `errors`/`warnings`/`info` with structured `code`, `expected`, `actual`, and per-issue `fix_action`. **Run this first** when any other canopy operation returns an unexpected error — most "something is off" cases trace to one of these categories. `--json` returns the full report shape `{issues, summary, fixed, skipped}`. | | `canopy doctor --fix` | Repair every `auto_fixable=true` issue. Examples: rewrite `heads.json` from live git, drop orphan worktree dirs via `git worktree remove --force`, reinstall a missing post-checkout hook, re-resolve broken `active_feature.json` paths, write a missing `.mcp.json` entry, reinstall the `using-canopy` skill. | | `canopy doctor --fix-category ` | Repair just one category (`heads`, `active_feature`, `worktrees`, `hooks`, `preflight`, `features`, `branches`, `cli`, `mcp`, `skill`, `vsix`). Implies `--fix`. | | `canopy doctor --clean-vsix` | Required gate for the destructive `vsix_duplicates` repair (removes all but the newest `singularityinc.canopy-*` install dir). Other repairs are unaffected. | diff --git a/docs/mcp.md b/docs/mcp.md index c14e657..74725ed 100644 --- a/docs/mcp.md +++ b/docs/mcp.md @@ -33,7 +33,7 @@ Grouped by topic. Every tool is alias-aware where it accepts a feature input. | Tool | Description | |---|---| | `version` | `{cli_version, mcp_version, schema_version}` for the doctor handshake. The extension calls this once at startup; the doctor uses it to flag CLI/MCP version drift. | -| `doctor` | Diagnose state-file integrity + install staleness; optionally repair. 16 categories, single tool. **The recovery entry point** — when any other call returns an unexpected error, agents should call `doctor` first to see whether state is corrupted. Returns `{issues, summary, fixed, skipped, ...}`. | +| `doctor` | Diagnose state-file integrity + install staleness; optionally repair. 17 categories, single tool. **The recovery entry point** — when any other call returns an unexpected error, agents should call `doctor` first to see whether state is corrupted. Returns `{issues, summary, fixed, skipped, ...}`. | #### Workspace @@ -148,7 +148,9 @@ For hosted servers like Linear's official MCP — no API key needed: } ``` -First call opens the browser to the OAuth authorize URL; canopy spins up a one-shot HTTP server on `localhost:33418` to capture the redirect. Tokens cache at `~/.canopy/mcp-tokens/.{client,tokens}.json`. Subsequent calls reuse the cached token silently. +First call opens the browser to the OAuth authorize URL; canopy spins up a one-shot HTTP server on `localhost:33418` to capture the redirect. Tokens cache at `~/.canopy/mcp-tokens/.{client,tokens}.json`. Subsequent calls reuse the cached token silently. Tokens auto-refresh as long as the refresh token is valid. + +> **Heads up — OAuth needs a TTY.** The first-call browser flow requires a TTY-attached process (Claude Code, your shell, the canopy CLI). If you invoke an MCP method *headlessly* — e.g. `python -c "from canopy.mcp.server import issue_list_my_issues; issue_list_my_issues()"` from a script — and the cached token is missing or expired, the OAuth handshake will hang waiting for a redirect that can never arrive (test-findings F-4). For tests, exercise providers directly via their classes with `call_tool` mocked at the module boundary, or rely on a pre-authorised session. For HTTP servers that use header auth instead of OAuth: diff --git a/docs/test-findings.md b/docs/test-findings.md index 72e1765..d89f893 100644 --- a/docs/test-findings.md +++ b/docs/test-findings.md @@ -47,13 +47,12 @@ Running any workspace-scoped command (e.g. `canopy setup-agent --check`, `canopy Lesson for future test runs: capture exit codes inline (`canopy ... ; EC=$?`), not after intervening commands. -### F-3: stale `canopy-mcp` processes accumulate +### F-3: stale `canopy-mcp` processes accumulate ~~(partially misread)~~ -`ps aux | grep canopy-mcp` shows 8+ stale processes from earlier in the week. Each one is a hung MCP server from a previous session that didn't get reaped. +Original observation was 8+ canopy-mcp processes in `ps`. **On closer inspection most are not orphans** — their PPIDs are still alive (live VSCode / Cursor / Claude Desktop windows, each with its own canopy-mcp). One MCP per editor window is normal. The genuine bug is the *true* orphan case: an editor crashes hard, its MCP child gets reparented to PID 1, and nothing reaps it. -- **Severity:** low — they're idle (memory ≤96 KB each); not actively harming. -- **Cause likely:** when an agent / IDE disconnects from MCP without a clean shutdown handshake, the stdio server hangs waiting for stdin. -- **Fix candidate:** doctor could add a `mcp_orphans` check that lists processes whose parent died, with `--clean` to reap them. Or the MCP server could exit on EOF rather than blocking. +- **Fix shipped:** `mcp_orphans` doctor check (severity: info, auto-fixable). Detects PPID=1 canopy-mcp processes, reaps with SIGTERM then SIGKILL after a 2s grace. The check explicitly skips multi-editor-window cases (parent alive → not an orphan, doesn't fire). +- **Lesson for future test runs:** "lots of processes" ≠ "orphans". Check PPID before claiming a bug. ### F-4: Linear MCP from a headless Python invocation hangs @@ -148,8 +147,8 @@ This is the recovery scenario M1 was built for. Detection works end-to-end again - [ ] **F-6 fix (P1)** — make `cmd_issue` render `Issue.to_dict()` directly so CLI + MCP agree. Drop the legacy raw-state shape. Update docs/commands.md. ~30 min. - [ ] **F-1 fix (P1)** — improve the "no canopy.toml" error message to explain canopy's mental model (workspace = non-git directory holding repos + canopy.toml). ~10 min. - [ ] **F-5 fix (P2)** — add `cmd_issues` for parity with the MCP `issue_list_my_issues`, OR defer to Phil's PR (which has it). -- [ ] **F-3 backlog** — doctor `mcp_orphans` check + reaper. -- [ ] **F-4 backlog** — Linear-headless smoke test using cached tokens; document the OAuth-required-in-tty constraint in docs/mcp.md. +- [x] **F-3** — doctor `mcp_orphans` check + reaper. Detects PPID=1 canopy-mcp processes; SIGTERM with 2s grace then SIGKILL. Auto-fixable (info severity). Original observation was partially misread — see updated F-3 above. +- [x] **F-4** — `docs/mcp.md` "Heads up — OAuth needs a TTY" callout under the HTTP+OAuth section. Documents the headless-hang failure mode + workaround. - [ ] **`canopy doctor --fix`** — on a follow-up session, intentionally clean canopy-test's real drift (heads.json + missing worktrees + vsix duplicates) to validate the repair side end-to-end. ## Test-data cleanup diff --git a/src/canopy/actions/doctor.py b/src/canopy/actions/doctor.py index f86e4a1..f3c29f9 100644 --- a/src/canopy/actions/doctor.py +++ b/src/canopy/actions/doctor.py @@ -28,7 +28,9 @@ import json import os import shutil +import signal import subprocess +import time from dataclasses import dataclass, field from datetime import datetime, timezone from pathlib import Path @@ -640,6 +642,68 @@ def check_skill_stale(workspace: Workspace) -> list[Issue]: _VSIX_PREFIX = "singularityinc.canopy-" +def check_mcp_orphans(workspace: Workspace) -> list[Issue]: + """Detect orphaned ``canopy-mcp`` processes (parent died, reparented to PID 1). + + Stale MCP servers accumulate when an editor / agent disconnects without + cleanly closing stdin — the server keeps running waiting for input + that never comes. Each orphan is idle but holds a venv-Python process + + a few MB of RSS. ``--fix`` reaps them with SIGTERM (then SIGKILL + after a short grace) so the process table stays clean. + + See test-findings F-3 (~8 stale processes accumulated over a week of + real use of canopy-test before this was added). + """ + pids = _list_orphan_canopy_mcp_pids() + if not pids: + return [] + return [Issue( + code="mcp_orphans", + severity="info", + what=f"{len(pids)} orphaned canopy-mcp process(es) found (PPID=1)", + expected="0 orphans (each MCP server should exit when its parent disconnects)", + actual=str(len(pids)), + fix_action="canopy doctor --fix reaps them (SIGTERM, then SIGKILL after 2s)", + auto_fixable=True, + details={"pids": pids}, + )] + + +def _list_orphan_canopy_mcp_pids() -> list[int]: + """Return PIDs of running ``canopy-mcp`` processes whose parent is PID 1. + + Uses ``ps`` (cross-platform on macOS + Linux) — no extra dependency. + Skips the current process and its ancestors so a doctor invocation + from inside an MCP context can't report itself. + """ + try: + out = subprocess.run( + ["ps", "-eo", "pid=,ppid=,command="], + capture_output=True, text=True, timeout=5, + ) + except (FileNotFoundError, subprocess.TimeoutExpired): + return [] + if out.returncode != 0: + return [] + self_pid = os.getpid() + self_ppid = os.getppid() + skip = {self_pid, self_ppid} + out_pids: list[int] = [] + for line in out.stdout.splitlines(): + try: + pid_s, ppid_s, command = line.lstrip().split(None, 2) + pid, ppid = int(pid_s), int(ppid_s) + except (ValueError, IndexError): + continue + if pid in skip or ppid in skip: + continue + if "canopy-mcp" not in command: + continue + if ppid == 1: + out_pids.append(pid) + return sorted(out_pids) + + def check_vsix_duplicates(workspace: Workspace) -> list[Issue]: """Multiple ``singularityinc.canopy-*`` dirs in ~/.vscode/extensions/.""" ext_dir = Path.home() / ".vscode" / "extensions" @@ -683,6 +747,7 @@ def check_vsix_duplicates(workspace: Workspace) -> list[Issue]: "mcp_missing_in_workspace": ("mcp", check_mcp_missing_in_workspace), "skill_missing": ("skill", check_skill_missing), "skill_stale": ("skill", check_skill_stale), + "mcp_orphans": ("mcp", check_mcp_orphans), "vsix_duplicates": ("vsix", check_vsix_duplicates), } @@ -959,6 +1024,53 @@ def repair_skill_stale(workspace: Workspace, issue: Issue) -> RepairResult: ) +def repair_mcp_orphans(workspace: Workspace, issue: Issue) -> RepairResult: + """SIGTERM listed orphan PIDs, then SIGKILL after a 2s grace. + + Skips PIDs we don't own (EPERM) silently — there's no graceful + recovery for a non-owned orphan and reporting one would just be noise. + """ + pids = list(issue.details.get("pids") or []) + if not pids: + return RepairResult(code=issue.code, success=True, + action_taken="no orphans to reap") + sent: list[int] = [] + failed: list[str] = [] + for pid in pids: + try: + os.kill(int(pid), signal.SIGTERM) + sent.append(int(pid)) + except ProcessLookupError: + continue # already gone — fine + except PermissionError: + failed.append(f"{pid}: permission denied") + continue + except Exception as e: # noqa: BLE001 + failed.append(f"{pid}: {e}") + continue + # Grace period for clean shutdown, then SIGKILL stragglers. + if sent: + time.sleep(2.0) + for pid in sent: + try: + os.kill(pid, 0) # probe — does the pid still exist? + except ProcessLookupError: + continue # gone, good + try: + os.kill(pid, signal.SIGKILL) + except ProcessLookupError: + continue + except Exception as e: # noqa: BLE001 + failed.append(f"{pid}: SIGKILL: {e}") + action = f"reaped {len(sent)} orphan(s)" + if failed: + return RepairResult( + code=issue.code, success=bool(sent), + action_taken=action, error="; ".join(failed), + ) + return RepairResult(code=issue.code, success=True, action_taken=action) + + def repair_vsix_duplicates(workspace: Workspace, issue: Issue) -> RepairResult: """Remove all but the newest matching extension dir.""" paths = [Path(p) for p in (issue.details.get("paths") or [])] @@ -997,6 +1109,7 @@ def repair_vsix_duplicates(workspace: Workspace, issue: Issue) -> RepairResult: "mcp_missing_in_workspace": repair_mcp_missing_in_workspace, "skill_missing": repair_skill_missing, "skill_stale": repair_skill_stale, + "mcp_orphans": repair_mcp_orphans, "vsix_duplicates": repair_vsix_duplicates, # cli_stale, mcp_stale, features_unknown_repo, branches_missing have # no auto-fix — repair returns surfaced advice via the issue's @@ -1050,7 +1163,7 @@ def doctor( issues = [i for i in issues if i.feature in (None, feature) or i.code in { "heads_stale", "hook_missing", "hook_chained_unsafe", "cli_stale", "mcp_stale", "mcp_missing_in_workspace", - "skill_missing", "skill_stale", "vsix_duplicates", + "mcp_orphans", "skill_missing", "skill_stale", "vsix_duplicates", }] all_issues.extend(issues) diff --git a/src/canopy/agent_setup/skills/using-canopy/SKILL.md b/src/canopy/agent_setup/skills/using-canopy/SKILL.md index 931f8b5..4fb3a28 100644 --- a/src/canopy/agent_setup/skills/using-canopy/SKILL.md +++ b/src/canopy/agent_setup/skills/using-canopy/SKILL.md @@ -80,7 +80,7 @@ When you see a `BlockerError`, the first step is to read `fix_actions[0]` and de ## Recovery: when canopy itself looks broken -If a canopy call returns an unexpected error — `KeyError` from a state read, a "feature not found" for one you just created, a worktree path that should exist but doesn't — call `mcp__canopy__doctor` first. It reports 16 categories of state-file drift + install-staleness, each with `code`, `severity`, `expected`/`actual`, and an `auto_fixable` flag. +If a canopy call returns an unexpected error — `KeyError` from a state read, a "feature not found" for one you just created, a worktree path that should exist but doesn't — call `mcp__canopy__doctor` first. It reports 17 categories of state-file drift + install-staleness, each with `code`, `severity`, `expected`/`actual`, and an `auto_fixable` flag. - `summary.errors == 0` → not a state problem; investigate the original error normally. - Errors present, mostly `auto_fixable: true` → call `doctor(fix=True)`; report `fixed`/`skipped` to the user. diff --git a/tests/test_doctor.py b/tests/test_doctor.py index 4cd5a65..1ed9269 100644 --- a/tests/test_doctor.py +++ b/tests/test_doctor.py @@ -28,6 +28,7 @@ check_hook_chained_unsafe, check_hook_missing, check_mcp_missing_in_workspace, + check_mcp_orphans, check_mcp_stale, check_preflight_stale, check_skill_missing, @@ -502,6 +503,130 @@ def test_check_vsix_duplicates_skipped_when_single(workspace_with_feature, tmp_p assert check_vsix_duplicates(ws) == [] +# ── mcp_orphans (F-3) ─────────────────────────────────────────────────── + + +def _ps_stub(rows: list[tuple[int, int, str]]): + """Build a subprocess.run-shaped result mimicking ``ps -eo pid=,ppid=,command=``.""" + from types import SimpleNamespace + text = "\n".join(f"{pid:>5} {ppid:>5} {cmd}" for pid, ppid, cmd in rows) + "\n" + return SimpleNamespace(returncode=0, stdout=text, stderr="") + + +def test_check_mcp_orphans_detects_ppid_1(workspace_with_feature, monkeypatch): + """A canopy-mcp process whose parent died (PPID=1) is an orphan.""" + rows = [ + (12345, 1, "/path/to/canopy-mcp"), # orphan + (23456, 999, "/path/to/canopy-mcp"), # parent alive — fine + (34567, 1, "/usr/bin/python3 -m something_unrelated"), # not canopy-mcp + ] + monkeypatch.setattr( + "canopy.actions.doctor.subprocess.run", + lambda *a, **kw: _ps_stub(rows), + ) + ws = _make_workspace(workspace_with_feature) + issues = check_mcp_orphans(ws) + assert len(issues) == 1 + assert issues[0].code == "mcp_orphans" + assert issues[0].severity == "info" + assert issues[0].auto_fixable is True + assert issues[0].details["pids"] == [12345] + + +def test_check_mcp_orphans_clean_when_no_orphans(workspace_with_feature, monkeypatch): + rows = [ + (12345, 999, "/path/to/canopy-mcp"), # parent alive + (23456, 1, "/usr/bin/something-else"), # PPID 1 but not canopy-mcp + ] + monkeypatch.setattr( + "canopy.actions.doctor.subprocess.run", + lambda *a, **kw: _ps_stub(rows), + ) + ws = _make_workspace(workspace_with_feature) + assert check_mcp_orphans(ws) == [] + + +def test_check_mcp_orphans_skips_self_and_parent(workspace_with_feature, monkeypatch): + """Doctor invoked from inside an MCP context shouldn't flag itself.""" + import os + self_pid = os.getpid() + self_ppid = os.getppid() + rows = [ + (self_pid, 1, "/path/to/canopy-mcp"), # this is us — skip + (self_ppid, 1, "/path/to/canopy-mcp"), # this is our parent — skip + (99999, 1, "/path/to/canopy-mcp"), # different orphan — keep + ] + monkeypatch.setattr( + "canopy.actions.doctor.subprocess.run", + lambda *a, **kw: _ps_stub(rows), + ) + ws = _make_workspace(workspace_with_feature) + issues = check_mcp_orphans(ws) + assert len(issues) == 1 + assert issues[0].details["pids"] == [99999] + + +def test_check_mcp_orphans_handles_ps_failure(workspace_with_feature, monkeypatch): + """If ps fails or isn't installed, the check returns no issues (not crashes). + + Monkeypatch the orphan-listing helper directly (not subprocess.run) + so we don't accidentally break git invocations from _make_workspace. + """ + ws = _make_workspace(workspace_with_feature) + monkeypatch.setattr( + "canopy.actions.doctor._list_orphan_canopy_mcp_pids", + lambda: [], # Helper internalises its own try/except over ps; we stub the result. + ) + assert check_mcp_orphans(ws) == [] + + +def test_list_orphan_helper_returns_empty_on_ps_missing(monkeypatch): + """Direct unit test of the helper: ps not on PATH → empty list, no exception.""" + from canopy.actions.doctor import _list_orphan_canopy_mcp_pids + def boom(*a, **kw): + raise FileNotFoundError("ps") + monkeypatch.setattr("canopy.actions.doctor.subprocess.run", boom) + assert _list_orphan_canopy_mcp_pids() == [] + + +def test_repair_mcp_orphans_sends_sigterm(workspace_with_feature, monkeypatch): + """The repair function calls os.kill on each listed PID.""" + from canopy.actions.doctor import repair_mcp_orphans, Issue + killed: list[tuple[int, int]] = [] + def fake_kill(pid, sig): + killed.append((pid, sig)) + # Pretend the process disappeared after SIGTERM so SIGKILL probe sees ProcessLookupError + if sig == 0: + raise ProcessLookupError + monkeypatch.setattr("canopy.actions.doctor.os.kill", fake_kill) + monkeypatch.setattr("canopy.actions.doctor.time.sleep", lambda s: None) + ws = _make_workspace(workspace_with_feature) + issue = Issue( + code="mcp_orphans", severity="info", + what="2 orphans", expected="0", actual="2", + fix_action="reap", auto_fixable=True, + details={"pids": [101, 202]}, + ) + result = repair_mcp_orphans(ws, issue) + assert result.success is True + sent_pids = [p for p, s in killed if s != 0] # ignore probe-with-sig=0 + assert sorted(sent_pids) == [101, 202] + + +def test_repair_mcp_orphans_noop_when_empty(workspace_with_feature): + from canopy.actions.doctor import repair_mcp_orphans, Issue + ws = _make_workspace(workspace_with_feature) + issue = Issue( + code="mcp_orphans", severity="info", + what="0 orphans", expected="0", actual="0", + fix_action="reap", auto_fixable=True, + details={"pids": []}, + ) + result = repair_mcp_orphans(ws, issue) + assert result.success is True + assert "no orphans" in result.action_taken + + # ── orchestrator ────────────────────────────────────────────────────────