Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ For integration testing against real services, see `~/projects/canopy-test/` (me
Grouped by topic. Run with `canopy-mcp` (entry point) or `python -m canopy.mcp.server`.

```
Meta: version, doctor # version handshake + 16-category recovery primitive
Meta: version, doctor # version handshake + 17-category recovery primitive
Workspace: workspace_status, workspace_context, workspace_config, workspace_reinit
Feature: feature_create, feature_list, feature_status, feature_diff,
feature_changes, feature_merge_readiness, feature_paths, feature_done,
Expand Down
2 changes: 1 addition & 1 deletion docs/agents.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,7 @@ When you see a `BlockerError`, read `fix_actions[0]` and decide whether to follo

### Recovery: when canopy itself looks broken

If a canopy call returns an unexpected error — `KeyError` from a state read, "feature not found" for a feature you just created, a path that should exist but doesn't — call `mcp__canopy__doctor` first. It diagnoses 16 categories of state-file drift and install-staleness, returning each issue with a `code`, `severity`, `expected`, `actual`, and `auto_fixable` flag.
If a canopy call returns an unexpected error — `KeyError` from a state read, "feature not found" for a feature you just created, a path that should exist but doesn't — call `mcp__canopy__doctor` first. It diagnoses 17 categories of state-file drift and install-staleness, returning each issue with a `code`, `severity`, `expected`, `actual`, and `auto_fixable` flag.

Typical recovery flow:

Expand Down
2 changes: 1 addition & 1 deletion docs/commands.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ Write actions and execution.

| Command | What it does |
|---|---|
| `canopy doctor [-v] [--feature <f>]` | Diagnose 16 categories of state-file drift + install staleness. Reports `errors`/`warnings`/`info` with structured `code`, `expected`, `actual`, and per-issue `fix_action`. **Run this first** when any other canopy operation returns an unexpected error — most "something is off" cases trace to one of these categories. `--json` returns the full report shape `{issues, summary, fixed, skipped}`. |
| `canopy doctor [-v] [--feature <f>]` | Diagnose 17 categories of state-file drift + install staleness (incl. `mcp_orphans` from F-3). Reports `errors`/`warnings`/`info` with structured `code`, `expected`, `actual`, and per-issue `fix_action`. **Run this first** when any other canopy operation returns an unexpected error — most "something is off" cases trace to one of these categories. `--json` returns the full report shape `{issues, summary, fixed, skipped}`. |
| `canopy doctor --fix` | Repair every `auto_fixable=true` issue. Examples: rewrite `heads.json` from live git, drop orphan worktree dirs via `git worktree remove --force`, reinstall a missing post-checkout hook, re-resolve broken `active_feature.json` paths, write a missing `.mcp.json` entry, reinstall the `using-canopy` skill. |
| `canopy doctor --fix-category <c>` | Repair just one category (`heads`, `active_feature`, `worktrees`, `hooks`, `preflight`, `features`, `branches`, `cli`, `mcp`, `skill`, `vsix`). Implies `--fix`. |
| `canopy doctor --clean-vsix` | Required gate for the destructive `vsix_duplicates` repair (removes all but the newest `singularityinc.canopy-*` install dir). Other repairs are unaffected. |
Expand Down
6 changes: 4 additions & 2 deletions docs/mcp.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ Grouped by topic. Every tool is alias-aware where it accepts a feature input.
| Tool | Description |
|---|---|
| `version` | `{cli_version, mcp_version, schema_version}` for the doctor handshake. The extension calls this once at startup; the doctor uses it to flag CLI/MCP version drift. |
| `doctor` | Diagnose state-file integrity + install staleness; optionally repair. 16 categories, single tool. **The recovery entry point** — when any other call returns an unexpected error, agents should call `doctor` first to see whether state is corrupted. Returns `{issues, summary, fixed, skipped, ...}`. |
| `doctor` | Diagnose state-file integrity + install staleness; optionally repair. 17 categories, single tool. **The recovery entry point** — when any other call returns an unexpected error, agents should call `doctor` first to see whether state is corrupted. Returns `{issues, summary, fixed, skipped, ...}`. |

#### Workspace

Expand Down Expand Up @@ -148,7 +148,9 @@ For hosted servers like Linear's official MCP — no API key needed:
}
```

First call opens the browser to the OAuth authorize URL; canopy spins up a one-shot HTTP server on `localhost:33418` to capture the redirect. Tokens cache at `~/.canopy/mcp-tokens/<server>.{client,tokens}.json`. Subsequent calls reuse the cached token silently.
First call opens the browser to the OAuth authorize URL; canopy spins up a one-shot HTTP server on `localhost:33418` to capture the redirect. Tokens cache at `~/.canopy/mcp-tokens/<server>.{client,tokens}.json`. Subsequent calls reuse the cached token silently. Tokens auto-refresh as long as the refresh token is valid.

> **Heads up — OAuth needs a TTY.** The first-call browser flow requires a TTY-attached process (Claude Code, your shell, the canopy CLI). If you invoke an MCP method *headlessly* — e.g. `python -c "from canopy.mcp.server import issue_list_my_issues; issue_list_my_issues()"` from a script — and the cached token is missing or expired, the OAuth handshake will hang waiting for a redirect that can never arrive (test-findings F-4). For tests, exercise providers directly via their classes with `call_tool` mocked at the module boundary, or rely on a pre-authorised session.

For HTTP servers that use header auth instead of OAuth:

Expand Down
13 changes: 6 additions & 7 deletions docs/test-findings.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,13 +47,12 @@ Running any workspace-scoped command (e.g. `canopy setup-agent --check`, `canopy

Lesson for future test runs: capture exit codes inline (`canopy ... ; EC=$?`), not after intervening commands.

### F-3: stale `canopy-mcp` processes accumulate
### F-3: stale `canopy-mcp` processes accumulate ~~(partially misread)~~

`ps aux | grep canopy-mcp` shows 8+ stale processes from earlier in the week. Each one is a hung MCP server from a previous session that didn't get reaped.
Original observation was 8+ canopy-mcp processes in `ps`. **On closer inspection most are not orphans** — their PPIDs are still alive (live VSCode / Cursor / Claude Desktop windows, each with its own canopy-mcp). One MCP per editor window is normal. The genuine bug is the *true* orphan case: an editor crashes hard, its MCP child gets reparented to PID 1, and nothing reaps it.

- **Severity:** low — they're idle (memory ≤96 KB each); not actively harming.
- **Cause likely:** when an agent / IDE disconnects from MCP without a clean shutdown handshake, the stdio server hangs waiting for stdin.
- **Fix candidate:** doctor could add a `mcp_orphans` check that lists processes whose parent died, with `--clean` to reap them. Or the MCP server could exit on EOF rather than blocking.
- **Fix shipped:** `mcp_orphans` doctor check (severity: info, auto-fixable). Detects PPID=1 canopy-mcp processes, reaps with SIGTERM then SIGKILL after a 2s grace. The check explicitly skips multi-editor-window cases (parent alive → not an orphan, doesn't fire).
- **Lesson for future test runs:** "lots of processes" ≠ "orphans". Check PPID before claiming a bug.

### F-4: Linear MCP from a headless Python invocation hangs

Expand Down Expand Up @@ -148,8 +147,8 @@ This is the recovery scenario M1 was built for. Detection works end-to-end again
- [ ] **F-6 fix (P1)** — make `cmd_issue` render `Issue.to_dict()` directly so CLI + MCP agree. Drop the legacy raw-state shape. Update docs/commands.md. ~30 min.
- [ ] **F-1 fix (P1)** — improve the "no canopy.toml" error message to explain canopy's mental model (workspace = non-git directory holding repos + canopy.toml). ~10 min.
- [ ] **F-5 fix (P2)** — add `cmd_issues` for parity with the MCP `issue_list_my_issues`, OR defer to Phil's PR (which has it).
- [ ] **F-3 backlog** — doctor `mcp_orphans` check + reaper.
- [ ] **F-4 backlog** — Linear-headless smoke test using cached tokens; document the OAuth-required-in-tty constraint in docs/mcp.md.
- [x] **F-3** — doctor `mcp_orphans` check + reaper. Detects PPID=1 canopy-mcp processes; SIGTERM with 2s grace then SIGKILL. Auto-fixable (info severity). Original observation was partially misread — see updated F-3 above.
- [x] **F-4** — `docs/mcp.md` "Heads up — OAuth needs a TTY" callout under the HTTP+OAuth section. Documents the headless-hang failure mode + workaround.
- [ ] **`canopy doctor --fix`** — on a follow-up session, intentionally clean canopy-test's real drift (heads.json + missing worktrees + vsix duplicates) to validate the repair side end-to-end.

## Test-data cleanup
Expand Down
115 changes: 114 additions & 1 deletion src/canopy/actions/doctor.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,9 @@
import json
import os
import shutil
import signal
import subprocess
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
Expand Down Expand Up @@ -640,6 +642,68 @@ def check_skill_stale(workspace: Workspace) -> list[Issue]:
_VSIX_PREFIX = "singularityinc.canopy-"


def check_mcp_orphans(workspace: Workspace) -> list[Issue]:
"""Detect orphaned ``canopy-mcp`` processes (parent died, reparented to PID 1).

Stale MCP servers accumulate when an editor / agent disconnects without
cleanly closing stdin — the server keeps running waiting for input
that never comes. Each orphan is idle but holds a venv-Python process
+ a few MB of RSS. ``--fix`` reaps them with SIGTERM (then SIGKILL
after a short grace) so the process table stays clean.

See test-findings F-3 (~8 stale processes accumulated over a week of
real use of canopy-test before this was added).
"""
pids = _list_orphan_canopy_mcp_pids()
if not pids:
return []
return [Issue(
code="mcp_orphans",
severity="info",
what=f"{len(pids)} orphaned canopy-mcp process(es) found (PPID=1)",
expected="0 orphans (each MCP server should exit when its parent disconnects)",
actual=str(len(pids)),
fix_action="canopy doctor --fix reaps them (SIGTERM, then SIGKILL after 2s)",
auto_fixable=True,
details={"pids": pids},
)]


def _list_orphan_canopy_mcp_pids() -> list[int]:
"""Return PIDs of running ``canopy-mcp`` processes whose parent is PID 1.

Uses ``ps`` (cross-platform on macOS + Linux) — no extra dependency.
Skips the current process and its ancestors so a doctor invocation
from inside an MCP context can't report itself.
"""
try:
out = subprocess.run(
["ps", "-eo", "pid=,ppid=,command="],
capture_output=True, text=True, timeout=5,
)
except (FileNotFoundError, subprocess.TimeoutExpired):
return []
if out.returncode != 0:
return []
self_pid = os.getpid()
self_ppid = os.getppid()
skip = {self_pid, self_ppid}
out_pids: list[int] = []
for line in out.stdout.splitlines():
try:
pid_s, ppid_s, command = line.lstrip().split(None, 2)
pid, ppid = int(pid_s), int(ppid_s)
except (ValueError, IndexError):
continue
if pid in skip or ppid in skip:
continue
if "canopy-mcp" not in command:
continue
if ppid == 1:
out_pids.append(pid)
return sorted(out_pids)


def check_vsix_duplicates(workspace: Workspace) -> list[Issue]:
"""Multiple ``singularityinc.canopy-*`` dirs in ~/.vscode/extensions/."""
ext_dir = Path.home() / ".vscode" / "extensions"
Expand Down Expand Up @@ -683,6 +747,7 @@ def check_vsix_duplicates(workspace: Workspace) -> list[Issue]:
"mcp_missing_in_workspace": ("mcp", check_mcp_missing_in_workspace),
"skill_missing": ("skill", check_skill_missing),
"skill_stale": ("skill", check_skill_stale),
"mcp_orphans": ("mcp", check_mcp_orphans),
"vsix_duplicates": ("vsix", check_vsix_duplicates),
}

Expand Down Expand Up @@ -959,6 +1024,53 @@ def repair_skill_stale(workspace: Workspace, issue: Issue) -> RepairResult:
)


def repair_mcp_orphans(workspace: Workspace, issue: Issue) -> RepairResult:
"""SIGTERM listed orphan PIDs, then SIGKILL after a 2s grace.

Skips PIDs we don't own (EPERM) silently — there's no graceful
recovery for a non-owned orphan and reporting one would just be noise.
"""
pids = list(issue.details.get("pids") or [])
if not pids:
return RepairResult(code=issue.code, success=True,
action_taken="no orphans to reap")
sent: list[int] = []
failed: list[str] = []
for pid in pids:
try:
os.kill(int(pid), signal.SIGTERM)
sent.append(int(pid))
except ProcessLookupError:
continue # already gone — fine
except PermissionError:
failed.append(f"{pid}: permission denied")
continue
except Exception as e: # noqa: BLE001
failed.append(f"{pid}: {e}")
continue
# Grace period for clean shutdown, then SIGKILL stragglers.
if sent:
time.sleep(2.0)
for pid in sent:
try:
os.kill(pid, 0) # probe — does the pid still exist?
except ProcessLookupError:
continue # gone, good
try:
os.kill(pid, signal.SIGKILL)
except ProcessLookupError:
continue
except Exception as e: # noqa: BLE001
failed.append(f"{pid}: SIGKILL: {e}")
action = f"reaped {len(sent)} orphan(s)"
if failed:
return RepairResult(
code=issue.code, success=bool(sent),
action_taken=action, error="; ".join(failed),
)
return RepairResult(code=issue.code, success=True, action_taken=action)


def repair_vsix_duplicates(workspace: Workspace, issue: Issue) -> RepairResult:
"""Remove all but the newest matching extension dir."""
paths = [Path(p) for p in (issue.details.get("paths") or [])]
Expand Down Expand Up @@ -997,6 +1109,7 @@ def repair_vsix_duplicates(workspace: Workspace, issue: Issue) -> RepairResult:
"mcp_missing_in_workspace": repair_mcp_missing_in_workspace,
"skill_missing": repair_skill_missing,
"skill_stale": repair_skill_stale,
"mcp_orphans": repair_mcp_orphans,
"vsix_duplicates": repair_vsix_duplicates,
# cli_stale, mcp_stale, features_unknown_repo, branches_missing have
# no auto-fix — repair returns surfaced advice via the issue's
Expand Down Expand Up @@ -1050,7 +1163,7 @@ def doctor(
issues = [i for i in issues if i.feature in (None, feature) or i.code in {
"heads_stale", "hook_missing", "hook_chained_unsafe",
"cli_stale", "mcp_stale", "mcp_missing_in_workspace",
"skill_missing", "skill_stale", "vsix_duplicates",
"mcp_orphans", "skill_missing", "skill_stale", "vsix_duplicates",
}]
all_issues.extend(issues)

Expand Down
2 changes: 1 addition & 1 deletion src/canopy/agent_setup/skills/using-canopy/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ When you see a `BlockerError`, the first step is to read `fix_actions[0]` and de

## Recovery: when canopy itself looks broken

If a canopy call returns an unexpected error — `KeyError` from a state read, a "feature not found" for one you just created, a worktree path that should exist but doesn't — call `mcp__canopy__doctor` first. It reports 16 categories of state-file drift + install-staleness, each with `code`, `severity`, `expected`/`actual`, and an `auto_fixable` flag.
If a canopy call returns an unexpected error — `KeyError` from a state read, a "feature not found" for one you just created, a worktree path that should exist but doesn't — call `mcp__canopy__doctor` first. It reports 17 categories of state-file drift + install-staleness, each with `code`, `severity`, `expected`/`actual`, and an `auto_fixable` flag.

- `summary.errors == 0` → not a state problem; investigate the original error normally.
- Errors present, mostly `auto_fixable: true` → call `doctor(fix=True)`; report `fixed`/`skipped` to the user.
Expand Down
Loading
Loading