36 changes: 36 additions & 0 deletions .springdrift_example/config.toml
@@ -144,6 +144,42 @@ summary_schedule = "weekly"
# max_turns = 10
# max_errors = 3

# ── Real-coder (OpenCode-backed coding agent — Phase 2) ─────────────────────
# When configured, the coder agent dispatches real coding work into a
# sandboxed OpenCode session running in the springdrift-coder image.
# Build the image with scripts/build-coder-image.sh and verify with
# scripts/smoke-coder-image.sh before relying on it.
[coder]
# image = "springdrift-coder:latest" # Pinned image tag (default: latest)
# project_root = "/Users/you/Repos/foo" # Host path bind-mounted r/w into the container (REQUIRED). MUST NOT contain .springdrift/. /tmp does not work on macOS podman.
# auth_config_path = "~/.config/opencode" # Host path bind-mounted ro at /root/.config/opencode (default: ~/.config/opencode)
# provider_id = "anthropic" # Provider OpenCode dispatches to (default: "anthropic")
# model_id = "claude-haiku-4-5-20251001" # Model OpenCode uses. Set to a model your API key has access to AND is in OpenCode's bundled catalog.

# ── Pool ──
# warm_pool_size = 1 # Containers pre-warmed at boot (default: 1)
# max_concurrent_sessions = 4 # Hard cap on parallel dispatches (default: 4)
# container_idle_ttl_ms = 3600000 # Janitor reaps idle containers older than this (default: 1h)
# container_name_prefix = "springdrift-coder" # Container name prefix (default: "springdrift-coder")
# slot_id_base = 100 # Slot id allocator base; counter increments monotonically (default: 100)

# ── Container resource limits — kernel-enforced ──
# container_memory_mb = 2048 # Memory cap per container (default: 2048)
# container_cpus = "2" # CPU quota per container (default: "2")
# container_pids_limit = 256 # Process count cap; fork-bomb safety (default: 256)

# ── Budget — defaults applied when dispatcher doesn't specify; ceilings enforce a hard wall ──
[coder.budget]
# default_max_tokens_per_task = 200000 # Default per-task token cap (default: 200000)
# default_max_cost_per_task_usd = 5.0 # Default per-task USD cap (default: 5.0)
# default_max_minutes_per_task = 10 # Default per-task wall-clock cap (default: 10 min)
# default_max_turns_per_task = 20 # Default per-task turn count cap (default: 20)
# ceiling_max_tokens_per_task = 1000000 # Hard wall — agent can't request beyond this (default: 1000000)
# ceiling_max_cost_per_task_usd = 50.0 # Hard wall (default: 50.0)
# ceiling_max_minutes_per_task = 60 # Hard wall (default: 60 min)
# ceiling_max_turns_per_task = 100 # Hard wall (default: 100)
# max_cost_per_hour_usd = 100.0 # Aggregate cap across all coder sessions per rolling hour (default: 100.0)

[agents.writer]
# max_tokens = 4096
# max_turns = 5
2 changes: 1 addition & 1 deletion .springdrift_example/skills/HOW_TO.md
@@ -204,7 +204,7 @@ When verification fails, use `recall_cases` to check if this is a known pattern.
- **agent_researcher** — web research and fact gathering (web + artifact + builtin tools)
- **agent_planner** — pure plan reasoning: task decomposition, steps, dependencies, risks (no tools, XML output, max 5 turns)
- **agent_project_manager** — full work management: endeavours, phases, sessions, blockers, forecaster config, task/endeavour editing and deletion (24 planner tools incl. complete_task_step, max 8 turns)
- **agent_coder** — code writing, debugging, refactoring (builtin tools, max 10 turns)
- **agent_coder** — orchestrates code edits via the OpenCode-backed sandbox. Has `project_status`/`read`/`grep` (host-side), `dispatch_coder`/`cancel_coder_session`/`list_coder_sessions` (delegate to OpenCode), plus Group A planner integration (`get_task_detail`, `complete_task_step`, `flag_risk`, `report_blocker`). Max 20 turns. See the `coder-delegation` skill for the framing/dispatch/verify loop. Only registered when `[coder]` is fully configured (image, project_root, model_id, ANTHROPIC_API_KEY); without that, the cog/PM still has `dispatch_coder` directly but no specialist coder agent.
- **agent_writer** — long-form writing, structured reports, drafts via document library, PDF export via pandoc + tectonic (knowledge draft tools + `export_pdf` + artifacts + builtin, max 5 turns). See the `writer-pdf-export` skill for when to call `export_pdf` and what its failure modes mean.
- **agent_observer** — diagnostic memory examination, CBR curation (18 diagnostic tools, max 6 turns)
- **agent_comms** — email send/receive via AgentMail (comms tools, max 6 turns, requires `comms_enabled`)
126 changes: 126 additions & 0 deletions .springdrift_example/skills/coder-delegation/SKILL.md
@@ -0,0 +1,126 @@
---
name: coder-delegation
description: How the Coder agent drives an OpenCode session via dispatch_coder — the engineering loop, when to dispatch, and how to verify the result is real (vs. claimed).
agents: coder, cognitive
---

## Your shape

You are an orchestrator, not an executor. The Project Manager dispatches a task to you. You drive a separate coding agent (OpenCode) running in a sandboxed container with the project bind-mounted in. You give it a brief; it edits files and commits autonomously. You verify the work afterwards from the host.

This shape exists because the in-container model is fast and capable but not trustworthy on its own. It will say "Done!" when tests are still failing. Your job is to be the layer that catches that.

## The four phases

### 1. Frame — before you dispatch

```
get_task_detail(task_id) → see the steps Planner produced
project_status → branch, dirty count, untracked
project_grep(pattern) → locate symbols/files involved
project_read(path) → see current state of files in scope
```

Decide what files matter. Write a short brief: *"task X, files A/B/C, success means tests pass and Y behaviour observable"*. Don't dispatch with hand-wavy "do the thing" briefs — they produce hand-wavy results.

### 2. Dispatch — hand off to OpenCode

```
dispatch_coder(brief="<your full brief>")
```

The brief is the entire instruction the in-container model receives. Include:
- What the task is, in plain prose
- Which files to look at
- What the success criteria are (tests passing, specific behaviour, etc.)
- Constraints (don't touch X, follow style Y)

Don't tell it HOW — tell it WHAT. The in-container model has its own tools (read, edit, bash, grep, gh, git) and its own iteration loop. It will plan, edit, run tests, refine, and commit on its own. `dispatch_coder` returns when that whole session is done.

You can also pass per-task budget overrides if the default isn't enough:
```
dispatch_coder(brief="...", max_tokens=400000, max_minutes=20)
```
The manager clamps each value against the operator's ceiling and reports the clamp in the response.
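The clamping rule can be sketched as follows. This is a minimal illustration, not the manager's actual code: `clamp_budget` and `Clamp` are hypothetical names, `max_turns` as an override is an assumption, and the default and ceiling values are copied from the example `[coder.budget]` config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clamp:
    """One budget field that was lowered to the operator ceiling."""
    field: str
    requested: float
    applied: float


def clamp_budget(requested: dict, defaults: dict, ceilings: dict):
    """Return (effective budget, clamps applied).

    Missing fields fall back to the default; anything above the
    ceiling is lowered to the ceiling and recorded as a clamp.
    """
    effective = {}
    clamps = []
    for field, default in defaults.items():
        value = requested.get(field, default)
        ceiling = ceilings[field]
        if value > ceiling:
            clamps.append(Clamp(field, value, ceiling))
            value = ceiling
        effective[field] = value
    return effective, clamps


# Values mirror the example config; the dict keys are assumed override names.
DEFAULTS = {"max_tokens": 200_000, "max_cost_usd": 5.0, "max_minutes": 10, "max_turns": 20}
CEILINGS = {"max_tokens": 1_000_000, "max_cost_usd": 50.0, "max_minutes": 60, "max_turns": 100}

budget, clamps = clamp_budget({"max_tokens": 400_000, "max_minutes": 90}, DEFAULTS, CEILINGS)
```

Here the token request is under the ceiling and passes through, while the 90-minute request is lowered to 60 and reported as a clamp.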

While `dispatch_coder` is in flight, the cog loop stays responsive — you can call `list_coder_sessions` to see what's running, and `cancel_coder_session(session_id)` if you decide to abort.

### 3. Inspect — verify on disk

This is where you earn your keep.

```
project_status → what files actually changed
project_read(path) → spot-check the change matches the brief
```

The dispatch response includes the model's natural-language summary, stop_reason, tokens, cost, and any budget clamps applied. That summary is the model's CLAIM. The on-disk state is the EVIDENCE. Treat them as different things.

If the change isn't what you wanted, dispatch again with a tighter follow-up brief. If the model bailed (stop_reason: max_tokens / cancelled / refusal), surface that — don't pretend the work landed.

### 4. Land or escalate

**Happy path** — the change matches the brief and the project still looks healthy:
```
complete_task_step(task_id, step_index)
```
Then return your summary using the Changed/Verified/Unverified structure.

**Blocked** — repeated dispatches didn't fix it, or the model couldn't proceed:
```
report_blocker(endeavour_id, description, requires_human=False)
```
Then return — PM will see the blocker.

**Risk materialised** — a planned risk actually happened:
```
flag_risk(task_id, risk_id, evidence)
```
Then continue or escalate depending on whether you can recover.

## Multi-turn dispatch

For genuinely large changes, dispatch multiple times. Each call is one OpenCode session — internally the model can do many edits and tool calls, but Springdrift sees one round-trip per dispatch. The natural breakpoints:

- **First dispatch** — the bulk of the work. Give it the full brief, the files in scope, the success criteria.
- **Follow-up dispatch** — only if your `project_status` / `project_read` inspection reveals something the first session missed or got wrong. Keep the brief tight: "the test in foo_test.gleam still fails with X; fix the cause in bar.gleam".

Don't dispatch in tight loops "in case the model needs another nudge". One dispatch should be one focused unit of work. If you find yourself reaching for a fourth or fifth dispatch on the same task, the brief was wrong — `report_blocker` instead.
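One way to picture the bounded loop is the sketch below. It is an illustration under assumptions, not Springdrift code: `run_task`, the callback parameters, and the cap of three dispatches are all hypothetical (the text only says a fourth or fifth dispatch means the brief was wrong).

```python
def run_task(brief, dispatch, verify, report_blocker, max_dispatches=3):
    """Dispatch, inspect, and follow up at most `max_dispatches` times.

    `dispatch(brief)` stands in for dispatch_coder, `verify(result)` for the
    host-side inspection (returns None when the on-disk state matches the
    brief, otherwise a short problem description), and `report_blocker(text)`
    for the escalation tool.
    """
    current_brief = brief
    previous_problem = None
    for attempt in range(1, max_dispatches + 1):
        result = dispatch(current_brief)
        problem = verify(result)
        if problem is None:
            return f"landed after {attempt} dispatch(es)"
        if problem == previous_problem:
            # Same failure twice: the model does not understand the cause.
            report_blocker(f"same failure twice: {problem}")
            return "blocked"
        previous_problem = problem
        # Keep the follow-up brief tight: only the observed problem.
        current_brief = f"Follow-up: {problem}"
    report_blocker(f"unresolved after {max_dispatches} dispatches: {previous_problem}")
    return "blocked"
```

The follow-up brief carries only the observed problem, and a repeated failure escalates immediately instead of spending another dispatch.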

## Reading the dispatch response

| Field | What it tells you |
|---|---|
| `stop_reason: end_turn` | Model thinks it's done. Verify on disk. |
| `stop_reason: max_tokens` | Hit the token budget mid-work. Likely incomplete. |
| `stop_reason: cancelled` | You or a budget cap cancelled the session. Expect a partial result. |
| `stop_reason: refusal` | Model refused for safety reasons. Read response_text for why. |
| `tokens` / `cost_usd` | Resource consumption. Compare to the budget. |
| `budget clamps applied` | Your max_* request was lowered. The session ran with the clamped value. |
| `response_text` | Model's natural-language summary. Optimistic — verify on disk. |
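The table above can be read as a small decision map. The sketch below assumes hypothetical action labels; only the four `stop_reason` values come from the table, and an unrecognised value is surfaced rather than guessed at.

```python
# Hypothetical action labels; the stop_reason keys are the ones in the table.
NEXT_ACTION = {
    "end_turn": "inspect_on_disk",            # model claims done: verify the claim
    "max_tokens": "inspect_then_follow_up",   # likely incomplete work
    "cancelled": "surface_partial_result",    # you or a budget cap stopped it
    "refusal": "surface_response_text",       # response_text explains why
}


def next_action(stop_reason: str) -> str:
    """Map a dispatch stop_reason to the orchestrator's next step."""
    return NEXT_ACTION.get(stop_reason, "surface_unknown_stop_reason")
```

Note that no branch treats the work as landed without inspection; even `end_turn` only leads to verification.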

## Required response structure

End every dispatch reply with:

```
Changed: <what landed on disk, by file or commit, observed via project_status/project_read>
Verified: <what you confirmed, citing which tool output confirms it>
Unverified: <what dispatch_coder claimed but you did not confirm, with reason>
```

If `Unverified` is non-empty, you have NOT finished. Say so.
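A minimal sketch of assembling that reply, assuming a hypothetical `format_reply` helper. The trailing `Status:` line is an illustrative way to "say so" and is not a field the skill defines.

```python
def format_reply(changed, verified, unverified):
    """Build the Changed/Verified/Unverified block from lists of findings."""
    def section(label, items):
        return f"{label}: " + ("; ".join(items) if items else "none")

    lines = [
        section("Changed", changed),
        section("Verified", verified),
        section("Unverified", unverified),
    ]
    if unverified:
        # Anything unverified means the task is not finished; state it plainly.
        lines.append("Status: NOT finished")
    return "\n".join(lines)


reply = format_reply(
    changed=["src/bar.gleam"],
    verified=["project_read shows the fix in src/bar.gleam"],
    unverified=["tests pass (claimed in response_text, not confirmed)"],
)
```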

## Common failure modes

- **The model went off-topic.** It decided to refactor unrelated code. The first dispatch landed too much. Surface what's there in your reply, and ask the operator (or your delegator) whether to keep or revert before the next dispatch.
- **Same failure pattern across two dispatches.** The model doesn't understand the cause. Don't dispatch a third time hoping for luck. `report_blocker` with the failure output and return.
- **`real-coder mode not configured`** from `dispatch_coder`. Operator hasn't set up `[coder]` or `ANTHROPIC_API_KEY`. Surface verbatim — it's a config issue, not yours to fix.
- **Cost budget exceeded mid-session.** The dispatch returned with `stop_reason: cancelled` and a partial result. Either raise `max_cost_usd` on the next call (within the operator ceiling) or split the work into smaller dispatches.

## What you do NOT do

- You do not edit project files directly. Only the OpenCode session does.
- You do not run host-side test/build/format commands. The in-container model runs whatever it needs (it has bash + the project tree). Your verification is structural: `project_status` shows what's dirty, `project_read` shows what's in a file.
- You do not push commits anywhere. The coder commits locally; pushing is the operator's call.
- You do not decide whether the task is "done enough" — the success criteria came from the Planner. If they're met, you're done. If they're not, iterate or escalate.
37 changes: 36 additions & 1 deletion CLAUDE.md
@@ -233,6 +233,13 @@ src/
│ ├── podman_ffi.gleam FFI declarations for subprocess execution (run_cmd, which)
│ └── diagnostics.gleam Startup checks: podman version, machine status, image pull
├── coder/ Real-coder execution layer (OpenCode-backed via ACP)
│ ├── types.gleam CoderConfig, CoderError, TaskBudget, BudgetClamp, DispatchResult, SessionSummary, format_error/1
│ ├── circuit.gleam Pure token/cost circuit breaker — per-task + rolling-hour caps
│ ├── acp.gleam JSON-RPC-over-stdio Agent Client Protocol bindings: open/initialize/session_new/session_prompt(_async)/session_cancel/close + AcpEvent stream + pure decoders pinned against probed shapes
│ ├── manager.gleam OTP actor: warm container pool, allocate per dispatch, spawn driver, route cancels, per-task budget enforcement, three-stage kill chain. dispatch_task/3 is the public entry; cog-loop async-wraps it via the worker pattern in tools/coder_dispatch.gleam
│ └── ingest.gleam Coder session → CbrCase (CodePattern, tools_used = ACP tool titles) + raw-JSON archive under .springdrift/memory/coder/sessions/
├── tui.gleam Alternate-screen TUI; Chat + Log + Narrative tabs
├── web/ Web chat GUI + admin dashboard
@@ -262,6 +269,23 @@ gleam format # Format all source files
gleam build # Compile only
```

### Coder sandbox image scripts (Phase 2 onwards)

The OpenCode-backed coder agent runs in a dedicated sandbox image
(`springdrift-coder:<version>`). The image and its tooling are
operator-controlled — Springdrift never auto-builds or auto-pulls.

```sh
scripts/build-coder-image.sh # Build the pinned image
scripts/smoke-coder-image.sh # Verify the pinned opencode version starts headless and serves /app
scripts/discover-coder-endpoints.sh # Probe the running container's endpoint surface (drives client design + version-bump diffs)
scripts/vendor-opencode-spec.sh # Save the OpenAPI spec under docs/vendor/opencode-<version>-openapi.json
scripts/e2e-coder.sh # End-to-end Phase 2 test: real container + real Anthropic + real "say pong" round-trip (~$0.001/run)
```

Pin-and-lag policy: bump `OPENCODE_VERSION` in `Containerfile.coder`,
rebuild, run smoke. Only after smoke passes is the new pin usable.

## Code quality requirements

### Tests must pass
@@ -456,6 +480,15 @@ All fields are `Option` types. Defaults are applied in `springdrift.gleam`.
| `sandbox_port_stride` | — | 100 | Host port stride per slot |
| `sandbox_ports_per_slot` | — | 5 | Ports forwarded per slot |
| `sandbox_auto_machine` | — | True | Auto-start podman machine on macOS |
| `coder_image` | — | None | Image tag for the OpenCode-backed coder slot. Set to `springdrift-coder:<version>` after `scripts/build-coder-image.sh`. |
| `coder_project_root` | — | None | Host path bind-mounted as the project root inside the coder slot. Required for real-coder use. Cannot contain `.springdrift/`. |
| `coder_session_timeout_ms` | — | 600000 | Hard ceiling on a single coding task's wall time (10 min). |
| `coder_max_tokens_per_task` | — | 200000 | Token budget per coding task; the circuit breaker kills the session above this. Currently inert on the synchronous path; becomes live with the Phase 4 SSE wiring. |
| `coder_max_cost_per_task_usd` | — | 5.0 | Cost budget (USD) per coding task. |
| `coder_max_cost_per_hour_usd` | — | 20.0 | Aggregate cost cap (USD) across all coder tasks per rolling hour. |
| `coder_cost_poll_interval_ms` | — | 5000 | How often supervisor polls session usage to feed the circuit breaker. |
| `coder_provider_id` | — | None | Provider id passed to OpenCode (e.g. "anthropic"). |
| `coder_model_id` | — | None | Model id passed to OpenCode. Operator must set to a model in BOTH OpenCode's bundled models.dev catalog AND their API key's allowed list — these drift over time. |
| `vertex_project_id` | — | None | GCP project ID (required for vertex provider) |
| `vertex_location` | — | "europe-west1" | GCP location / region |
| `vertex_endpoint` | — | derived from location | Vertex AI endpoint hostname (e.g. `europe-west1-aiplatform.googleapis.com`) |
@@ -507,6 +540,7 @@ indexed in ETS by the Librarian actor for fast queries.
| DAG nodes | (in-memory ETS, populated from cycle log) | `CycleNode` | Operational telemetry: token counts, tool calls, D' gates, agent output per cycle |
| Comms | `.springdrift/memory/comms/YYYY-MM-DD-comms.jsonl` | `CommsMessage` | Sent and received email messages with delivery status |
| Consolidation | `.springdrift/memory/consolidation/YYYY-MM-DD-consolidation.jsonl` | `ConsolidationRun` | Remembrancer run records: period, counts, report path |
| Coder sessions | `.springdrift/memory/coder/sessions/<session_id>.json` | OpenCode session export | Phase 4 ingestion archives every completed coder dispatch as raw conversation JSON for forensics + replay. CBR cases are derived from these via `coder/ingest.gleam`. |
| Strategies | `.springdrift/memory/strategies/YYYY-MM-DD-strategies.jsonl` | `StrategyEvent` | Meta-learning Phase A. Append-only Created/Used/Outcome/Archived events; `Strategy` derived by replay |
| Learning Goals | `.springdrift/memory/learning_goals/YYYY-MM-DD-goals.jsonl` | `GoalEvent` | Meta-learning Phase C. Append-only Created/EvidenceAdded/StatusChanged events; `LearningGoal` derived by replay |

@@ -721,7 +755,7 @@ exposes this (and other system state) to the LLM.
| Planner | none (XML output) | 5 | unlimited | Permanent | Pure reasoning: plan decomposition, steps, dependencies, risk identification |
| Project Manager | planner (22 tools) | 8 | unlimited | Permanent | Full work management: tasks, endeavours, phases, sessions, blockers, forecaster |
| Researcher | web + artifacts + builtin | 8 | 30 | Permanent | Gather information via search and extraction |
| Coder | builtin | 10 | unlimited | Permanent | Write and modify code, fix errors |
| Coder | planner Group A + project_status/read/grep + dispatch_coder/cancel_coder_session/list_coder_sessions + builtin | 20 | unlimited | Permanent | Frame the work via project_status/read/grep, delegate the actual edits via dispatch_coder (one OpenCode session per call, async-safe via the cog's worker pattern), verify on disk, land/escalate via complete_task_step/flag_risk/report_blocker. Only registered when `[coder]` is fully configured (image + project_root + model_id + ANTHROPIC_API_KEY). See the `coder-delegation` skill. |
| Writer | knowledge (drafts) + artifacts + builtin | 5 | unlimited | Permanent | Draft structured reports; create/update/promote drafts via document library; render approved exports to PDF via `export_pdf` (pandoc + tectonic) |
| Observer | diagnostic + CBR curation (18 tools) | 6 | 20 | Transient | Cycle forensics, pattern detection, CBR curation, fact tracing, D' feedback |
| Comms | comms (4 tools) | 6 | 20 | Permanent | Send and receive email via AgentMail |
@@ -1380,6 +1414,7 @@ CLI flags override config files. `--skills-dir` is repeatable and appends to the
├── planner-patterns/ Planner + cognitive: task decomposition patterns
├── planner-management/ Planner + cognitive: forecaster introspection, feature tuning, endeavour lifecycle
├── code-review/ Coder: sandbox patterns and common failure modes
├── coder-delegation/ Coder + cognitive: the engineering loop in real-coder mode (frame → dispatch → inspect → land/escalate)
├── web-research/ Researcher + cognitive: web tool selection decision tree
└── shell-sandbox/ Coder: Docker sandbox usage guide
```