Reef v2 orchestration control plane and operator UX by pranavpatilsce · Pull Request #26 · hdresearch/reef

pranavpatilsce · 2026-03-28T04:09:21Z

Summary

This PR lands the Reef v2 orchestration control plane and operator UX.

At a product level, it turns Reef from a thin chat wrapper into an active fleet supervisor with:

a single vm-tree-backed source of truth
recursive orchestration across root, lieutenants, agent VMs, swarm workers, and resource VMs
explicit authority / coordination / synchronization primitives
durable usage, logs, scheduling, and lineage views
a chat-first operator UI with persistent fleet world state

This PR also includes the latest runtime and prompt-contract tightening needed to make repo-oriented implementation safer:

stronger root task startup handling
hardened GitHub repo preparation
safer root deployment defaults
clearer ownership and delegation defaults for non-trivial repo builds

What changed

Fleet state and orchestration model

unified fleet state under vm-tree
removed the old registry service from the active Reef surface
added explicit VM categories:
- infra_vm
- lieutenant
- agent_vm
- swarm_vm
- resource_vm
fixed nested lineage so spawned descendants keep their true parentage instead of flattening to root
split operational views from history:
- active-by-default fleet/tree views
- explicit history-inclusive lineage views
kept running resource_vm infrastructure visible in active views even when its parent subtree becomes historical
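The category and view rules above can be sketched as a small filter over tree rows. This is a minimal illustration, not the vm-tree implementation: only the five category names come from this PR, while the `VmNode` shape, the `running`/`stopped` status values, and the helper names `lineage` and `activeView` are assumptions made for the example.

```typescript
// Hypothetical row shape; only the category names are taken from the PR text.
type VmCategory = "infra_vm" | "lieutenant" | "agent_vm" | "swarm_vm" | "resource_vm";

interface VmNode {
  id: string;
  parentId: string | null; // true parent, never flattened to root
  category: VmCategory;
  status: "running" | "stopped"; // assumed status values
}

// Returns the chain of ids from the top-most ancestor down to `id`.
function lineage(byId: Map<string, VmNode>, id: string): string[] {
  const chain: string[] = [];
  const seen = new Set<string>(); // guards against accidental cycles
  let cur = byId.get(id);
  while (cur && !seen.has(cur.id)) {
    chain.unshift(cur.id);
    seen.add(cur.id);
    cur = cur.parentId !== null ? byId.get(cur.parentId) : undefined;
  }
  return chain;
}

// Active view: a running node is shown when every ancestor is also running.
// A running resource_vm is shown even if its parent subtree is historical.
function activeView(all: VmNode[]): VmNode[] {
  const byId = new Map(all.map((n) => [n.id, n] as const));
  return all.filter((n) => {
    if (n.status !== "running") return false;
    if (n.category === "resource_vm") return true;
    return lineage(byId, n.id).every((a) => byId.get(a)!.status === "running");
  });
}
```

A history-inclusive lineage view would skip the filter and return every row, using `lineage` to draw the parent chain.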

Communication and coordination

added first-class communication semantics:
- reef_signal upward
- reef_command downward
- reef_peer_signal for same-parent sibling coordination
enforced hierarchical command authority while allowing bounded peer coordination
added store coordination primitives:
- reef_store_list for discovery across namespaced keys
- reef_store_wait for barriers, waits, and synchronization
made exact-key waits namespace-transparent across agent-owned keys for cross-agent coordination
added scheduled orchestration primitives:
- reef_schedule_check
- reef_scheduled
- reef_cancel_scheduled
removed reminder timers from the active orchestration surface and moved follow-up toward scheduled checks
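One way to picture the authority rules above is a single delivery check. The sketch below is illustrative only: the three tool names come from this PR, but the `AgentRef` shape, the `mayDeliver` helper, and the exact reach of each primitive (signals to the direct parent, commands to any descendant) are assumptions rather than confirmed behavior.

```typescript
// Hypothetical model; only the three tool names are taken from the PR text.
type MessageKind = "reef_signal" | "reef_command" | "reef_peer_signal";

interface AgentRef {
  name: string;
  parent: string | null; // null for root
}

// Walks parent links upward from `descendant`; `seen` guards against cycles.
function isAncestor(agents: Map<string, AgentRef>, ancestor: string, descendant: string): boolean {
  const seen = new Set<string>();
  let cur = agents.get(descendant)?.parent ?? null;
  while (cur !== null && !seen.has(cur)) {
    if (cur === ancestor) return true;
    seen.add(cur);
    cur = agents.get(cur)?.parent ?? null;
  }
  return false;
}

function mayDeliver(agents: Map<string, AgentRef>, kind: MessageKind, from: string, to: string): boolean {
  const sender = agents.get(from);
  const target = agents.get(to);
  if (!sender || !target || from === to) return false;
  // Assumption: signals travel one hop up, to the direct parent.
  if (kind === "reef_signal") return sender.parent === to;
  // Assumption: commands may target any descendant of the sender.
  if (kind === "reef_command") return isAncestor(agents, from, to);
  // Peer signals: bounded to siblings that share the same parent.
  return sender.parent !== null && sender.parent === target.parent;
}
```

Keeping the check in one function makes the asymmetry explicit: a child can never command its parent, and peers can only reach siblings.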

Runtime and lifecycle behavior

made lieutenant behavior durable by default instead of disposable-worker by default
kept root supervision bounded to conversation turns while preserving continuity across turns
preserved post-mortem access to stopped child logs
fixed swarm completion surfacing into the main signal plane
fixed swarm wait semantics so it resolves on active workers instead of hanging on idle siblings
kept resource_vm protected-by-default and out of token/cost accounting
added explicit child lifecycle tasking for alive idle agent VMs
hardened post-task disposition handling and target semantics

Usage, logs, and observability

added lineage-aware usage rollups across lieutenants, agent VMs, and swarm workers
ensured stopped descendants still contribute to subtree usage
kept resource_vm at zero usage
built a real log browser:
- keyword search
- date range search
- agent filter
- level filter
- count and totalCount
routed the logs panel through the UI auth proxy so browser usage works correctly
added the persistent reef world state section above the composer
made that section more live and useful:
- SSE-assisted refresh
- polling fallback
- partial-failure tolerance
- scrollable cards
- receiving-card fallback when inbox is empty
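The lineage-aware usage rollup described at the start of this section can be sketched as a subtree sum. This is a rough illustration under stated assumptions: the `UsageRow` shape, the `tokens` field, and the `subtreeTokens` helper are hypothetical, and only the rules (stopped descendants still count, resource_vm counts as zero) come from this PR.

```typescript
// Hypothetical usage row; field names are assumptions for illustration.
interface UsageRow {
  id: string;
  parentId: string | null;
  category: string;
  status: "running" | "stopped";
  tokens: number;
}

// Sums tokens for `rootId` and all of its descendants.
// Stopped rows are included; resource_vm rows contribute zero.
function subtreeTokens(rows: UsageRow[], rootId: string): number {
  const byId = new Map<string, UsageRow>();
  const children = new Map<string, UsageRow[]>();
  for (const r of rows) {
    byId.set(r.id, r);
    if (r.parentId !== null) {
      const list = children.get(r.parentId) ?? [];
      list.push(r);
      children.set(r.parentId, list);
    }
  }
  const start = byId.get(rootId);
  const stack: UsageRow[] = start ? [start] : [];
  const seen = new Set<string>(); // guards against malformed cyclic data
  let total = 0;
  while (stack.length > 0) {
    const row = stack.pop()!;
    if (seen.has(row.id)) continue;
    seen.add(row.id);
    if (row.category !== "resource_vm") total += row.tokens;
    stack.push(...(children.get(row.id) ?? []));
  }
  return total;
}
```

Note that status is never consulted in the sum, which is what lets stopped descendants keep contributing to their subtree.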

Root deployment safety and service runtime

added a root task startup watchdog with a bounded retry so tasks fail explicitly instead of silently hanging
hardened reef_git_prepare for real repo work
made service reload handle symlinked service directories consistently
improved /services/deploy diagnostics so path and service-root mistakes are legible
gated Reef-root service deploys behind explicit control-plane intent
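The startup watchdog idea can be expressed as a pure decision step, which keeps it easy to test. The sketch below is not the code in this PR: `StartupState`, `watchdogStep`, the timeout, and the retry limit are all hypothetical names and values chosen to illustrate "retry a bounded number of times, then fail explicitly".

```typescript
// Hypothetical state for a submitted root task; not the shape used in Reef.
interface StartupState {
  submittedAt: number; // ms timestamp of the latest (re)submission
  started: boolean;    // true once the task is observed running
  retries: number;     // retries already spent
}

type WatchdogAction = "ok" | "wait" | "retry" | "fail";

// Decides what the watchdog should do at time `now`.
function watchdogStep(state: StartupState, now: number, timeoutMs: number, maxRetries: number): WatchdogAction {
  if (state.started) return "ok";
  if (now - state.submittedAt < timeoutMs) return "wait";
  // Timed out without starting: retry while budget remains, otherwise fail loudly.
  return state.retries < maxRetries ? "retry" : "fail";
}
```

A caller would invoke this on a timer, resubmit and bump `retries` on `"retry"`, and mark the task failed on `"fail"` so it never sits silently in a pending state.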

AGENTS and skills

slimmed AGENTS.md down to the environment contract and moved procedures into skills
added/updated skills for:
- decomposition
- code delivery
- GitHub repo preparation
- app deployment outside Reef root
- root supervision
- reporting/checkpointing
- service creation
tightened repo-build behavior so:
- repo-local guidance is read early
- product/application work defaults outside Reef root
- non-trivial repo builds assign ownership earlier
- root is less likely to silently reclaim child-owned work

Why this matters

Before this change, Reef had split control-plane concepts and too much orchestration state lived in fragile conventions.

After this change:

fleet state is legible
recursive orchestration is real, not aspirational
coordination primitives are explicit
history and active operations are separated cleanly
root behaves more like an operator and less like a passive transcript bot
the UI gives a live operational picture instead of just a chat feed
repo implementation work is safer and less likely to mutate Reef root by accident

Validation

This branch was exercised with repeated local tests and private live reprovisions.

Key local coverage

tests/authority.test.ts
tests/lieutenant.test.ts
tests/swarm-runtime.test.ts
tests/usage.test.ts
tests/store.test.ts
tests/scheduled.test.ts
tests/logs-search.test.ts
tests/probe.test.ts
tests/vm-tree-history.test.ts
src/reef.test.ts
tests/github.test.ts
services/services/services.test.ts

Key live validations

Live runs confirmed:

nested lineage is preserved and no longer flattened to root
usage rollups match the recorded tree
sibling peer signaling works for agent siblings and swarm siblings
store discovery and barrier waits work in real coordination flows
scheduled checks are usable and cleaner than reminders
root turn completion is bounded and no longer stays stuck in running
a lieutenant remains durable after signaling done
active/history views behave correctly in practice
protected resource_vm behavior matches policy
the logs browser works live
the reef world state section is populated and useful
repo implementation defaults to a location outside Reef root in successful benchmark runs
resource VM + lieutenant maintenance patterns work in practice

Follow-ups / non-blocking issues

root still sometimes over-implements before delegating on complex repo builds
deployment ownership reclaim should become more explicit when root takes over from a blocked child
service/runtime operator surfaces will benefit from a dedicated Services tab later

Cross-repo context

This PR is part of a coordinated stack with:

hdresearch/vers-fleets branch reef-v2-orchestration
hdresearch/pi-vers branch remote-bg-process-hardening

Those PRs should be reviewed alongside this one because they provide the provisioning and runtime substrate Reef depends on.

… permissions

Slice 1 of the reef v2 orchestration spec. Core infrastructure changes:

Unified SQLite schema (services/vm-tree/store.ts):
- Single fleet.sqlite replaces registry.sqlite, vms.sqlite, lieutenants.sqlite
- 7 tables: vm_tree, signals, agent_events, logs, store, store_history
- vm_tree has full v2 identity: category, context, directive, model, effort, grants, RPC state, snapshots, rewind lineage
- Name uniqueness enforcement among active VMs
- Fleet status live query

Signals service (services/signals/):
- Bidirectional: upward signals (done, blocked, failed, progress, need-resources, checkpoint) + downward commands (abort, pause, resume, steer)
- Tools: reef_signal (send up), reef_command (send down), reef_inbox (unified inbox with direction/type/from filters, auto-acknowledge on read)
- Event bus integration, debug panel

Store migration (services/store/):
- JSON file (data/store.json) → SQLite store + store_history tables
- Every write versioned in store_history with VM lineage tracking
- Auto-migrates from JSON on first init
- Adds GET /:key/history route

Category-based permissions (src/extension.ts):
- Replaces binary REEF_CHILD_AGENT flag with REEF_CATEGORY-based service selection
- infra_vm=all, lieutenant=7 services, agent_vm=5, swarm_vm=5, resource_vm=none
- Backward compat: old env vars still resolve correctly

Spawn flow updates:
- Lieutenant and swarm spawns inject REEF_CATEGORY, VERS_AGENT_NAME, REEF_PARENT_VM_ID, REEF_ROOT_VM_ID
- V1 env vars kept for backward compat during transition

Base AGENTS.md:
- Replaced with v2 universal AGENTS.md: covers tools, signals, operating principles, behavioral rules, model selection, result reporting

… anthropic

Provider selection across root, lieutenant, and swarm spawn:
- If LLM_PROXY_KEY exists → use vers provider (preferred)
- If ANTHROPIC_API_KEY starts with sk-ant- → use anthropic provider (fallback)
- Default → vers

ANTHROPIC_API_KEY propagation to workers:
- Prefer direct anthropic key (sk-ant-*) if no vers proxy key exists
- Fallback chain: vers proxy key → direct anthropic key → LLM_PROXY_KEY
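The selection order in this commit message reads naturally as a three-branch function. The following is a minimal sketch of that order, assuming the environment is passed in as a plain record; the `selectProvider` name and `Provider` type are hypothetical. Note that later commits in this PR remove the anthropic fallback entirely, so this reflects the intermediate state only.

```typescript
type Provider = "vers" | "anthropic";

// Mirrors the order stated in the commit message:
// 1. LLM_PROXY_KEY present        -> vers (preferred)
// 2. ANTHROPIC_API_KEY is sk-ant- -> anthropic (fallback)
// 3. otherwise                    -> vers (default)
function selectProvider(env: Record<string, string | undefined>): Provider {
  if (env.LLM_PROXY_KEY) return "vers";
  if (env.ANTHROPIC_API_KEY?.startsWith("sk-ant-")) return "anthropic";
  return "vers";
}
```

In a Node process this could be called as `selectProvider(process.env)`.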

… + bug fixes

AGENTS.md inheritance (src/core/agents-md.ts):
- Shared utility: readParentAgentsMd, buildChildAgentsMd, buildAgentsMdWriteScript
- Swarm and lieutenant spawn flows copy parent AGENTS.md to child VM
- Context appended as "## Context from <parent-name>" section
- Context flows through spawn params → event → vm_tree

reef_agent_spawn tool:
- Spawns autonomous agent VM with task (required), context, directive, model
- Category set to agent_vm from the start (no PATCH race)
- Task auto-sent on spawn, so agents start working immediately

reef_fleet_status tool:
- Live view of direct children: name, category, status, model, last signal, context
- Fleet-wide summary (alive VMs, categories)

Bug fixes:
- Categories: category flows through spawn → event → vm_tree (no more swapped labels)
- Status lifecycle: swarm:agent_spawned fires before swarm:agent_ready (row exists for update)
- Status on done/failed: signals service updates sender vm_tree status to stopped
- Stale signals: old signals auto-acknowledged when spawning agent with reused name
- Spawn guard: accepts ANTHROPIC_API_KEY as alternative to LLM_PROXY_KEY
- Provider auto-detect: vers preferred, anthropic fallback (root + lieutenant + swarm)

Logs service (services/logs/):
- reef_log tool: write structured log entries (level, category, message, metadata)
- reef_logs tool: read own or another agent's logs (cross-agent for handoff)
- Auto-logging: registerBehaviors taps tool_call and tool_result RPC events
- Routes: POST /logs, GET /logs (with filters), GET /_panel
- All agents get logs service in extension filtering

reef_checkpoint tool:
- Snapshots VM via vers_vm_commit, records commit in vm_tree
- Emits checkpoint signal to parent with commitId and message

reef_resource_spawn tool:
- Spawns bare metal VM from golden/base image
- Registers as resource_vm in vm_tree, returns SSH address
- No agent stack, no punkin, just a Linux box

Inbox behavior timer:
- Signals service polls inbox every 10 seconds
- Emits reef:signal:done/failed/blocked events on the extension bus

AGENTS.md system prompt loading:
- Root reads AGENTS.md on fresh tree via readParentAgentsMd()
- Workers get --system-prompt flag pointing to inherited AGENTS.md
- Lieutenant uses AGENTS.md when v2 agentsMd path provided
- Model ID fix: claude-haiku-4-5 → claude-haiku-4-5-20251001

Provider cleanup:
- Removed anthropic fallback; vers is the only provider
- LLM_PROXY_KEY required for all spawns
- Error message directs users to add credits at vers.sh

Fix 1 (VERS_AGENT_DIRECTIVE):
- Swarm buildWorkerEnv exports VERS_AGENT_DIRECTIVE when directive provided
- Lieutenant buildRemoteEnv exports VERS_AGENT_DIRECTIVE when directive provided
- reef_agent_spawn passes directive through spawn API to env injection
- SpawnParams, routes, and tool all pass directive end-to-end
- Verified: agent reads directive via bash, reports back correctly

Fix 2 (remove v1 env vars, clean break):
- Removed REEF_CHILD_AGENT='true' from swarm and lieutenant spawn flows
- Removed VERS_AGENT_ROLE='lieutenant'/'worker' from spawn flows
- Only REEF_CATEGORY is set now
- Category passed through buildWorkerEnv for agent_vm vs swarm_vm
- Updated test to expect REEF_CATEGORY instead of VERS_AGENT_ROLE

…rcement

Items 3-7 from spec drift audit:
- Effort/thinkingLevel wired through set_model RPC (root=high, swarm/lt passthrough)
- Baseline snapshot (versClient.commit after spawn) + completion agent_event
- Auto-trigger root task on failed/blocked signals (POST /reef/submit)
- Grants enforcement: github repo/profile scope, store key namespacing
- FleetClient sends X-Reef-Agent-Name/Category/VM-ID headers on all API calls
- Server-side store namespace enforcement in PUT/DELETE routes (not just client-side)

…, orphan sweep

All 4 spawn paths (swarm_vm, agent_vm, lieutenant, resource_vm) now:
- Register in vm_tree with status "creating" immediately after VM creation
- Clean up leaked VMs on failure (delete VM + mark vm_tree error)
- Validate AGENTS.md and env vars via SSH read-back after injection
- Return structured SpawnResult with per-agent ok/error and step name

Orphan cleanup: 5-minute sweep for VMs stuck in "creating" status, exposed as POST /swarm/orphan-cleanup for manual triggering.

SwarmRuntime and LieutenantRuntime now accept vmTreeStore for direct SQLite access instead of relying on async event handlers.
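The orphan sweep amounts to selecting rows that have sat in "creating" for too long. A minimal sketch follows, with loud assumptions: the `TreeRow` shape, the `createdAt` field, and the `findStuckCreating` helper are hypothetical, and treating five minutes as the age threshold (rather than only the sweep interval) is a guess; the `creating`, `stopped`, and `error` statuses appear in this PR, while `running` is assumed.

```typescript
// Hypothetical vm_tree row subset used for the sweep.
interface TreeRow {
  vmId: string;
  status: "creating" | "running" | "stopped" | "error";
  createdAt: number; // ms timestamp, assumed field
}

// Assumption: a VM still "creating" after five minutes counts as orphaned.
const SWEEP_AGE_MS = 5 * 60 * 1000;

// Returns rows stuck in "creating" for at least `maxAgeMs`.
function findStuckCreating(rows: TreeRow[], now: number, maxAgeMs: number = SWEEP_AGE_MS): TreeRow[] {
  return rows.filter((r) => r.status === "creating" && now - r.createdAt >= maxAgeMs);
}
```

The sweep itself would then delete each returned VM and mark its row as errored, matching the cleanup behavior described for failed spawns.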

… forwarding

- Add signals, logs, swarm, cron to LIVE_REFRESH_PANELS (tabs now auto-refresh)
- Remove store from SKIP_PANELS (store tab now appears in dashboard)
- Forward GITHUB_TOKEN env var to spawned agents (swarm + lieutenant paths)

- Remove v1 tabs: registry, lieutenant, swarm, commits, docs, services
- vm-tree tab renamed to "fleet" (the single fleet view)
- v2 tabs: fleet, signals, logs, store, github, cron (sorted)
- ALL panels auto-refresh every 5s (no whitelist, no stale state)
- Remove LIVE_REFRESH_PANELS whitelist: everything is live

v2 clean break:
- Remove v1 tabs: registry, lieutenant, swarm, docs, services
- Keep: fleet (vm-tree), signals, logs, store, commits, github, cron
- Tabs sorted in logical order with friendly labels
- ALL panels auto-refresh every 2s (no whitelist)
- Sticky table headers + scrollable panel area for long content

Reef now only uses the vers provider. Removed resolveRootProvider(), maybeFallbackToAnthropic(), and all ANTHROPIC_API_KEY / REEF_MODEL_PROVIDER propagation from lieutenant RPC, swarm worker env, and persist-keys scripts. Credit exhaustion fails directly instead of switching providers.

pranavpatilsce added 30 commits March 26, 2026 11:56

fix: always use standard context header in AGENTS.md inheritance

a8f485c

buildChildAgentsMd always prepends "## Context from <parent-name>" regardless of whether the context string starts with "##". This ensures the context chain is always traceable and follows the spec format.
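The header rule in this commit can be illustrated with a small string builder. This is a sketch and not the function in src/core/agents-md.ts: the name `buildChildAgentsMdSketch`, its parameter list, and the choice to return the parent text unchanged when no context is given are all assumptions; only the unconditional "## Context from <parent-name>" header comes from the commit message.

```typescript
// Illustrative stand-in; the real buildChildAgentsMd signature is not shown in this PR.
function buildChildAgentsMdSketch(parentAgentsMd: string, parentName: string, context?: string): string {
  // Assumption: with no context, the child inherits the parent file as-is.
  if (!context || context.trim() === "") return parentAgentsMd;
  // The header is always added, even when the context already starts with "##",
  // so the chain of parents stays traceable.
  const header = `## Context from ${parentName}`;
  return `${parentAgentsMd.trimEnd()}\n\n${header}\n\n${context.trim()}\n`;
}
```

Because the header is never skipped, a context that itself begins with a heading ends up nested under the parent's header instead of replacing it.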

fix: correct behavior timer interval in AGENTS.md (30s → 10s)

baadb0b

The behavior timer polls reef_inbox every 10 seconds (code: signals/index.ts setInterval 10_000), not 30 seconds as previously documented.

revert: remove GITHUB_TOKEN forwarding from spawn env

7119b98

GitHub token access will be handled via the reef github service (vers GitHub App) instead of baking PATs into env vars.

Fix root vm lifecycle and anthropic fallback

03c53f0

Unify fleet state under vm-tree

26e2125

Add peer coordination signals

5f67e86

Fix fleet status and signal lookup semantics

9965c27

Fix swarm wait and terminal rpc cleanup

cbdb554

Add store coordination and scheduled checks

799b020

Make scheduled checks condition-first and remove reminders

9ce9609

Refine fleet history views and operator UX

3a84fed

Route logs panel through UI auth proxy

682e8cc

Bound root supervision to conversation turns

a4081d7

Hide alternate provider fallback wording

f8212c5

Add chat-first mobile Reef UI

80cb612

Fix CI test regressions after mobile UI merge

6aaf584

Split AGENTS guidance into skills and add inbox wait

f76c666

Clarify swarm wait guidance in agent skills

98f08be

Refine scheduled delivery and inbox catch-up

67f27b7

Harden child lifecycle tasking and panel refresh

f3dfd11

pranavpatilsce added 5 commits March 31, 2026 12:46

Harden root deployment defaults and service runtime

349f0ce

Tighten root ownership defaults for repo builds

d5c8ea6

Stabilize scheduled wake tests

c485358

Update agent guidance and auth env handling

793f697

pranavpatilsce merged commit f3faaa1 into main Apr 6, 2026
1 check passed

pranavpatilsce merged 35 commits into main from feat/reef-v2-orchestration
