Reef v2 orchestration control plane and operator UX #26

Merged
pranavpatilsce merged 35 commits into main from feat/reef-v2-orchestration
Apr 6, 2026
pranavpatilsce commented Mar 28, 2026

Summary

This PR lands the Reef v2 orchestration control plane and operator UX.

At a product level, it turns Reef from a thin chat wrapper into an active fleet supervisor with:

  • a single vm-tree-backed source of truth
  • recursive orchestration across root, lieutenants, agent VMs, swarm workers, and resource VMs
  • explicit authority / coordination / synchronization primitives
  • durable usage, logs, scheduling, and lineage views
  • a chat-first operator UI with persistent fleet world state

This PR also includes the latest runtime and prompt-contract tightening needed to make repo-oriented implementation safer:

  • stronger root task startup handling
  • hardened GitHub repo preparation
  • safer root deployment defaults
  • clearer ownership and delegation defaults for non-trivial repo builds

What changed

Fleet state and orchestration model

  • unified fleet state under vm-tree
  • removed the old registry service from the active Reef surface
  • added explicit VM categories:
    • infra_vm
    • lieutenant
    • agent_vm
    • swarm_vm
    • resource_vm
  • fixed nested lineage so spawned descendants keep their true parentage instead of flattening to root
  • split operational views from history:
    • active-by-default fleet/tree views
    • explicit history-inclusive lineage views
  • kept running resource_vm infrastructure visible in active views even when its parent subtree becomes historical

Communication and coordination

  • added first-class communication semantics:
    • reef_signal upward
    • reef_command downward
    • reef_peer_signal for same-parent sibling coordination
  • enforced hierarchical command authority while allowing bounded peer coordination
  • added store coordination primitives:
    • reef_store_list for discovery across namespaced keys
    • reef_store_wait for barriers, waits, and synchronization
  • made exact-key waits namespace-transparent across agent-owned keys for cross-agent coordination
  • added scheduled orchestration primitives:
    • reef_schedule_check
    • reef_scheduled
    • reef_cancel_scheduled
  • removed reminder timers from the active orchestration surface and moved follow-up toward scheduled checks
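
The direction rules for the three communication primitives can be sketched as plain tree checks. This is a minimal illustration, not the code in this PR: the `VmNode` shape and the `canCommand`, `canSignal`, and `canPeerSignal` helpers are hypothetical, and it assumes commands may target any descendant while signals go to the direct parent.

```typescript
// Hypothetical node shape; the real vm_tree schema is not shown in this PR.
interface VmNode {
  id: string;
  parentId: string | null;
}

type Tree = Map<string, VmNode>;

function parentOf(tree: Tree, id: string): string | null {
  const node = tree.get(id);
  return node ? node.parentId : null;
}

// True when `ancestorId` appears on the parent chain of `nodeId`.
function isAncestor(tree: Tree, ancestorId: string, nodeId: string): boolean {
  let current = parentOf(tree, nodeId);
  while (current !== null) {
    if (current === ancestorId) return true;
    current = parentOf(tree, current);
  }
  return false;
}

// reef_command flows downward (assumption: any descendant is a valid target).
function canCommand(tree: Tree, from: string, to: string): boolean {
  return isAncestor(tree, from, to);
}

// reef_signal flows upward (assumption: only to the direct parent).
function canSignal(tree: Tree, from: string, to: string): boolean {
  return parentOf(tree, from) === to;
}

// reef_peer_signal is limited to distinct nodes that share the same parent.
function canPeerSignal(tree: Tree, a: string, b: string): boolean {
  const pa = parentOf(tree, a);
  return a !== b && pa !== null && pa === parentOf(tree, b);
}

const tree: Tree = new Map<string, VmNode>([
  ["root", { id: "root", parentId: null }],
  ["lt-1", { id: "lt-1", parentId: "root" }],
  ["agent-a", { id: "agent-a", parentId: "lt-1" }],
  ["agent-b", { id: "agent-b", parentId: "lt-1" }],
]);
```

With this tree, `root` may command `agent-a`, `agent-a` may signal `lt-1`, and `agent-a` and `agent-b` may exchange peer signals, but `agent-a` may not command `lt-1`.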

Runtime and lifecycle behavior

  • made lieutenant behavior durable by default instead of disposable-worker by default
  • kept root supervision bounded to turns while preserving continuity across turns
  • preserved post-mortem access to stopped child logs
  • fixed swarm completion surfacing into the main signal plane
  • fixed swarm wait semantics so it resolves on active workers instead of hanging on idle siblings
  • kept resource_vm protected-by-default and out of token/cost accounting
  • added explicit child lifecycle tasking for alive idle agent VMs
  • hardened post-task disposition handling and target semantics

Usage, logs, and observability

  • added lineage-aware usage rollups across lieutenants, agent VMs, and swarm workers
  • ensured stopped descendants still contribute to subtree usage
  • kept resource_vm at zero usage
  • built a real log browser:
    • keyword search
    • date range search
    • agent filter
    • level filter
    • count and totalCount
  • routed the logs panel through the UI auth proxy so browser usage works correctly
  • added the persistent reef world state section above the composer
  • made that section more live and useful:
    • SSE-assisted refresh
    • polling fallback
    • partial-failure tolerance
    • scrollable cards
    • receiving-card fallback when inbox is empty
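
The usage rollup rules listed above (stopped descendants still count, resource_vm stays at zero) can be sketched as a recursive sum over the tree. The `UsageNode` shape, the `tokens` field, and `subtreeUsage` are assumptions made for illustration; the real rollup in this PR may be computed differently, for example in SQL.

```typescript
type Category = "infra_vm" | "lieutenant" | "agent_vm" | "swarm_vm" | "resource_vm";

// Hypothetical row shape for illustration only.
interface UsageNode {
  id: string;
  parentId: string | null;
  category: Category;
  status: "running" | "stopped";
  tokens: number;
}

function subtreeUsage(nodes: UsageNode[], rootId: string): number {
  const byId = new Map<string, UsageNode>();
  const children = new Map<string, string[]>();
  for (const node of nodes) {
    byId.set(node.id, node);
    if (node.parentId !== null) {
      const siblings = children.get(node.parentId) ?? [];
      siblings.push(node.id);
      children.set(node.parentId, siblings);
    }
  }
  const total = (id: string): number => {
    const node = byId.get(id);
    if (!node) return 0;
    // resource_vm is excluded from accounting; status is deliberately ignored
    // so stopped descendants keep contributing to their ancestors.
    let sum = node.category === "resource_vm" ? 0 : node.tokens;
    for (const childId of children.get(id) ?? []) sum += total(childId);
    return sum;
  };
  return total(rootId);
}

const fleet: UsageNode[] = [
  { id: "root", parentId: null, category: "infra_vm", status: "running", tokens: 100 },
  { id: "lt-1", parentId: "root", category: "lieutenant", status: "running", tokens: 50 },
  { id: "agent-a", parentId: "lt-1", category: "agent_vm", status: "stopped", tokens: 30 },
  // Non-zero value only to show that resource_vm is ignored by the rollup.
  { id: "db-1", parentId: "lt-1", category: "resource_vm", status: "running", tokens: 999 },
];
```

Here `subtreeUsage(fleet, "root")` sums root, lieutenant, and the stopped agent, and skips `db-1`.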

Root deployment safety and service runtime

  • added a root task startup watchdog with a bounded retry so tasks fail explicitly instead of silently hanging
  • hardened reef_git_prepare for real repo work
  • made service reload handle symlinked service directories consistently
  • improved /services/deploy diagnostics so path and service-root mistakes are legible
  • gated Reef-root service deploys behind explicit control-plane intent
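
The startup watchdog with a bounded retry can be pictured as a small decision function that is evaluated on a timer. This is a sketch under assumptions: `StartupState`, `checkStartup`, and the timeout and retry values are hypothetical and do not come from this PR.

```typescript
// Hypothetical state tracked per submitted root task.
interface StartupState {
  submittedAt: number;      // ms timestamp of the most recent submit or retry
  startedAt: number | null; // set once the task is observed to be running
  retries: number;          // retries already performed
}

type WatchdogAction = "ok" | "wait" | "retry" | "fail";

function checkStartup(
  state: StartupState,
  now: number,
  timeoutMs: number,
  maxRetries: number,
): WatchdogAction {
  if (state.startedAt !== null) return "ok";              // task started, nothing to do
  if (now - state.submittedAt < timeoutMs) return "wait"; // still inside the grace period
  // Grace period expired: retry while budget remains, otherwise fail explicitly.
  return state.retries < maxRetries ? "retry" : "fail";
}
```

The point of the bounded budget is that every path ends in either `ok` or `fail`, so a task that never starts surfaces as an explicit failure instead of staying pending.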

AGENTS and skills

  • slimmed AGENTS.md down to the environment contract and moved procedures into skills
  • added/updated skills for:
    • decomposition
    • code delivery
    • GitHub repo preparation
    • app deployment outside Reef root
    • root supervision
    • reporting/checkpointing
    • service creation
  • tightened repo-build behavior so:
    • repo-local guidance is read early
    • product/application work defaults outside Reef root
    • non-trivial repo builds assign ownership earlier
    • root is less likely to silently reclaim child-owned work

Why this matters

Before this change, Reef had split control-plane concepts and too much orchestration state lived in fragile conventions.

After this change:

  • fleet state is legible
  • recursive orchestration is real, not aspirational
  • coordination primitives are explicit
  • history and active operations are separated cleanly
  • root behaves more like an operator and less like a passive transcript bot
  • the UI gives a live operational picture instead of just a chat feed
  • repo implementation work is safer and less likely to mutate Reef root by accident

Validation

This branch was exercised with repeated local tests and private live reprovisions.

Key local coverage

  • tests/authority.test.ts
  • tests/lieutenant.test.ts
  • tests/swarm-runtime.test.ts
  • tests/usage.test.ts
  • tests/store.test.ts
  • tests/scheduled.test.ts
  • tests/logs-search.test.ts
  • tests/probe.test.ts
  • tests/vm-tree-history.test.ts
  • src/reef.test.ts
  • tests/github.test.ts
  • services/services/services.test.ts

Key live validations

Live runs confirmed:

  • nested lineage is preserved and no longer flattened to root
  • usage rollups match the recorded tree
  • sibling peer signaling works for agent siblings and swarm siblings
  • store discovery and barrier waits work in real coordination flows
  • scheduled checks are usable and cleaner than reminders
  • root turn completion is bounded and no longer stays stuck in running
  • lieutenant remains durable after done
  • active/history views behave correctly in practice
  • protected resource_vm behavior matches policy
  • the logs browser works live
  • the reef world state section is populated and useful
  • repo implementation defaults outside Reef root in successful benchmark runs
  • resource VM + lieutenant maintenance patterns work in practice

Follow-ups / non-blocking issues

  • root still sometimes over-implements before delegating on complex repo builds
  • deployment ownership reclaim should become more explicit when root takes over from a blocked child
  • service/runtime operator surfaces will benefit from a dedicated Services tab later

Cross-repo context

This PR is part of a coordinated stack with:

  • hdresearch/vers-fleets branch reef-v2-orchestration
  • hdresearch/pi-vers branch remote-bg-process-hardening

Those PRs should be reviewed alongside this one because they provide the provisioning and runtime substrate Reef depends on.

… permissions

Slice 1 of the reef v2 orchestration spec. Core infrastructure changes:

Unified SQLite schema (services/vm-tree/store.ts):
- Single fleet.sqlite replaces registry.sqlite, vms.sqlite, lieutenants.sqlite
- 7 tables, including vm_tree, signals, agent_events, logs, store, store_history
- vm_tree has full v2 identity: category, context, directive, model, effort,
  grants, RPC state, snapshots, rewind lineage
- Name uniqueness enforcement among active VMs
- Fleet status live query

Signals service (services/signals/):
- Bidirectional: upward signals (done, blocked, failed, progress, need-resources,
  checkpoint) + downward commands (abort, pause, resume, steer)
- Tools: reef_signal (send up), reef_command (send down), reef_inbox (unified
  inbox with direction/type/from filters, auto-acknowledge on read)
- Event bus integration, debug panel

Store migration (services/store/):
- JSON file (data/store.json) → SQLite store + store_history tables
- Every write versioned in store_history with VM lineage tracking
- Auto-migrates from JSON on first init
- Adds GET /:key/history route
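
The "every write versioned" behavior can be illustrated with an in-memory stand-in for the store and store_history tables. The `StoreVersion` fields and the `VersionedStore` class are hypothetical; the actual SQLite columns are not listed in this commit message.

```typescript
// Hypothetical row shape; writerVmId stands in for "VM lineage tracking".
interface StoreVersion {
  key: string;
  value: unknown;
  version: number;
  writerVmId: string;
}

class VersionedStore {
  private current = new Map<string, StoreVersion>();
  private history: StoreVersion[] = [];

  // Each write replaces the current value and appends a history row.
  put(key: string, value: unknown, writerVmId: string): StoreVersion {
    const previous = this.current.get(key);
    const entry: StoreVersion = {
      key,
      value,
      version: previous ? previous.version + 1 : 1,
      writerVmId,
    };
    this.current.set(key, entry);
    this.history.push(entry);
    return entry;
  }

  get(key: string): unknown {
    const entry = this.current.get(key);
    return entry ? entry.value : undefined;
  }

  // Rough analogue of the GET /:key/history route.
  historyFor(key: string): StoreVersion[] {
    return this.history.filter((h) => h.key === key);
  }
}

const store = new VersionedStore();
store.put("build/status", "pending", "vm-agent-a");
store.put("build/status", "green", "vm-agent-b");
```

After the two writes, `store.get("build/status")` returns the latest value while `historyFor` still exposes both versions and their writers.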

Category-based permissions (src/extension.ts):
- Replaces binary REEF_CHILD_AGENT flag with REEF_CATEGORY-based service selection
- infra_vm=all, lieutenant=7 services, agent_vm=5, swarm_vm=5, resource_vm=none
- Backward compat: old env vars still resolve correctly
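
One way the backward-compatible category resolution could look is sketched below. The mapping of the old variables is an assumption: this commit only says that old env vars still resolve, so treating `VERS_AGENT_ROLE=worker` or `REEF_CHILD_AGENT=true` as `swarm_vm` is a guess, and `resolveCategory` is a hypothetical name.

```typescript
type Category = "infra_vm" | "lieutenant" | "agent_vm" | "swarm_vm" | "resource_vm";

const CATEGORIES: ReadonlyArray<Category> = [
  "infra_vm",
  "lieutenant",
  "agent_vm",
  "swarm_vm",
  "resource_vm",
];

function resolveCategory(env: Record<string, string | undefined>): Category {
  // Prefer the v2 variable when it holds a known category.
  const explicit = env["REEF_CATEGORY"];
  if (explicit !== undefined && CATEGORIES.indexOf(explicit as Category) !== -1) {
    return explicit as Category;
  }
  // Assumed v1 fallbacks; the real mapping is not spelled out in the PR.
  if (env["VERS_AGENT_ROLE"] === "lieutenant") return "lieutenant";
  if (env["VERS_AGENT_ROLE"] === "worker" || env["REEF_CHILD_AGENT"] === "true") {
    return "swarm_vm";
  }
  // No child markers at all: treat the process as the root infra VM.
  return "infra_vm";
}
```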

Spawn flow updates:
- Lieutenant and swarm spawns inject REEF_CATEGORY, VERS_AGENT_NAME,
  REEF_PARENT_VM_ID, REEF_ROOT_VM_ID
- V1 env vars kept for backward compat during transition

Base AGENTS.md:
- Replaced with v2 universal AGENTS.md — covers tools, signals, operating
  principles, behavioral rules, model selection, result reporting

… anthropic

Provider selection across root, lieutenant, and swarm spawn:
- If LLM_PROXY_KEY exists → use vers provider (preferred)
- If ANTHROPIC_API_KEY starts with sk-ant- → use anthropic provider (fallback)
- Default → vers
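
The three selection rules in this commit translate directly into a small function. This is a sketch of the rules as written here, with `selectProvider` as a hypothetical name; a later commit in this PR removes the anthropic fallback entirely.

```typescript
type Provider = "vers" | "anthropic";

function selectProvider(env: Record<string, string | undefined>): Provider {
  // Rule 1: a vers proxy key always wins.
  if (env["LLM_PROXY_KEY"]) return "vers";
  // Rule 2: fall back to anthropic only for keys with the sk-ant- prefix.
  const anthropicKey = env["ANTHROPIC_API_KEY"];
  if (anthropicKey !== undefined && anthropicKey.indexOf("sk-ant-") === 0) {
    return "anthropic";
  }
  // Rule 3: default to vers.
  return "vers";
}
```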

ANTHROPIC_API_KEY propagation to workers:
- Prefer direct anthropic key (sk-ant-*) if no vers proxy key exists
- Fallback chain: vers proxy key → direct anthropic key → LLM_PROXY_KEY

… + bug fixes

AGENTS.md inheritance (src/core/agents-md.ts):
- Shared utility: readParentAgentsMd, buildChildAgentsMd, buildAgentsMdWriteScript
- Swarm and lieutenant spawn flows copy parent AGENTS.md to child VM
- Context appended as "## Context from <parent-name>" section
- Context flows through spawn params → event → vm_tree
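
The context-append step can be sketched as a pure string function. The real `buildChildAgentsMd` signature is not shown in this PR, so `buildChildAgentsMdSketch` and its parameters are hypothetical; only the "## Context from <parent-name>" heading comes from the commit message.

```typescript
// Hypothetical stand-in for buildChildAgentsMd; the real signature may differ.
function buildChildAgentsMdSketch(
  parentAgentsMd: string,
  parentName: string,
  context?: string,
): string {
  // Without context the child simply inherits the parent's file.
  if (context === undefined || context.trim() === "") return parentAgentsMd;
  const base = parentAgentsMd.replace(/\s+$/, "");
  // Always add the heading so the context chain stays traceable,
  // even if the context text itself begins with a "##" heading.
  return `${base}\n\n## Context from ${parentName}\n\n${context.trim()}\n`;
}

const childMd = buildChildAgentsMdSketch(
  "# AGENTS\n\nBase rules.\n",
  "lt-1",
  "## Notes\nFocus on the API layer.",
);
```

Because each level appends its own section, a grandchild's file would carry one "Context from" section per ancestor that supplied context.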

reef_agent_spawn tool:
- Spawns autonomous agent VM with task (required), context, directive, model
- Category set to agent_vm from the start (no PATCH race)
- Task auto-sent on spawn — agents start working immediately

reef_fleet_status tool:
- Live view of direct children: name, category, status, model, last signal, context
- Fleet-wide summary (alive VMs, categories)

Bug fixes:
- Categories: category flows through spawn → event → vm_tree (no more swapped labels)
- Status lifecycle: swarm:agent_spawned fires before swarm:agent_ready (row exists for update)
- Status on done/failed: signals service updates sender vm_tree status to stopped
- Stale signals: old signals auto-acknowledged when spawning agent with reused name
- Spawn guard: accepts ANTHROPIC_API_KEY as alternative to LLM_PROXY_KEY
- Provider auto-detect: vers preferred, anthropic fallback (root + lieutenant + swarm)

Logs service (services/logs/):
- reef_log tool: write structured log entries (level, category, message, metadata)
- reef_logs tool: read own or another agent's logs (cross-agent for handoff)
- Auto-logging: registerBehaviors taps tool_call and tool_result RPC events
- Routes: POST /logs, GET /logs (with filters), GET /_panel
- All agents get logs service in extension filtering
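
A filtered log query of the kind GET /logs exposes can be sketched as an in-memory filter. Everything here is illustrative: the `LogEntry` and `LogQuery` shapes, the level names, the default limit, and the reading of `count` as "entries returned" versus `totalCount` as "all matches" are assumptions rather than the service's actual contract.

```typescript
type Level = "debug" | "info" | "warn" | "error"; // assumed level names

interface LogEntry {
  timestamp: string; // ISO 8601
  agent: string;
  level: Level;
  message: string;
}

interface LogQuery {
  keyword?: string;
  from?: string; // inclusive lower bound, ISO 8601
  to?: string;   // inclusive upper bound, ISO 8601
  agent?: string;
  level?: Level;
  limit?: number;
}

function searchLogs(entries: LogEntry[], q: LogQuery) {
  const keyword = q.keyword ? q.keyword.toLowerCase() : undefined;
  const fromMs = q.from ? Date.parse(q.from) : -Infinity;
  const toMs = q.to ? Date.parse(q.to) : Infinity;
  const matches = entries.filter((e) => {
    const t = Date.parse(e.timestamp);
    return (
      t >= fromMs &&
      t <= toMs &&
      (!q.agent || e.agent === q.agent) &&
      (!q.level || e.level === q.level) &&
      (!keyword || e.message.toLowerCase().indexOf(keyword) !== -1)
    );
  });
  const page = matches.slice(0, q.limit ?? 100);
  // count: entries in this page; totalCount: all entries matching the filters.
  return { entries: page, count: page.length, totalCount: matches.length };
}

const sampleLogs: LogEntry[] = [
  { timestamp: "2026-03-28T10:00:00Z", agent: "lt-1", level: "info", message: "Spawned agent-a" },
  { timestamp: "2026-03-28T10:05:00Z", agent: "agent-a", level: "error", message: "Build failed" },
  { timestamp: "2026-03-29T09:00:00Z", agent: "agent-a", level: "info", message: "Build passed" },
];
```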

reef_checkpoint tool:
- Snapshots VM via vers_vm_commit, records commit in vm_tree
- Emits checkpoint signal to parent with commitId and message

reef_resource_spawn tool:
- Spawns bare metal VM from golden/base image
- Registers as resource_vm in vm_tree, returns SSH address
- No agent stack, no punkin — just a Linux box

Inbox behavior timer:
- Signals service polls inbox every 10 seconds
- Emits reef:signal:done/failed/blocked events on the extension bus
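
A single tick of the inbox timer can be sketched as below. The `InboxSignal` shape, the `pollInbox` helper, and the use of a `seen` set to avoid re-emitting are assumptions; only the 10-second interval and the `reef:signal:done/failed/blocked` event names come from this commit.

```typescript
type SignalType = "done" | "blocked" | "failed" | "progress" | "need-resources" | "checkpoint";

// Hypothetical inbox row.
interface InboxSignal {
  id: number;
  from: string;
  type: SignalType;
}

// Only these signal types are surfaced as bus events by the timer.
const EMITTED_TYPES: ReadonlyArray<SignalType> = ["done", "failed", "blocked"];

function pollInbox(
  inbox: InboxSignal[],
  seen: Set<number>,
  emit: (event: string, signal: InboxSignal) => void,
): number {
  let emitted = 0;
  for (const signal of inbox) {
    if (seen.has(signal.id)) continue; // already handled on an earlier tick
    seen.add(signal.id);
    if (EMITTED_TYPES.indexOf(signal.type) !== -1) {
      emit(`reef:signal:${signal.type}`, signal);
      emitted++;
    }
  }
  return emitted;
}

// In a running service this would be driven by a timer, for example:
// setInterval(() => pollInbox(fetchInbox(), seen, bus.emit), 10000);
// where fetchInbox and bus are placeholders for the real inbox read and event bus.
```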

AGENTS.md system prompt loading:
- Root reads AGENTS.md on fresh tree via readParentAgentsMd()
- Workers get --system-prompt flag pointing to inherited AGENTS.md
- Lieutenant uses AGENTS.md when v2 agentsMd path provided
- Model ID fix: claude-haiku-4-5 → claude-haiku-4-5-20251001

Provider cleanup:
- Removed anthropic fallback — vers is the only provider
- LLM_PROXY_KEY required for all spawns
- Error message directs users to add credits at vers.sh

buildChildAgentsMd always prepends "## Context from <parent-name>"
regardless of whether the context string starts with "##". This ensures
the context chain is always traceable and follows the spec format.

Fix 1 — VERS_AGENT_DIRECTIVE:
- Swarm buildWorkerEnv exports VERS_AGENT_DIRECTIVE when directive provided
- Lieutenant buildRemoteEnv exports VERS_AGENT_DIRECTIVE when directive provided
- reef_agent_spawn passes directive through spawn API to env injection
- SpawnParams, routes, and tool all pass directive end-to-end
- Verified: agent reads directive via bash, reports back correctly

Fix 2 — Remove v1 env vars (clean break):
- Removed REEF_CHILD_AGENT='true' from swarm and lieutenant spawn flows
- Removed VERS_AGENT_ROLE='lieutenant'/'worker' from spawn flows
- Only REEF_CATEGORY is set now
- Category passed through buildWorkerEnv for agent_vm vs swarm_vm
- Updated test to expect REEF_CATEGORY instead of VERS_AGENT_ROLE

…rcement

Items 3-7 from spec drift audit:
- Effort/thinkingLevel wired through set_model RPC (root=high, swarm/lt passthrough)
- Baseline snapshot (versClient.commit after spawn) + completion agent_event
- Auto-trigger root task on failed/blocked signals (POST /reef/submit)
- Grants enforcement — github repo/profile scope, store key namespacing
- FleetClient sends X-Reef-Agent-Name/Category/VM-ID headers on all API calls
- Server-side store namespace enforcement in PUT/DELETE routes (not just client-side)

The behavior timer polls reef_inbox every 10 seconds (code: signals/index.ts
setInterval 10_000), not 30 seconds as previously documented.

…, orphan sweep

All 4 spawn paths (swarm_vm, agent_vm, lieutenant, resource_vm) now:
- Register in vm_tree with status "creating" immediately after VM creation
- Clean up leaked VMs on failure (delete VM + mark vm_tree error)
- Validate AGENTS.md and env vars via SSH read-back after injection
- Return structured SpawnResult with per-agent ok/error and step name

Orphan cleanup: 5-minute sweep for VMs stuck in "creating" status,
exposed as POST /swarm/orphan-cleanup for manual triggering.
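
The orphan sweep can be sketched as a filter over vm_tree rows followed by cleanup. The `VmRow` shape and helper names are hypothetical, and reading "5-minute sweep" as "a VM counts as stuck after 5 minutes in creating" is an assumption; it could equally describe the sweep interval.

```typescript
// Hypothetical vm_tree row subset.
interface VmRow {
  vmId: string;
  status: "creating" | "running" | "stopped" | "error";
  createdAt: number; // ms timestamp
}

const STUCK_AFTER_MS = 5 * 60 * 1000; // assumed meaning of "5-minute sweep"

function findOrphans(rows: VmRow[], now: number, stuckAfterMs: number = STUCK_AFTER_MS): VmRow[] {
  return rows.filter((r) => r.status === "creating" && now - r.createdAt >= stuckAfterMs);
}

// Deletes each stuck VM via the supplied callback and marks its row as error,
// mirroring the "delete VM + mark vm_tree error" cleanup described above.
function sweepOrphans(rows: VmRow[], now: number, deleteVm: (vmId: string) => void): string[] {
  const swept: string[] = [];
  for (const row of findOrphans(rows, now)) {
    deleteVm(row.vmId);
    row.status = "error";
    swept.push(row.vmId);
  }
  return swept;
}
```

Registering the row as "creating" before any further setup is what makes this sweep possible: a spawn that dies midway still leaves a row the sweep can find.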

SwarmRuntime and LieutenantRuntime now accept vmTreeStore for direct
SQLite access instead of relying on async event handlers.

… forwarding

- Add signals, logs, swarm, cron to LIVE_REFRESH_PANELS (tabs now auto-refresh)
- Remove store from SKIP_PANELS (store tab now appears in dashboard)
- Forward GITHUB_TOKEN env var to spawned agents (swarm + lieutenant paths)
- Remove v1 tabs: registry, lieutenant, swarm, commits, docs, services
- vm-tree tab renamed to "fleet" (the single fleet view)
- v2 tabs: fleet, signals, logs, store, github, cron (sorted)
- ALL panels auto-refresh every 5s (no whitelist, no stale state)
- Remove LIVE_REFRESH_PANELS whitelist — everything is live

v2 clean break:
- Remove v1 tabs: registry, lieutenant, swarm, docs, services
- Keep: fleet (vm-tree), signals, logs, store, commits, github, cron
- Tabs sorted in logical order with friendly labels
- ALL panels auto-refresh every 2s (no whitelist)
- Sticky table headers + scrollable panel area for long content

GitHub token access will be handled via the reef github service
(vers GitHub App) instead of baking PATs into env vars.

Reef now only uses the vers provider. Removed resolveRootProvider(),
maybeFallbackToAnthropic(), and all ANTHROPIC_API_KEY / REEF_MODEL_PROVIDER
propagation from lieutenant RPC, swarm worker env, and persist-keys scripts.
Credit exhaustion fails directly instead of switching providers.

pranavpatilsce merged commit f3faaa1 into main on Apr 6, 2026
1 check passed