Reef v2 orchestration control plane and operator UX #26
Merged
pranavpatilsce merged 35 commits into main, Apr 6, 2026
… permissions

Slice 1 of the reef v2 orchestration spec. Core infrastructure changes:

Unified SQLite schema (services/vm-tree/store.ts):
- Single fleet.sqlite replaces registry.sqlite, vms.sqlite, lieutenants.sqlite
- 7 tables: vm_tree, signals, agent_events, logs, store, store_history
- vm_tree has full v2 identity: category, context, directive, model, effort, grants, RPC state, snapshots, rewind lineage
- Name uniqueness enforcement among active VMs
- Fleet status live query

Signals service (services/signals/):
- Bidirectional: upward signals (done, blocked, failed, progress, need-resources, checkpoint) + downward commands (abort, pause, resume, steer)
- Tools: reef_signal (send up), reef_command (send down), reef_inbox (unified inbox with direction/type/from filters, auto-acknowledge on read)
- Event bus integration, debug panel

Store migration (services/store/):
- JSON file (data/store.json) → SQLite store + store_history tables
- Every write versioned in store_history with VM lineage tracking
- Auto-migrates from JSON on first init
- Adds GET /:key/history route

Category-based permissions (src/extension.ts):
- Replaces binary REEF_CHILD_AGENT flag with REEF_CATEGORY-based service selection
- infra_vm=all, lieutenant=7 services, agent_vm=5, swarm_vm=5, resource_vm=none
- Backward compat: old env vars still resolve correctly

Spawn flow updates:
- Lieutenant and swarm spawns inject REEF_CATEGORY, VERS_AGENT_NAME, REEF_PARENT_VM_ID, REEF_ROOT_VM_ID
- V1 env vars kept for backward compat during transition

Base AGENTS.md:
- Replaced with v2 universal AGENTS.md: covers tools, signals, operating principles, behavioral rules, model selection, result reporting

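To make the signals model above concrete, here is a minimal sketch of the upward/downward message types and an inbox read with direction/type/from filters and auto-acknowledge on read. The signal and command names come from the commit message; the `InboxMessage` and `InboxFilter` shapes and the `readInbox` function are hypothetical and do not reflect the actual code in services/signals/.

```typescript
// Signal and command names are from the commit message.
type UpSignal = "done" | "blocked" | "failed" | "progress" | "need-resources" | "checkpoint";
type DownCommand = "abort" | "pause" | "resume" | "steer";

// Hypothetical in-memory shape of an inbox row (the real store is SQLite).
interface InboxMessage {
  id: number;
  direction: "up" | "down";
  type: UpSignal | DownCommand;
  from: string;
  acknowledged: boolean;
}

// Hypothetical filter mirroring the direction/type/from filters of reef_inbox.
interface InboxFilter {
  direction?: "up" | "down";
  type?: UpSignal | DownCommand;
  from?: string;
}

// Returns unacknowledged messages that match the filter and marks them
// acknowledged, so a second read with the same filter returns nothing.
function readInbox(inbox: InboxMessage[], filter: InboxFilter = {}): InboxMessage[] {
  const matches = inbox.filter(
    (m) =>
      !m.acknowledged &&
      (filter.direction === undefined || m.direction === filter.direction) &&
      (filter.type === undefined || m.type === filter.type) &&
      (filter.from === undefined || m.from === filter.from),
  );
  for (const m of matches) m.acknowledged = true;
  return matches;
}
```

Acknowledging only the messages that matched the filter means a narrow read (for example, only `blocked` signals) does not silently consume unrelated messages.
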
… anthropic

Provider selection across root, lieutenant, and swarm spawn:
- If LLM_PROXY_KEY exists → use vers provider (preferred)
- If ANTHROPIC_API_KEY starts with sk-ant- → use anthropic provider (fallback)
- Default → vers

ANTHROPIC_API_KEY propagation to workers:
- Prefer direct anthropic key (sk-ant-*) if no vers proxy key exists
- Fallback chain: vers proxy key → direct anthropic key → LLM_PROXY_KEY

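The provider selection rules in this commit can be sketched as a small pure function. The function name `selectProvider` and the `Env` type are hypothetical; only the three rules themselves come from the commit message. Note that a later commit in this PR removes the anthropic fallback entirely.

```typescript
type Provider = "vers" | "anthropic";

// Hypothetical stand-in for process.env so the sketch stays self-contained.
type Env = Record<string, string | undefined>;

// Applies the rules in order: vers proxy key first, then a direct
// Anthropic key (recognized by its sk-ant- prefix), otherwise vers.
function selectProvider(env: Env): Provider {
  if (env.LLM_PROXY_KEY) return "vers";
  if (env.ANTHROPIC_API_KEY?.startsWith("sk-ant-")) return "anthropic";
  return "vers";
}
```

Taking the environment as a parameter rather than reading it globally keeps the rule order easy to test in isolation.
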
… + bug fixes

AGENTS.md inheritance (src/core/agents-md.ts):
- Shared utility: readParentAgentsMd, buildChildAgentsMd, buildAgentsMdWriteScript
- Swarm and lieutenant spawn flows copy parent AGENTS.md to child VM
- Context appended as "## Context from <parent-name>" section
- Context flows through spawn params → event → vm_tree

reef_agent_spawn tool:
- Spawns autonomous agent VM with task (required), context, directive, model
- Category set to agent_vm from the start (no PATCH race)
- Task auto-sent on spawn: agents start working immediately

reef_fleet_status tool:
- Live view of direct children: name, category, status, model, last signal, context
- Fleet-wide summary (alive VMs, categories)

Bug fixes:
- Categories: category flows through spawn → event → vm_tree (no more swapped labels)
- Status lifecycle: swarm:agent_spawned fires before swarm:agent_ready (row exists for update)
- Status on done/failed: signals service updates sender vm_tree status to stopped
- Stale signals: old signals auto-acknowledged when spawning agent with reused name
- Spawn guard: accepts ANTHROPIC_API_KEY as alternative to LLM_PROXY_KEY
- Provider auto-detect: vers preferred, anthropic fallback (root + lieutenant + swarm)

Logs service (services/logs/):
- reef_log tool: write structured log entries (level, category, message, metadata)
- reef_logs tool: read own or another agent's logs (cross-agent for handoff)
- Auto-logging: registerBehaviors taps tool_call and tool_result RPC events
- Routes: POST /logs, GET /logs (with filters), GET /_panel
- All agents get logs service in extension filtering

reef_checkpoint tool:
- Snapshots VM via vers_vm_commit, records commit in vm_tree
- Emits checkpoint signal to parent with commitId and message

reef_resource_spawn tool:
- Spawns bare metal VM from golden/base image
- Registers as resource_vm in vm_tree, returns SSH address
- No agent stack, no punkin: just a Linux box

Inbox behavior timer:
- Signals service polls inbox every 10 seconds
- Emits reef:signal:done/failed/blocked events on the extension bus

AGENTS.md system prompt loading:
- Root reads AGENTS.md on fresh tree via readParentAgentsMd()
- Workers get --system-prompt flag pointing to inherited AGENTS.md
- Lieutenant uses AGENTS.md when v2 agentsMd path provided
- Model ID fix: claude-haiku-4-5 → claude-haiku-4-5-20251001

Provider cleanup:
- Removed anthropic fallback; vers is the only provider
- LLM_PROXY_KEY required for all spawns
- Error message directs users to add credits at vers.sh

buildChildAgentsMd always prepends "## Context from <parent-name>" regardless of whether the context string starts with "##". This ensures the context chain is always traceable and follows the spec format.
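
The behavior described above might look roughly like the following sketch. The function name matches the commit message, but the signature, the whitespace handling, and the choice to return the parent text unchanged when no context is given are assumptions, not the actual code in src/core/agents-md.ts.

```typescript
// Sketch only: the parameter order and the empty-context behavior are assumed.
function buildChildAgentsMd(parentAgentsMd: string, parentName: string, context?: string): string {
  // Assumption: with no context there is nothing to append.
  if (!context || context.trim() === "") return parentAgentsMd;

  const base = parentAgentsMd.trimEnd();
  // The header is always added, even when the context itself starts
  // with "##", so the chain of parents stays traceable.
  return `${base}\n\n## Context from ${parentName}\n\n${context.trim()}\n`;
}
```

With this shape, a context string that already begins with its own `##` heading ends up nested under the `## Context from <parent-name>` header instead of replacing it.
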
Fix 1 (VERS_AGENT_DIRECTIVE):
- Swarm buildWorkerEnv exports VERS_AGENT_DIRECTIVE when directive provided
- Lieutenant buildRemoteEnv exports VERS_AGENT_DIRECTIVE when directive provided
- reef_agent_spawn passes directive through spawn API to env injection
- SpawnParams, routes, and tool all pass directive end-to-end
- Verified: agent reads directive via bash, reports back correctly

Fix 2 (remove v1 env vars, clean break):
- Removed REEF_CHILD_AGENT='true' from swarm and lieutenant spawn flows
- Removed VERS_AGENT_ROLE='lieutenant'/'worker' from spawn flows
- Only REEF_CATEGORY is set now
- Category passed through buildWorkerEnv for agent_vm vs swarm_vm
- Updated test to expect REEF_CATEGORY instead of VERS_AGENT_ROLE

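As an illustration of the env injection after both fixes, the sketch below builds the variables for a spawned agent. The environment variable names are taken from the commit messages in this PR; the `SpawnIdentity` type and the `buildEnvSketch` function are hypothetical and are not the real `buildWorkerEnv` or `buildRemoteEnv`.

```typescript
type Category = "infra_vm" | "lieutenant" | "agent_vm" | "swarm_vm" | "resource_vm";

// Hypothetical input shape for illustration.
interface SpawnIdentity {
  name: string;
  category: Category;
  parentVmId: string;
  rootVmId: string;
  directive?: string;
}

// Builds the v2 env vars. The v1 variables (REEF_CHILD_AGENT,
// VERS_AGENT_ROLE) are intentionally never set.
function buildEnvSketch(id: SpawnIdentity): Record<string, string> {
  const env: Record<string, string> = {
    REEF_CATEGORY: id.category,
    VERS_AGENT_NAME: id.name,
    REEF_PARENT_VM_ID: id.parentVmId,
    REEF_ROOT_VM_ID: id.rootVmId,
  };
  // The directive is exported only when one was provided.
  if (id.directive) env.VERS_AGENT_DIRECTIVE = id.directive;
  return env;
}
```
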
…rcement

Items 3-7 from spec drift audit:
- Effort/thinkingLevel wired through set_model RPC (root=high, swarm/lt passthrough)
- Baseline snapshot (versClient.commit after spawn) + completion agent_event
- Auto-trigger root task on failed/blocked signals (POST /reef/submit)
- Grants enforcement: github repo/profile scope, store key namespacing
- FleetClient sends X-Reef-Agent-Name/Category/VM-ID headers on all API calls
- Server-side store namespace enforcement in PUT/DELETE routes (not just client-side)

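A minimal sketch of the identity headers FleetClient attaches to API calls is shown below. The commit message abbreviates the header names as "X-Reef-Agent-Name/Category/VM-ID"; expanding that to three headers sharing the `X-Reef-Agent-` prefix is an assumption, as are the `AgentIdentity` type and the `identityHeaders` function.

```typescript
// Hypothetical identity shape for illustration.
interface AgentIdentity {
  name: string;
  category: string;
  vmId: string;
}

// Assumed expansion of the abbreviated header list in the commit
// message; the exact header spellings in the real client may differ.
function identityHeaders(id: AgentIdentity): Record<string, string> {
  return {
    "X-Reef-Agent-Name": id.name,
    "X-Reef-Agent-Category": id.category,
    "X-Reef-Agent-VM-ID": id.vmId,
  };
}
```

Sending identity on every request is what lets the server enforce store namespaces in the PUT/DELETE routes rather than trusting client-side checks.
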
The behavior timer polls reef_inbox every 10 seconds (code: signals/index.ts setInterval 10_000), not 30 seconds as previously documented.
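
The polling loop can be sketched as follows. The 10-second interval and the `reef:signal:done/failed/blocked` event names come from the commit messages; the `InboxSignal` shape, the `Emit` callback, and both function names are hypothetical, and the real logic in signals/index.ts may be structured differently.

```typescript
const POLL_INTERVAL_MS = 10_000; // 10 seconds, per signals/index.ts

// Hypothetical shapes for illustration.
interface InboxSignal {
  type: string;
  from: string;
}
type Emit = (event: string, payload: InboxSignal) => void;

// One polling pass: read the inbox and emit a bus event for each
// done/failed/blocked signal. Returns the number of events emitted.
function pollOnce(readSignals: () => InboxSignal[], emit: Emit): number {
  let emitted = 0;
  for (const s of readSignals()) {
    if (s.type === "done" || s.type === "failed" || s.type === "blocked") {
      emit(`reef:signal:${s.type}`, s);
      emitted++;
    }
  }
  return emitted;
}

// Runs pollOnce on a fixed interval; the caller keeps the handle
// so the timer can be cleared on shutdown.
function startInboxTimer(readSignals: () => InboxSignal[], emit: Emit): ReturnType<typeof setInterval> {
  return setInterval(() => pollOnce(readSignals, emit), POLL_INTERVAL_MS);
}
```
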
…, orphan sweep

All 4 spawn paths (swarm_vm, agent_vm, lieutenant, resource_vm) now:
- Register in vm_tree with status "creating" immediately after VM creation
- Clean up leaked VMs on failure (delete VM + mark vm_tree error)
- Validate AGENTS.md and env vars via SSH read-back after injection
- Return structured SpawnResult with per-agent ok/error and step name

Orphan cleanup: 5-minute sweep for VMs stuck in "creating" status, exposed as POST /swarm/orphan-cleanup for manual triggering.

SwarmRuntime and LieutenantRuntime now accept vmTreeStore for direct SQLite access instead of relying on async event handlers.

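The sketch below illustrates one possible shape for the structured spawn result and the orphan selection step. `SpawnResult` is named in the commit message, but its fields here are guesses based on "per-agent ok/error and step name". The `VmRow` type, the `findOrphans` function, and the reading of "5-minute sweep" as an age threshold (rather than only a sweep interval) are all assumptions.

```typescript
// Assumed fields, inferred from "per-agent ok/error and step name".
interface SpawnResult {
  name: string;
  ok: boolean;
  vmId?: string;
  error?: string;
  step?: string; // name of the step that failed, if any
}

// Hypothetical projection of a vm_tree row.
interface VmRow {
  vmId: string;
  status: "creating" | "running" | "stopped" | "error";
  createdAt: number; // epoch milliseconds
}

// Assumption: a VM counts as orphaned once it has been stuck in
// "creating" for at least five minutes.
const ORPHAN_AGE_MS = 5 * 60 * 1000;

function findOrphans(rows: VmRow[], now: number): VmRow[] {
  return rows.filter((r) => r.status === "creating" && now - r.createdAt >= ORPHAN_AGE_MS);
}
```

Registering the row as "creating" before any further setup is what makes this sweep possible: a spawn that dies midway still leaves a row the sweep can find.
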
… forwarding

- Add signals, logs, swarm, cron to LIVE_REFRESH_PANELS (tabs now auto-refresh)
- Remove store from SKIP_PANELS (store tab now appears in dashboard)
- Forward GITHUB_TOKEN env var to spawned agents (swarm + lieutenant paths)

- Remove v1 tabs: registry, lieutenant, swarm, commits, docs, services
- vm-tree tab renamed to "fleet" (the single fleet view)
- v2 tabs: fleet, signals, logs, store, github, cron (sorted)
- ALL panels auto-refresh every 5s (no whitelist, no stale state)
- Remove LIVE_REFRESH_PANELS whitelist: everything is live

v2 clean break:
- Remove v1 tabs: registry, lieutenant, swarm, docs, services
- Keep: fleet (vm-tree), signals, logs, store, commits, github, cron
- Tabs sorted in logical order with friendly labels
- ALL panels auto-refresh every 2s (no whitelist)
- Sticky table headers + scrollable panel area for long content

GitHub token access will be handled via the reef github service (vers GitHub App) instead of baking PATs into env vars.
Reef now only uses the vers provider. Removed resolveRootProvider(), maybeFallbackToAnthropic(), and all ANTHROPIC_API_KEY / REEF_MODEL_PROVIDER propagation from lieutenant RPC, swarm worker env, and persist-keys scripts. Credit exhaustion fails directly instead of switching providers.
Summary
This PR lands the Reef v2 orchestration control plane and operator UX.
At a product level, it turns Reef from a thin chat wrapper into an active fleet supervisor with:
- vm-tree-backed source of truth

This PR also includes the latest runtime and prompt-contract tightening needed to make repo-oriented implementation safer:
What changed
Fleet state and orchestration model
- vm-tree
- infra_vm, lieutenant, agent_vm, swarm_vm, resource_vm
- resource_vm infrastructure visible in active views even when its parent subtree becomes historical

Communication and coordination
- reef_signal upward
- reef_command downward
- reef_peer_signal for same-parent sibling coordination
- reef_store_list for discovery across namespaced keys
- reef_store_wait for barriers, waits, and synchronization
- reef_schedule_check, reef_scheduled, reef_cancel_scheduled

Runtime and lifecycle behavior
- resource_vm protected-by-default and out of token/cost accounting

Usage, logs, and observability
- resource_vm at zero usage
- count and totalCount
- reef world state section above the composer

Root deployment safety and service runtime
- reef_git_prepare for real repo work
- /services/deploy diagnostics so path and service-root mistakes are legible

AGENTS and skills
- AGENTS.md reduced to the environment contract, with procedures moved into skills

Why this matters
Before this change, Reef had split control-plane concepts and too much orchestration state lived in fragile conventions.
After this change:
Validation
This branch was exercised with repeated local tests and private live reprovisions.
Key local coverage
- tests/authority.test.ts
- tests/lieutenant.test.ts
- tests/swarm-runtime.test.ts
- tests/usage.test.ts
- tests/store.test.ts
- tests/scheduled.test.ts
- tests/logs-search.test.ts
- tests/probe.test.ts
- tests/vm-tree-history.test.ts
- src/reef.test.ts
- tests/github.test.ts
- services/services/services.test.ts

Key live validations
Live runs confirmed:
- running
- done
- resource_vm behavior matches policy

Follow-ups / non-blocking issues
Cross-repo context
This PR is part of a coordinated stack with:
- hdresearch/vers-fleets branch reef-v2-orchestration
- hdresearch/pi-vers branch remote-bg-process-hardening

Those PRs should be reviewed alongside this one because they provide the provisioning and runtime substrate Reef depends on.