feat: Issue #18 Phase 1 — autonomous agent orchestration baseline#50
Conversation
Deliver EPIC1 traced execution, EPIC3 tool policy/sandbox, and EPIC4 quality gates so agent workflows can run with traceability, bounded tool risk, and CI-style merge routing before shipping changes. Closes #22 Co-authored-by: Cursor <cursoragent@cursor.com>
stevei101
left a comment
There was a problem hiding this comment.
Code review
Overview
Phase 1 of the autonomous-agent orchestration roadmap (#18). Three new modules — execution (RunContext, TransitionLog, TracedRunner, ReplayRunner, StateValidator), tools (ToolPolicyEngine, SubprocessSandbox, ToolNodeConfig), guardrails (QualityGateNode, ReviewFinding, RiskClassifier, merge routing) — plus an example, docs, and a cargo test CI job. 2007 LOC, 22 files, 107 + 9 tests.
Solid baseline. The shape (per-module types, clean re-exports through prelude, dependency injection via traits for the command runner / sandbox executor) is sound and matches the existing project style. Issues below are mostly hygiene + a few real bugs.
Correctness — real bugs
-
truncate_logwill panic on multi-byte UTF-8.src/guardrails/gate.rs:fn truncate_log(s: &str, max: usize) -> String { if s.len() <= max { s.to_string() } else { format!("{}…", &s[..max]) } }
&s[..max]panics ifmaxlands inside a multi-byte char. Tool output (especiallycargowith emoji/colors) routinely contains UTF-8. Usechars().take(max).collect::<String>()orfloor_char_boundary. -
Policy
blocked_patternsbypassed forRequireApprovaltools.src/tools/policy.rs::evaluatechecksapproval_requiredbeforeblocked_patterns. So an approval-requiredshelltool whose command containssudoorrm -rf /will route through approval and (presumably) execute on human OK — but the human shouldn't even see those. Reorder: deny → blocked_patterns → approval_required → allow → fail-closed. -
QualityGateConfig.block_on_failureis dead. Set inrust_defaults()and the example, never read byQualityGateNode::execute. The node routes purely offmerge_blocker.blocked. Either honour the flag or drop the field. -
ReviewFinding::error(...).at("Cargo.toml", 1)is a fake location.src/guardrails/gate.rs::run_checksalways tags failed-check findings asCargo.toml:1regardless of which check failed. Misleading in PR comments. Drop the.at(...)call when there's no real location, or parse it out of the check output. -
ReplayRunnerdoesn't replay — it compares. Naming oversells.ReplayRunner::compare(expected, actual)is a log-diff. The actual primitive needed to "validate deterministic replay" is to re-run the graph from a recorded log + initial state and check the produced log matches. Consider renaming toTransitionLogDiffor wiring up an actual re-execution path (the stuboutput_from_kindhints at the intent but is dead-code).
Correctness — minor
-
Iteration counter type mismatch.
TracedRunner::run_loopusesiterations: u32;state.iterationisusize;TransitionRecordhas both (iteration: u32,state_iteration: usize). Pick one, propagate. -
resolve_nextsilently routesContinue(None)to END but warns on missingTransition(key)edge. Inconsistent. Either both should warn, or document whyContinue(None)ending the graph is silent-by-design. -
Policy
command_hintextraction assumesarguments["command"]. Many shell tools usescript,cmd,bash_cmd. Document the convention or take the arg name as a config field on the policy. -
SubprocessSandbox::validate_working_diruses blockingstd::fs::canonicalizeinside async. Tokio runtime won't deadlock for short paths but it's a smell — usetokio::fs::canonicalizeorspawn_blocking.
Architecture / convention
-
src/nodes/quality_gate.rsis just a re-export ofsrc/guardrails. Two import paths for the same types (crate::nodes::QualityGateNodeandcrate::guardrails::QualityGateNode). Pick one canonical location. Theguardrailshome reads better — built-in node modules become discoverable vianodesonly when they don't have a richer home. -
SubprocessSandboxis a subprocess runner with a timeout, not a sandbox. It does:- Timeout ✅
- cwd allowlist ✅ (but default = empty = any cwd)
sh -c "$cmd"— no fs/process/network isolation, classic shell-injection surface
Either rename to
SubprocessRunneror be explicit in the docstring that this is the absolute minimum and a "real" sandbox needs namespaces/seccomp/etc. The current name suggests a security boundary it doesn't provide. -
SandboxConfig::default()allows any cwd. A sandbox that defaults to "no cwd restriction" is barely a sandbox. Either tighten the default or rename the constructor topermissive. -
Asymmetric
ReviewFindingconstructors. Only::error. Add::infoand::warningfor symmetry, otherwise people will reach for struct-literal syntax which couples to internal fields. -
use super::transition::TransitionLog;afterTracedRunResultdefinition. Stylistic — Rust allows it, but conventional placement is at the top with otheruses. The current order made me re-read to confirm scope.
Performance
QualityGateNode::run_checksruns checks sequentially. For independent gates (fmt, clippy, test),tokio::join!/buffer_unorderedwould parallelise. Big win for the canonical Rust defaults — fmt is ~1s, clippy 30s+, test 60s+; they're independent.
Test coverage
- No test for the policy bypass bug (#2).
- No test for
SubprocessSandbox::validate_working_dir. - No test for
StateValidator's schema-based path — only therequire_keypath is exercised. ReplayRunnertests only test the comparator, not full replay. Reinforces the naming concern (#5).
CI
cargo testjob has nopaths:filter. Will run on doc-only PRs. Minor.
Security
-
The big one is #11 — "Sandbox" branding. If autonomous agents in the wild end up trusting
SubprocessSandboxbecause its name suggests isolation, the project may inherit incidents. Either ship real isolation primitives (rootless containers, seccomp profile, network namespace) or pick a name that doesn't claim more than it delivers. -
Policy
RequireApprovalhas no approval primitive yet. The PR description claims "approval routing"; the actual flow returns an error message inToolResult. There's no human-in-the-loop handshake. Worth flagging indocs/ROADMAP-18.mdor PR body what "approval routing" means vs. what's deferred.
Priority for follow-up
- Fix #1 (UTF-8 panic) — real crash risk.
- Fix #2 (policy ordering) — security correctness.
- Fix #4 (Cargo.toml:1 fake location) — quality of agent outputs.
- Resolve #11 (Sandbox naming/scope) — sets expectations for downstream consumers.
- Either honour or remove #3 (
block_on_failure).
Nothing here blocks the Phase 1 baseline landing — happy to see the surface area shape up. Worth a fixup pass on the bugs before Phase 2 builds on it.
Summary
Implements Phase 1 of the autonomous AI agent orchestration roadmap (EPIC1, EPIC3, EPIC4 baseline):
execution—RunContext, append-onlyTransitionLog,TracedRunner,ReplayRunner,StateValidatorfor traceable, auditable runstools—ToolPolicyEngine(fail-closed, blocked patterns, approval routing),SubprocessSandbox(timeout + cwd roots),ToolNodeConfig(policy + per-tool timeout)guardrails—QualityGateNode,ReviewFindingschema,RiskClassifier, merge-blocker routing (passed/gate_failed/needs_approval)Also adds
autonomous_dev_workflowexample, docs/ROADMAP-18.md, README roadmap updates, and acargo testjob invalidate.ymlfor PRs touching Rust sources.Closes #22 (EPIC4 Code Quality Guardrails baseline).
Test plan
cargo test(107 unit + 9 integration tests)cargo run --example autonomous_dev_workflowrustjob on this PRMade with Cursor