feat: Issue #18 Phase 1 — autonomous agent orchestration baseline #50

Merged
stevei101 merged 1 commit into develop from feat/issue-18-autonomous-orchestration-phase1
Jun 3, 2026

@stevei101

Summary

Implements Phase 1 of the autonomous AI agent orchestration roadmap (EPIC1, EPIC3, EPIC4 baseline):

  • execution: RunContext, append-only TransitionLog, TracedRunner, ReplayRunner, StateValidator for traceable, auditable runs
  • tools: ToolPolicyEngine (fail-closed, blocked patterns, approval routing), SubprocessSandbox (timeout + cwd roots), ToolNodeConfig (policy + per-tool timeout)
  • guardrails: QualityGateNode, ReviewFinding schema, RiskClassifier, merge-blocker routing (passed / gate_failed / needs_approval)

Also adds autonomous_dev_workflow example, docs/ROADMAP-18.md, README roadmap updates, and a cargo test job in validate.yml for PRs touching Rust sources.

Closes #22 (EPIC4 Code Quality Guardrails baseline).

Test plan

  • cargo test (107 unit + 9 integration tests)
  • cargo run --example autonomous_dev_workflow
  • Verify CI rust job on this PR

Made with Cursor

Deliver EPIC1 traced execution, EPIC3 tool policy/sandbox, and EPIC4
quality gates so agent workflows can run with traceability, bounded tool
risk, and CI-style merge routing before shipping changes.

Closes #22

Co-authored-by: Cursor <cursoragent@cursor.com>

@stevei101 stevei101 left a comment


Code review

Overview

Phase 1 of the autonomous-agent orchestration roadmap (#18). Three new modules — execution (RunContext, TransitionLog, TracedRunner, ReplayRunner, StateValidator), tools (ToolPolicyEngine, SubprocessSandbox, ToolNodeConfig), guardrails (QualityGateNode, ReviewFinding, RiskClassifier, merge routing) — plus an example, docs, and a cargo test CI job. 2007 LOC, 22 files, 107 + 9 tests.

Solid baseline. The shape (per-module types, clean re-exports through prelude, dependency injection via traits for the command runner / sandbox executor) is sound and matches the existing project style. Issues below are mostly hygiene + a few real bugs.

Correctness — real bugs

  1. truncate_log will panic on multi-byte UTF-8. src/guardrails/gate.rs:

    fn truncate_log(s: &str, max: usize) -> String {
        if s.len() <= max { s.to_string() } else { format!("{}…", &s[..max]) }
    }

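    &s[..max] panics if max lands inside a multi-byte char, and tool output (cargo included) routinely contains non-ASCII UTF-8 such as emoji. Use chars().take(max).collect::<String>() (which turns max into a character count rather than a byte count) or back up to the nearest boundary with floor_char_boundary / is_char_boundary.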

  2. Policy blocked_patterns bypassed for RequireApproval tools. src/tools/policy.rs::evaluate checks approval_required before blocked_patterns. So an approval-required shell tool whose command contains sudo or rm -rf / will route through approval and (presumably) execute on human OK — but the human shouldn't even see those. Reorder: deny → blocked_patterns → approval_required → allow → fail-closed.

  3. QualityGateConfig.block_on_failure is dead. Set in rust_defaults() and the example, never read by QualityGateNode::execute. The node routes purely off merge_blocker.blocked. Either honour the flag or drop the field.

  4. ReviewFinding::error(...).at("Cargo.toml", 1) is a fake location. src/guardrails/gate.rs::run_checks always tags failed-check findings as Cargo.toml:1 regardless of which check failed. Misleading in PR comments. Drop the .at(...) call when there's no real location, or parse it out of the check output.

  5. ReplayRunner doesn't replay — it compares. Naming oversells. ReplayRunner::compare(expected, actual) is a log-diff. The actual primitive needed to "validate deterministic replay" is to re-run the graph from a recorded log + initial state and check the produced log matches. Consider renaming to TransitionLogDiff or wiring up an actual re-execution path (the stub output_from_kind hints at the intent but is dead-code).
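
For bug 1, a boundary-safe truncate_log could look roughly like the sketch below. It keeps max as a byte budget and walks back to the nearest char boundary with str::is_char_boundary. The signature mirrors the snippet quoted above; how the real function is called is an assumption.

```rust
/// Truncate `s` to at most `max` bytes without splitting a UTF-8 character.
/// Sketch only: same signature as the snippet in item 1, not the merged code.
fn truncate_log(s: &str, max: usize) -> String {
    if s.len() <= max {
        return s.to_string();
    }
    // `max < s.len()` here, so `end` is always a valid index to probe.
    // Index 0 is always a char boundary, so the loop terminates.
    let mut end = max;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}…", &s[..end])
}
```

With this version, a cut that lands inside a multi-byte character drops that character instead of panicking, so the result may be a few bytes shorter than max (plus the ellipsis).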

Correctness — minor

  1. Iteration counter type mismatch. TracedRunner::run_loop uses iterations: u32; state.iteration is usize; TransitionRecord has both (iteration: u32, state_iteration: usize). Pick one, propagate.

  2. resolve_next silently routes Continue(None) to END but warns on missing Transition(key) edge. Inconsistent. Either both should warn, or document why Continue(None) ending the graph is silent-by-design.

  3. Policy command_hint extraction assumes arguments["command"]. Many shell tools use script, cmd, bash_cmd. Document the convention or take the arg name as a config field on the policy.

  4. SubprocessSandbox::validate_working_dir uses blocking std::fs::canonicalize inside async. It won't deadlock the Tokio runtime, but the syscall blocks a worker thread for its duration, which is a smell. Use tokio::fs::canonicalize or spawn_blocking.
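
For item 3 in this list, one way to make the argument name configurable is sketched below. CommandHintConfig and arg_names are hypothetical names, and a HashMap<String, String> stands in for whatever type the real tool arguments use; only the default key "command" comes from the current behaviour.

```rust
use std::collections::HashMap;

/// Hypothetical policy setting: which argument keys may carry the shell command.
struct CommandHintConfig {
    arg_names: Vec<String>,
}

impl Default for CommandHintConfig {
    /// Matches the behaviour described in the review: only "command" is checked.
    fn default() -> Self {
        Self { arg_names: vec!["command".to_string()] }
    }
}

/// Return the first configured argument that is present, in config order.
/// `arguments` is a stand-in for the real tool-call argument type.
fn command_hint<'a>(
    cfg: &CommandHintConfig,
    arguments: &'a HashMap<String, String>,
) -> Option<&'a str> {
    cfg.arg_names
        .iter()
        .find_map(|name| arguments.get(name).map(String::as_str))
}
```

A tool that passes its command under script would then be covered by listing both "command" and "script" in arg_names. Whether a missing hint should fail closed is a separate policy decision this sketch does not make.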

Architecture / convention

  1. src/nodes/quality_gate.rs is just a re-export of src/guardrails. Two import paths for the same types (crate::nodes::QualityGateNode and crate::guardrails::QualityGateNode). Pick one canonical location. The guardrails home reads better — built-in node modules become discoverable via nodes only when they don't have a richer home.

  2. SubprocessSandbox is a subprocess runner with a timeout, not a sandbox. It does:

    • Timeout ✅
    • cwd allowlist ✅ (but default = empty = any cwd)
    • sh -c "$cmd" — no fs/process/network isolation, classic shell-injection surface

    Either rename to SubprocessRunner or be explicit in the docstring that this is the absolute minimum and a "real" sandbox needs namespaces/seccomp/etc. The current name suggests a security boundary it doesn't provide.

  3. SandboxConfig::default() allows any cwd. A sandbox that defaults to "no cwd restriction" is barely a sandbox. Either tighten the default or rename the constructor to permissive.

  4. Asymmetric ReviewFinding constructors. Only ::error. Add ::info and ::warning for symmetry, otherwise people will reach for struct-literal syntax which couples to internal fields.

  5. use super::transition::TransitionLog; after TracedRunResult definition. Stylistic — Rust allows it, but conventional placement is at the top with other uses. The current order made me re-read to confirm scope.
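
For item 4 in this list, symmetric constructors might look like the sketch below. Only ::error and .at(file, line) are confirmed by the PR; the Severity enum and the field names are assumptions made for illustration. Keeping location optional also covers the fake Cargo.toml:1 location from the bugs list: a finding without a real location simply skips .at.

```rust
/// Hypothetical severity levels; the real schema may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Info,
    Warning,
    Error,
}

/// Sketch of a finding with an optional location (field names are assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
struct ReviewFinding {
    severity: Severity,
    message: String,
    location: Option<(String, u32)>,
}

impl ReviewFinding {
    fn new(severity: Severity, message: impl Into<String>) -> Self {
        Self { severity, message: message.into(), location: None }
    }

    pub fn info(message: impl Into<String>) -> Self {
        Self::new(Severity::Info, message)
    }

    pub fn warning(message: impl Into<String>) -> Self {
        Self::new(Severity::Warning, message)
    }

    pub fn error(message: impl Into<String>) -> Self {
        Self::new(Severity::Error, message)
    }

    /// Attach a location only when one is actually known.
    pub fn at(mut self, file: impl Into<String>, line: u32) -> Self {
        self.location = Some((file.into(), line));
        self
    }
}
```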

Performance

  1. QualityGateNode::run_checks runs checks sequentially. For independent gates (fmt, clippy, test), tokio::join!/buffer_unordered would parallelise. Big win for the canonical Rust defaults — fmt is ~1s, clippy 30s+, test 60s+; they're independent.
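
To show the shape of a concurrent run_checks without pulling in an async runtime, the sketch below swaps tokio::join! for std::thread::scope; that substitution is mine, not the PR's. Each check is modelled as a closure returning pass/fail, and run_checks_concurrently is a hypothetical name.

```rust
use std::thread;

/// Run independent checks on separate threads and return results in input order.
/// A check that panics is reported as failed rather than aborting the gate.
fn run_checks_concurrently<F>(checks: Vec<(&str, F)>) -> Vec<(String, bool)>
where
    F: FnOnce() -> bool + Send,
{
    thread::scope(|scope| {
        // Spawn everything first so the checks overlap in time.
        let handles: Vec<_> = checks
            .into_iter()
            .map(|(name, check)| (name.to_string(), scope.spawn(check)))
            .collect();

        // Join in input order; total wall time is roughly the slowest check.
        handles
            .into_iter()
            .map(|(name, handle)| (name, handle.join().unwrap_or(false)))
            .collect()
    })
}
```

One caveat: concurrent cargo invocations that share a target directory may serialize on cargo's build lock, so the gain for clippy plus test could be smaller than the raw timings suggest. Worth measuring before committing to it.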

Test coverage

  1. No test for the policy bypass bug (real bug #2).
  2. No test for SubprocessSandbox::validate_working_dir.
  3. No test for StateValidator's schema-based path — only the require_key path is exercised.
  4. ReplayRunner tests only test the comparator, not full replay. Reinforces the naming concern (real bug #5).
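
A regression test for the policy bypass could be built on something like the sketch below, which implements the proposed order (deny, blocked_patterns, approval_required, allow, fail-closed). Policy, Decision, and the field names are hypothetical stand-ins, not the types in src/tools/policy.rs.

```rust
/// Hypothetical decision type for illustration.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny(String),
    RequireApproval,
}

/// Hypothetical policy shape; the real engine's fields may differ.
struct Policy {
    denied_tools: Vec<String>,
    blocked_patterns: Vec<String>,
    approval_required: Vec<String>,
    allowed_tools: Vec<String>,
}

fn lists(list: &[String], tool: &str) -> bool {
    list.iter().any(|t| t == tool)
}

impl Policy {
    /// Order: deny -> blocked_patterns -> approval_required -> allow -> fail-closed.
    fn evaluate(&self, tool: &str, command: &str) -> Decision {
        if lists(&self.denied_tools, tool) {
            return Decision::Deny(format!("tool `{tool}` is denied"));
        }
        // Checked before approval so a human is never asked to OK these.
        if let Some(p) = self.blocked_patterns.iter().find(|p| command.contains(p.as_str())) {
            return Decision::Deny(format!("blocked pattern `{p}`"));
        }
        if lists(&self.approval_required, tool) {
            return Decision::RequireApproval;
        }
        if lists(&self.allowed_tools, tool) {
            return Decision::Allow;
        }
        Decision::Deny("no matching rule (fail-closed)".to_string())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn blocked_pattern_wins_over_approval() {
        let policy = Policy {
            denied_tools: vec![],
            blocked_patterns: vec!["sudo".to_string()],
            approval_required: vec!["shell".to_string()],
            allowed_tools: vec![],
        };
        assert!(matches!(policy.evaluate("shell", "sudo reboot"), Decision::Deny(_)));
        assert_eq!(policy.evaluate("shell", "ls"), Decision::RequireApproval);
    }
}
```

The test fails against the ordering described in real bug #2, where approval_required is consulted first, and passes once blocked_patterns is checked ahead of it.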

CI

  1. cargo test job has no paths: filter. Will run on doc-only PRs. Minor.

Security

  1. The big one is the "Sandbox" branding (item 2 under Architecture / convention). If autonomous agents in the wild end up trusting SubprocessSandbox because its name suggests isolation, the project may inherit incidents. Either ship real isolation primitives (rootless containers, seccomp profile, network namespace) or pick a name that doesn't claim more than it delivers.

  2. Policy RequireApproval has no approval primitive yet. The PR description claims "approval routing"; the actual flow returns an error message in ToolResult. There's no human-in-the-loop handshake. Worth flagging in docs/ROADMAP-18.md or PR body what "approval routing" means vs. what's deferred.

Priority for follow-up

  1. Fix real bug #1 (UTF-8 panic): real crash risk.
  2. Fix real bug #2 (policy ordering): security correctness.
  3. Fix real bug #4 (Cargo.toml:1 fake location): quality of agent outputs.
  4. Resolve the Sandbox naming/scope (item 2 under Architecture / convention): sets expectations for downstream consumers.
  5. Either honour or remove block_on_failure (real bug #3).

Nothing here blocks the Phase 1 baseline landing — happy to see the surface area shape up. Worth a fixup pass on the bugs before Phase 2 builds on it.

@stevei101 stevei101 merged commit c5a5f0b into develop Jun 3, 2026
5 checks passed
@stevei101 stevei101 deleted the feat/issue-18-autonomous-orchestration-phase1 branch June 3, 2026 22:36