Cybernetic Agentic Swarm Management
Technical Map: CASM (Cybernetic HITL Swarm Management)
System goal
Operate three business lines (Templates, Sentinel Service, AgentForge SaaS) on a shared swarm runtime with:
Hourly heartbeat (control loop)
Daily diurnal optimization
HITL gating on risk
Full telemetry into economic KPIs
Architecture Diagram (High Level)
flowchart LR
  subgraph Users
    Dev[Developers/Clients]
    Admin[Operator/HITL]
  end
  subgraph Edge
    CF[Cloudflare: Pages/Workers/WAF/DNS]
  end
  subgraph ControlPlane[CASM Control Plane]
    API[API Gateway]
    ORCH[Swarm Orchestrator]
    ROUTER[Model Router]
    POLICY["Policy Engine<br/>(risk + compliance)"]
    HITL["HITL Gate<br/>(approve/override)"]
  end
  subgraph DataPlane[Execution Plane]
    GH[GitHub Repo/PRs]
    AGENTS["Agent Swarm<br/>(Reviewer/Fixer/Bench/Doc/Redteam)"]
    TOOLS["Tooling Runtime<br/>(tests, linters, scanners)"]
  end
  subgraph Models
    OAI[OpenAI]
    CLAUDE[Anthropic/Claude]
    KIMI[Kimi]
    HF["Hugging Face<br/>(models/repos)"]
  end
  subgraph Telemetry[Telemetry + Economics]
    LOGS[Event Logs]
    METRICS[Metrics Store]
    COST["Cost Engine<br/>(token + tool + infra)"]
    KPI["KPI Dashboard<br/>(MRR, margin, churn risk)"]
  end
  Dev --> CF --> API --> ORCH
  ORCH --> GH
  ORCH --> AGENTS --> TOOLS
  ROUTER --> OAI
  ROUTER --> CLAUDE
  ROUTER --> KIMI
  AGENTS --> HF
  ORCH --> POLICY --> HITL --> ORCH
  ORCH --> LOGS --> METRICS --> KPI
  ORCH --> COST --> KPI
  Admin --> HITL
Architecture Diagram (1-Hour Heartbeat Control Loop)
sequenceDiagram
  autonumber
  participant Timer as Heartbeat Timer (hourly)
  participant Or as Orchestrator
  participant Pol as Policy Engine
  participant Hitl as HITL Gate
  participant Gh as GitHub
  participant Rt as Model Router
  participant Ms as Models
  participant Tel as Telemetry/Cost
  Timer->>Or: Start cycle (repo set / client set)
  Or->>Gh: Pull latest commits/PR deltas
  Or->>Pol: Compute risk score + required gates
  Pol->>Hitl: If high risk → queue approvals
  alt HITL required
    Hitl-->>Or: Approve/deny/override
  end
  Or->>Rt: Route tasks (review/fix/bench/docs)
  Rt->>Ms: Execute multi-model swarm
  Or->>Gh: Post PR comments / open PR / attach reports
  Or->>Tel: Log tokens, tool-time, infra-time, outcomes
  Tel-->>Or: Return cost + effectiveness metrics
Cost Analysis per Token (and per Outcome)
Pricing changes; the durable way is to compute unit economics from measured usage and plug in current model rates.
- Core token cost model
Let:
Tin = input tokens
Tout = output tokens
Pin = $/1M input tokens
Pout = $/1M output tokens
LLM_cost = (Tin/1e6) * Pin + (Tout/1e6) * Pout
Now add agent overhead:
R = retry factor (e.g., 1.10 for 10% retries)
S = safety margin (e.g., 1.05)
Tool_cost = $ cost of non-LLM tools (CI minutes, scanners, etc.)
Infra_cost = Cloudflare/hosting/runtime amortized per run
Run_cost = (LLM_cost * R * S) + Tool_cost + Infra_cost
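As a minimal TypeScript sketch of the two formulas above (not the final cost engine): the interface and function names are illustrative, and the rates and overhead values are placeholders, not real provider prices.

```typescript
// Placeholder rates in $ per 1M tokens; plug in current provider pricing.
interface ModelRates {
  pricePerMillionIn: number; // Pin
  pricePerMillionOut: number; // Pout
}

interface RunOverheads {
  retryFactor: number; // R, e.g. 1.10 for 10% retries
  safetyMargin: number; // S, e.g. 1.05
  toolCostUsd: number; // Tool_cost
  infraCostUsd: number; // Infra_cost
}

// LLM_cost = (Tin/1e6) * Pin + (Tout/1e6) * Pout
function llmCostUsd(tin: number, tout: number, rates: ModelRates): number {
  return (tin / 1e6) * rates.pricePerMillionIn + (tout / 1e6) * rates.pricePerMillionOut;
}

// Run_cost = (LLM_cost * R * S) + Tool_cost + Infra_cost
function runCostUsd(llmCost: number, o: RunOverheads): number {
  return llmCost * o.retryFactor * o.safetyMargin + o.toolCostUsd + o.infraCostUsd;
}

// Worked example with made-up numbers: 200k input tokens, 50k output tokens.
const exampleRates: ModelRates = { pricePerMillionIn: 2, pricePerMillionOut: 8 };
const exampleLlmCost = llmCostUsd(200000, 50000, exampleRates); // 0.4 + 0.4
const exampleRunCost = runCostUsd(exampleLlmCost, {
  retryFactor: 1.1,
  safetyMargin: 1.05,
  toolCostUsd: 0.1,
  infraCostUsd: 0.02,
});
console.log(exampleLlmCost.toFixed(4), exampleRunCost.toFixed(4));
```

Keeping the rates in a plain object means a price change is a config edit, not a code change.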
- Cost per PR / per repo audit (what you actually sell)
Let:
Ncalls = number of model calls in the swarm run
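Run_cost_i = Run_cost attributed to call i (i = 1..Ncalls)
HITL_minutes = operator review time in minutes; Human_rate_per_min = operator cost in $ per minute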
Outcome_cost = Σ Run_cost_i + (HITL_minutes * Human_rate_per_min)
Then margin:
Gross_margin = 1 − (Outcome_cost / Price_charged)
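A short sketch of the outcome-level math, assuming per-call costs are already computed. The function names and all dollar figures are illustrative only.

```typescript
// Outcome_cost = Σ Run_cost_i + (HITL_minutes * Human_rate_per_min)
function outcomeCostUsd(runCosts: number[], hitlMinutes: number, humanRatePerMin: number): number {
  const swarmCost = runCosts.reduce((sum, c) => sum + c, 0);
  return swarmCost + hitlMinutes * humanRatePerMin;
}

// Gross_margin = 1 - (Outcome_cost / Price_charged)
function grossMargin(outcomeCost: number, priceCharged: number): number {
  if (priceCharged <= 0) {
    throw new Error("priceCharged must be positive");
  }
  return 1 - outcomeCost / priceCharged;
}

// Made-up example: three model calls plus 4 minutes of operator review at $1.50/min.
const sampleOutcomeCost = outcomeCostUsd([0.5, 0.25, 0.25], 4, 1.5); // 1.00 + 6.00
const sampleMargin = grossMargin(sampleOutcomeCost, 28);
console.log(sampleOutcomeCost, sampleMargin);
```

In this example the human minutes dominate the token spend, which is why the HITL gate should fire only on genuinely risky changes.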
- Practical telemetry fields to log every call
Store per model call:
model_id, Tin, Tout
latency_s
retries
task_type (review/fix/bench/doc/redteam)
route_reason (policy decision)
outcome score (pass/fail + quality score)
This makes token economics real-time and optimizable.
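One possible shape for that per-call record, sketched in TypeScript. The camelCase field names and the 0..1 range for the quality score are assumptions; only the list of fields comes from above.

```typescript
type TaskType = "review" | "fix" | "bench" | "doc" | "redteam";

interface ModelCallRecord {
  modelId: string;
  tin: number; // input tokens
  tout: number; // output tokens
  latencyS: number;
  retries: number;
  taskType: TaskType;
  routeReason: string; // policy decision that led to this model
  outcome: { passed: boolean; qualityScore: number }; // qualityScore assumed to be 0..1
}

interface TaskTotals {
  calls: number;
  tin: number;
  tout: number;
  passRate: number;
}

// Roll call records up per task type so token spend can be compared with outcomes.
function totalsByTask(records: ModelCallRecord[]): Record<string, TaskTotals> {
  const totals: Record<string, TaskTotals> = {};
  const passedCounts: Record<string, number> = {};
  for (const r of records) {
    const t = totals[r.taskType] ?? { calls: 0, tin: 0, tout: 0, passRate: 0 };
    const passed = (passedCounts[r.taskType] ?? 0) + (r.outcome.passed ? 1 : 0);
    t.calls += 1;
    t.tin += r.tin;
    t.tout += r.tout;
    t.passRate = passed / t.calls;
    totals[r.taskType] = t;
    passedCounts[r.taskType] = passed;
  }
  return totals;
}

const callTotals = totalsByTask([
  { modelId: "model-a", tin: 1000, tout: 200, latencyS: 2.1, retries: 0, taskType: "review", routeReason: "cheapest_meeting_policy", outcome: { passed: true, qualityScore: 0.9 } },
  { modelId: "model-a", tin: 500, tout: 100, latencyS: 1.4, retries: 1, taskType: "review", routeReason: "cheapest_meeting_policy", outcome: { passed: false, qualityScore: 0.3 } },
  { modelId: "model-b", tin: 2000, tout: 800, latencyS: 5.0, retries: 0, taskType: "fix", routeReason: "risk_requires_reasoning", outcome: { passed: true, qualityScore: 0.8 } },
]);
console.log(JSON.stringify(callTotals));
```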
Repo Structure (production-grade, swarm-first)
casm/
  README.md
  LICENSE
  SECURITY.md
  CODE_OF_CONDUCT.md
  docs/
    architecture/
      overview.md
      diagrams.mmd
      threat-model.md
      data-retention.md
    runbooks/
      heartbeat.md
      incident-response.md
      release.md
    economics/
      unit-economics.md
      pricing-model.md
  packages/
    orchestrator/      # core swarm scheduler + workflow engine
      src/
        orchestrator.ts
        workflow.ts
        queue.ts
        state.ts
      tests/
    router/            # model routing + fallbacks
      src/
        router.ts
        policies.ts
        cost-aware-routing.ts
    policy-engine/     # risk scoring, compliance gates, HITL triggers
      src/
        risk.ts
        gates.ts
      rules/
        default.yaml
    agents/            # swarm agents (task-specific)
      reviewer/
      fixer/
      benchmarker/
      docgen/
      redteam/
      cost-auditor/
    connectors/        # GitHub, HF, Cloudflare, model providers
      github/
      huggingface/
      cloudflare/
      openai/
      anthropic/
      kimi/
    telemetry/         # events, metrics, traces, cost engine
      src/
        events.ts
        metrics.ts
        cost-engine.ts
        kpi.ts
    hitl-console/      # lightweight operator UI / approval queue
      src/
      public/
  templates/           # monetizable swarm templates
    repo-auditor/
    perf-optimizer/
    compliance-pack/
    security-sentinel/
  services/
    sentinel/          # subscription “continuous code sentinel”
    agentforge/        # SaaS control center backend
  infra/
    cloudflare/
      workers/
      pages/
      wrangler.toml
    github/
      workflows/
        ci.yml
        sentinel.yml
        release.yml
  examples/
    demo-repo/
    sample-configs/
  scripts/
    bootstrap.sh
    seed-templates.ts
    cost-report.ts
Week 7 (Q1 2026) Actionable Build Plan (Engineering-first)
Deliverables by end of week
Swarm Orchestrator MVP
workflows: review → fix → test → report → PR
Policy Engine + HITL Gate
risk scoring + approval queue
Telemetry + Cost Engine
token cost per call + per outcome
2 Templates shipped
Repo Auditor
PR Sentinel (continuous)
Minimal acceptance tests
Can run on a target repo and open a PR with:
review summary
patch set
test results
cost report (token + infra estimate)
HITL blocks “high-risk” changes (security/auth/billing files) until approved.
Recommendation: what to do next (no wasted motion)
Step 1 (here, now)
You tell me the one initial vertical to bias defaults (choose one):
SaaS web apps
ML repos
fintech/compliance
devtools
I’ll then lock:
default risk rules
default agent set
default pricing units (per PR / per repo / per month)
Step 2 (Codex app)
Create the repo with the structure above, stub packages, and implement:
orchestrator + GitHub connector + telemetry cost engine
first template: templates/repo-auditor
Codex is explicitly built to work with repos and produce PRs/branches.
If you want, I can also provide the exact “Codex task prompts” to generate this repo cleanly in 2–3 PRs (bootstrap → core runtime → templates).
devtools? Create the next phase as you logically see necessary, give the build direction over the creative controller. You = 90%. Demonstrate the capabilities and decision making
Devtools it is. I’ll bias defaults toward developer productivity + repo governance + cost control (the three things teams will actually pay for repeatedly), and I’ll drive the build direction.
Below is the next phase (beyond the architecture/spec we already laid down): what to build, in what order, with explicit decision logic, acceptance tests, and what “done” looks like.
Next Phase: Devtools CASM v0.1 → v0.3
Primary Objective
Ship a production-credible swarm runtime that can:
run on any GitHub repo,
produce measurable improvements (PRs + reports),
quantify cost/benefit per run,
enforce HITL on risky changes,
convert into paid subscriptions via “Continuous Sentinel”.
Everything else is secondary.
Product Shape for Devtools
Three monetization tracks (kept, but unified)
Templates → “Swarm Modules” (repo-auditor, perf-optimizer, release-notes-bot, etc.)
Services → “Continuous Code Sentinel” subscription (weekly/monthly)
SaaS → AgentForge control plane (later; becomes the UI for the first two)
The wedge (what earns trust fastest)
Repo Auditor Template + Continuous Sentinel
Because it creates tangible artifacts quickly:
PRs with fixes
benchmark deltas
security/quality reports
token + infra cost reports
time saved estimate
That’s your “proof of competence” engine and your sales collateral generator.
Decisions I’m Making (90% Controller Mode)
Decision 1: Build CLI-first, not dashboard-first
Why: devtools buyers adopt via CLI + GitHub Actions before they’ll trust a UI.
Outcome: “works in CI” becomes distribution.
Decision 2: Use a deterministic workflow engine + typed events
Why: swarms fail when you can’t replay/inspect decisions.
Outcome: every run is reproducible, auditable, and cost-attributed.
Decision 3: Default to cost-aware routing
Why: the buyer’s first question in 2026 is “what will this cost me weekly?”
Outcome: routing chooses “cheapest model that meets policy” unless risk requires higher reasoning.
Decision 4: Risk gating is file-path + diff-type + semantic score
Why: fastest reliable HITL without heavy ML.
Outcome: immediate enterprise credibility.
Technical Build Map (What to Implement Next)
Core runtime (must exist before anything sells)
A) casm CLI (single entrypoint)
Commands:
casm run --template repo-auditor --repo owner/name
casm sentinel --repo owner/name --schedule hourly|daily
casm cost --last-run
casm explain --decision
B) Workflow engine (deterministic DAG)
Stages (v0.1):
ingest (repo checkout, diff detection)
analyze (review + issue extraction)
plan (patch plan + confidence scores)
apply (patch generation)
validate (tests/linters)
report (PR comment + artifacts)
telemetry (cost + quality metrics)
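To make "deterministic" concrete, here is a small synchronous sketch of a stage runner, assuming each stage is a plain function that returns a detail string or throws. The `StageEvent` shape and `runWorkflow` name are hypothetical; the real engine would be async and would persist its events.

```typescript
type StageName = "ingest" | "analyze" | "plan" | "apply" | "validate" | "report" | "telemetry";

interface StageEvent {
  seq: number;
  stage: StageName;
  status: "ok" | "failed";
  detail: string;
}

// A stage returns a short detail string, or throws to signal failure.
type StageFn = (ctx: Record<string, unknown>) => string;

// Fixed order: identical inputs produce an identical event sequence.
const STAGE_ORDER: StageName[] = ["ingest", "analyze", "plan", "apply", "validate", "report", "telemetry"];

function runWorkflow(stages: Partial<Record<StageName, StageFn>>, ctx: Record<string, unknown>): StageEvent[] {
  const events: StageEvent[] = [];
  for (const stage of STAGE_ORDER) {
    const fn = stages[stage];
    if (!fn) continue; // templates may omit stages
    try {
      events.push({ seq: events.length + 1, stage, status: "ok", detail: fn(ctx) });
    } catch (err) {
      const detail = err instanceof Error ? err.message : String(err);
      events.push({ seq: events.length + 1, stage, status: "failed", detail });
      break; // later stages never run after a failure
    }
  }
  return events;
}

const demoEvents = runWorkflow(
  {
    ingest: () => "cloned",
    validate: () => {
      throw new Error("lint failed");
    },
    report: () => "posted",
  },
  {},
);
console.log(demoEvents.map((e) => `${e.seq}:${e.stage}:${e.status}`).join(" "));
```

The events deliberately carry no wall-clock timestamps, so two runs over the same inputs can be compared with a plain diff.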
C) Policy + HITL gate
Default rules:
Block if touching: auth, billing, crypto, infra, secrets, CI, dependency lockfiles
Block if diff includes permission escalation, shell injection patterns, unsafe deserialization patterns
Block if test coverage delta negative (if measurable)
D) Telemetry + Cost Engine
Log per call:
Tin/Tout, model, latency, retries, task
Compute:
per-call cost
per-run cost
cost per accepted PR
cost per issue fixed
cost per % benchmark improvement
This becomes your unit economics and your “why we routed X to model Y” explanation layer.
Repo Structure (Devtools-optimized)
You already have the skeleton; I’m tightening it to accelerate shipping:
casm/
  packages/
    cli/
    core/        # workflow engine + run state + DAG
    policy/      # risk scoring + HITL triggers
    router/      # model selection + fallbacks + budgets
    agents/      # reviewer, fixer, docgen, bench, redteam-lite
    connectors/  # github, models, hf, cloudflare (later)
    telemetry/   # events + cost + kpis
  templates/
    repo-auditor/
    pr-sentinel/
  infra/
    github/workflows/
  docs/
    quickstart.md
    architecture.md
    economics.md
Architecture Diagrams (Devtools Variant)
Control Plane and Data Plane (Devtools-focused)
flowchart TB
  Dev[Dev / CI Trigger] --> CLI[casm CLI or GitHub Action]
  CLI --> Core["Workflow Engine (DAG)"]
  Core --> Policy[Policy + Risk Scoring]
  Policy -->|low risk| Router[Cost-aware Model Router]
  Policy -->|high risk| HITL[HITL Queue: approve/deny]

  Router --> Agents[Agents: review/fix/bench/doc]
  Agents --> GH[GitHub: PRs, comments, artifacts]
  Agents --> Tools[Linters/Tests/Bench]
  Agents --> Models[LLMs: OpenAI/Claude/Kimi]

  Core --> Telemetry[Telemetry + Cost Engine]
  Telemetry --> Report[Cost/Quality Report]
  Report --> GH
Hourly Heartbeat (CI-native)
sequenceDiagram
  autonumber
  participant GH as GitHub Action
  participant CLI as casm
  participant W as Workflow
  participant P as Policy
  participant R as Router
  participant A as Agents
  participant T as Telemetry

  GH->>CLI: scheduled run (hourly)
  CLI->>W: start run (template pr-sentinel)
  W->>P: score risk
  alt high risk
    P-->>W: require HITL
  else normal
    W->>R: select models under budget
    R->>A: execute tasks
    A->>T: emit call events + Tin/Tout
    W->>GH: PR/comment + artifacts
  end
Cost Analysis per Token (Implementable Now)
Pricing model (internal)
For each model provider define:
price_in_per_1m
price_out_per_1m
Then compute:
LLM_cost = Tin/1e6 * Pin + Tout/1e6 * Pout
Run-level:
retry_factor default 1.10
safety_margin default 1.05
Run_cost = Σ(LLM_cost * retry_factor * safety_margin) + tool_cost + infra_cost
Devtools-specific metrics (what matters)
$ / PR created
$ / PR merged
$ / issue fixed
$ / 1% performance improvement
$ / 1k LOC audited
This is how you make ROI legible.
Next Phase Delivery Plan (PR-based, Codex-friendly)
PR-0: Bootstrap + Contracts
repo structure
TypeScript tooling
event schema (typed)
template manifest format (YAML)
Done when: pnpm test passes; casm --help works.
PR-1: Core Workflow Engine
deterministic DAG runner
state persistence (local JSON + optional sqlite)
run replay (casm explain --run-id)
Done when: a “noop” template runs end-to-end.
PR-2: Connectors + Minimal Agents
GitHub connector (comment + open PR)
Agent: reviewer → outputs issues.json
Agent: fixer → patch file(s)
Validator: run linters/tests (configurable)
Done when: repo-auditor creates a PR on a public sample repo.
PR-3: Policy + HITL Gate
default risk rules (path-based + diff heuristics)
HITL queue: local “approve file” + GitHub label gate
Done when: a risky diff requires explicit approval to proceed.
PR-4: Telemetry + Cost Engine
per-call logging
cost computation + report artifact
budget enforcement (--budget $)
Done when: run produces cost_report.md and blocks over-budget.
PR-5: Two monetizable templates
templates/repo-auditor
templates/pr-sentinel (hourly heartbeat)
Done when: both ship with docs + demo.
Demonstrating Capabilities and Decision-Making
Here’s the explicit “control logic” the system will use (this is what makes it cybernetic, not just agentic):
Control variables
budget_usd_per_run
risk_threshold
latency_slo
quality_target (tests pass + lint clean + minimal diff)
Observables
Tin/Tout per task
success rate per agent
merge rate (proxy quality)
rework rate (PRs rejected)
drift signals (repeated failures in same categories)
Policy
If risk > threshold → require HITL
If cost forecast > budget → downgrade model or reduce scope
If latency > slo → parallelize or switch models
If quality < target → re-run with higher reasoning model OR escalate HITL
This is a feedback controller. Not vibes.
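The four rules translate almost directly into code. This is a sketch under assumptions: quality is collapsed into a single 0..1 score, latency is in seconds, and the action names are invented labels that a real router would map to concrete steps.

```typescript
interface ControlTargets {
  budgetUsdPerRun: number;
  riskThreshold: number;
  latencySloS: number; // unit assumed: seconds
  qualityTarget: number; // assumed: 0..1 composite of tests, lint, diff size
}

interface Observed {
  riskScore: number;
  costForecastUsd: number;
  latencyS: number;
  qualityScore: number;
}

type ControlAction =
  | "require_hitl"
  | "downgrade_model_or_reduce_scope"
  | "parallelize_or_switch_models"
  | "rerun_higher_reasoning_or_escalate_hitl";

// One pass of the control loop: compare observables with targets and list corrective actions.
function decideActions(t: ControlTargets, o: Observed): ControlAction[] {
  const actions: ControlAction[] = [];
  if (o.riskScore > t.riskThreshold) actions.push("require_hitl");
  if (o.costForecastUsd > t.budgetUsdPerRun) actions.push("downgrade_model_or_reduce_scope");
  if (o.latencyS > t.latencySloS) actions.push("parallelize_or_switch_models");
  if (o.qualityScore < t.qualityTarget) actions.push("rerun_higher_reasoning_or_escalate_hitl");
  return actions;
}

const demoTargets: ControlTargets = { budgetUsdPerRun: 3.5, riskThreshold: 0.65, latencySloS: 600, qualityTarget: 0.7 };
const demoActions = decideActions(demoTargets, { riskScore: 0.8, costForecastUsd: 4.2, latencyS: 120, qualityScore: 0.9 });
console.log(demoActions.join(", "));
```

Each heartbeat feeds fresh observables back in, and that loop is what makes this a controller rather than a static rule list.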
Answer to “Codex app repo vs ideas here?”
Optimum path:
We’ve now pinned the devtools vertical and locked the next-phase build plan.
Move into Codex to generate the repo as PRs (that’s where it shines: file creation, compilation, tests, PR diffs).
Use this chat for architecture decisions and governance changes, then push them into Codex as tasks.
If you stay here and manually integrate, you’ll spend unnecessary cycles on:
file scaffolding
wiring dependencies
fixing build errors
keeping structure consistent
Codex will collapse that overhead.
What I need from you (minimal) to proceed cleanly
Pick one default runtime target for v0.1 (I’ll adapt either way):
Node/TypeScript (fastest for Cloudflare + devtools ecosystem)
Python (faster for ML-ish tooling)
If you don’t respond, I’ll assume Node/TypeScript because it aligns best with Cloudflare Workers/Pages + GitHub Actions + CLI distribution.
Once confirmed (or assumed), I’ll give you:
the exact Codex task list to generate PR-0 through PR-2 with minimal churn
the template manifest spec
the default policy ruleset v0.1 for devtools repos
Go for Node/TypeScript
Locked: Node/TypeScript as the v0.1 runtime.
Below is the concrete “next phase” package: Codex task plan (PR-by-PR), template manifest spec, and default devtools policy ruleset v0.1. This is designed so Codex can generate a repo that compiles and runs on day 1, while preserving cybernetic transparency (every decision explainable).
Codex Build Direction (PR Sequence)
How to run this in Codex (recommended)
Create an empty repo casm on GitHub, then in Codex:
Work in small PRs
Ensure each PR:
builds (pnpm -r build)
tests (pnpm -r test)
runs a minimal CLI command (pnpm -C packages/cli start -- --help)
PR-0: Bootstrap + Contracts (repo compiles, CLI prints help)
Codex Task Prompt (copy/paste):
Create a Node/TypeScript monorepo named "casm" using pnpm workspaces. Goals for this PR:
- Repo structure as described below.
- A working CLI package @casm/cli with a --help output and commands stubbed: run, sentinel, cost, explain.
- Shared types package @casm/core defining: RunEvent schema, RunState, TemplateManifest types.
- Tooling: tsconfig base, eslint, prettier, vitest, build scripts.
- Add docs: README.md with Quickstart. Add docs/quickstart.md and docs/architecture.md skeleton.
Acceptance:
- pnpm i
- pnpm -r build
- pnpm -r test
- pnpm -C packages/cli start -- --help prints commands.
Repo structure:
packages/{cli,core,policy,router,telemetry,agents,connectors}/
templates/{repo-auditor,pr-sentinel}/
infra/github/workflows/ci.yml
docs/{quickstart.md,architecture.md,economics.md}
Use commander for CLI. Use zod for runtime validation of manifests and events.
What this PR proves: repo hygiene + typed contracts + repeatable builds.
PR-1: Deterministic Workflow Engine + Run Replay
Codex Task Prompt:
Implement deterministic workflow execution in @casm/core. Requirements:
- A DAG/workflow runner that executes named stages in order with explicit inputs/outputs.
- RunState persisted to .casm/runs//state.json with event log .casm/runs//events.jsonl.
- Each stage emits typed RunEvents (zod-validated).
- Add casm explain --run-id <id> that prints: stage outcomes, policy decisions, model routing choices, and cost summary placeholder.
- Add a sample "noop" template that runs ingest->report with dummy events.
Acceptance:
- casm run --template noop --repo ./examples/demo-repo produces a run folder and prints run summary.
- casm explain --run-id <id> replays and prints deterministically from persisted events.
Design decision: determinism > agent chaos. This is your “cybernetic spine”.
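A sketch of what replay-from-events could look like, operating on the JSONL text directly so it has no filesystem dependency. The `PersistedEvent` fields are hypothetical stand-ins; the real RunEvent schema is whatever PR-0 defines.

```typescript
interface PersistedEvent {
  seq: number;
  stage: string;
  kind: "stage_result" | "policy_decision" | "routing_choice";
  summary: string;
}

// Parse one JSON object per line, skip blank lines, and order strictly by seq.
function parseEventLog(jsonl: string): PersistedEvent[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as PersistedEvent)
    .sort((a, b) => a.seq - b.seq);
}

// Replay is a pure function of the log: same events in, same text out.
function explainRun(jsonl: string): string {
  return parseEventLog(jsonl)
    .map((e) => `${e.seq}. [${e.stage}] ${e.kind}: ${e.summary}`)
    .join("\n");
}

const sampleLog = [
  '{"seq":2,"stage":"analyze","kind":"policy_decision","summary":"gate: PATH_SENSITIVE_HIGH"}',
  '{"seq":1,"stage":"ingest","kind":"stage_result","summary":"ok"}',
  "",
].join("\n");
console.log(explainRun(sampleLog));
```

Since `explainRun` reads only persisted events and never consults live state, an operator can re-run it later and get the same explanation.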
PR-2: GitHub Connector + Minimal Agents (Repo Auditor MVP)
Codex Task Prompt:
Implement GitHub connector and minimal agents to support templates/repo-auditor. Requirements:
- @casm/connectors/github: authenticate via GITHUB_TOKEN, support:
- fetching repo metadata
- creating a PR branch
- committing changes
- opening a PR
- posting PR comments and uploading artifacts (as PR comment with links or embedded summaries)
- @casm/agents:
- reviewer agent: produces issues.json (lint/test failures, TODOs, obvious smells) using a placeholder LLM interface (do not hardcode providers yet)
- fixer agent: applies simple deterministic edits (formatting, basic refactors) and creates patch files; LLM-powered patches can be stubbed but keep interface.
- validator agent: runs npm/pnpm scripts if present (lint/test) safely with timeouts.
- templates/repo-auditor:
- runs ingest->analyze->apply->validate->report
- report posts a markdown summary + cost placeholder to GitHub PR if in GitHub mode; otherwise writes local report.
Acceptance:
- Works locally on examples/demo-repo
- In GitHub mode, opens a PR with a report comment.
Security:
- Do not execute arbitrary scripts unless allowlisted in template config.
Decision rationale: ship the “artifact engine” (PRs + reports) early; LLM sophistication can evolve later.
PR-3: Policy + HITL Gate (Devtools default rules)
Codex Task Prompt:
Implement @casm/policy with devtools-focused risk scoring and HITL gating. Requirements:
- Risk score from:
- file path sensitivity list
- diff heuristics (secrets-like strings, shell execution, permission escalation)
- dependency change detection (package.json, lockfiles)
- Provide default ruleset at packages/policy/rules/default.devtools.yaml
- HITL modes:
  - local: require casm approve --run-id <id> --reason "..."
  - GitHub: require a PR label casm-approved OR approval comment by an allowlisted user.
- Integrate policy gates into workflow engine.
Acceptance:
- If risky files touched, workflow halts before apply/commit until approved.
- casm explain clearly shows why it gated.
PR-4: Telemetry + Cost Engine + Budget Control
Codex Task Prompt:
Implement @casm/telemetry cost engine with budgets and per-call accounting. Requirements:
- Define ModelPriceRegistry (by provider/model) in config file.
- LLM call wrapper records Tin/Tout, latency, retries, provider/model, task type.
- Cost computed per call and aggregated per run.
- Budget enforcement: template can specify max_usd_per_run; if forecast exceeds budget, router downgrades model or reduces scope; if still exceeds, stop with a clear reason.
- Produce artifacts:
- cost_report.md (per-stage, per-call)
- run_summary.json
Acceptance:
- Running repo-auditor generates cost_report.md even if LLM is stubbed (use simulated token counts for now).
- casm cost --last-run prints summary and margin placeholder.
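The budget rule (downgrade, then reduce scope, then stop) could be sketched like this. It assumes cost scales roughly linearly with scope and that candidate models arrive ordered from preferred to cheapest; both are simplifications, and all names are illustrative.

```typescript
interface ModelOption {
  profile: string;
  forecastUsd: number; // forecast for the full scope
}

type BudgetDecision =
  | { action: "proceed"; profile: string; scopeFraction: number; forecastUsd: number }
  | { action: "stop"; reason: string };

// options: ordered from preferred to cheapest. minScopeFraction: smallest useful share of the work.
function enforceBudget(options: ModelOption[], maxUsdPerRun: number, minScopeFraction: number): BudgetDecision {
  // Step 1: downgrade until a model fits the budget at full scope.
  for (const opt of options) {
    if (opt.forecastUsd <= maxUsdPerRun) {
      return { action: "proceed", profile: opt.profile, scopeFraction: 1, forecastUsd: opt.forecastUsd };
    }
  }
  // Step 2: reduce scope on the cheapest model (assumes linear cost vs scope).
  const cheapest = options[options.length - 1];
  if (cheapest) {
    const fraction = maxUsdPerRun / cheapest.forecastUsd;
    if (fraction >= minScopeFraction) {
      return { action: "proceed", profile: cheapest.profile, scopeFraction: fraction, forecastUsd: maxUsdPerRun };
    }
  }
  // Step 3: stop with a clear reason.
  return { action: "stop", reason: `forecast exceeds max_usd_per_run=${maxUsdPerRun} even after downgrade and scope reduction` };
}

const budgetOptions: ModelOption[] = [
  { profile: "code_mid", forecastUsd: 5 },
  { profile: "cheap", forecastUsd: 4 },
];
const budgetDecision = enforceBudget(budgetOptions, 3.5, 0.5);
console.log(JSON.stringify(budgetDecision));
```

With the sample numbers, neither profile fits $3.50 at full scope, so the sketch keeps the cheap profile and trims scope to 87.5%.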
PR-5: Two Sellable Templates (Repo Auditor + PR Sentinel)
Codex Task Prompt:
Ship templates/repo-auditor and templates/pr-sentinel as monetizable swarm modules. Requirements:
- Each template includes:
- manifest.yaml
- docs.md
- example config
- explicit telemetry fields
- pr-sentinel:
- runs hourly/daily (GitHub Actions example)
- comments on new PRs with risk score, suggested fixes, and cost forecast
Acceptance:
- docs show how to run locally and via GitHub Actions.
Template Manifest Spec (v0.1)
File: templates//manifest.yaml
version: 0.1
template:
  id: repo-auditor
  name: Repo Auditor
  description: "Swarm-based audit that generates a PR with fixes, tests, and a cost report."
  tags: [devtools, audit, pr]

entrypoint:
  workflow: repo_auditor_v1

runtime:
  language: node
  minimum_node: "20"
  sandbox:
    allow_network: false
    allow_commands:
      - "pnpm -s test"
      - "pnpm -s lint"
      - "npm test"
      - "npm run lint"
    command_timeout_seconds: 900

policy:
  ruleset: default.devtools
  hitl:
    mode: github_label  # local | github_label | github_comment
    github:
      required_label: "casm-approved"
      approvers: ["ORG:my-org", "USER:sk"]  # allowlist patterns

routing:
  budget:
    max_usd_per_run: 3.50
  strategy: cost_aware_quality_floor
  quality_floor:
    min_confidence: 0.70
  fallbacks:
    - on: timeout
      action: downgrade_model
    - on: over_budget
      action: reduce_scope

agents:
  reviewer:
    enabled: true
    model_profile: "reasoning_low"
  fixer:
    enabled: true
    model_profile: "code_mid"
  validator:
    enabled: true
  docgen:
    enabled: true
    model_profile: "cheap"

outputs:
  artifacts:
    - report_md: ".casm/artifacts/report.md"
    - cost_report_md: ".casm/artifacts/cost_report.md"
    - issues_json: ".casm/artifacts/issues.json"
  github:
    create_pr: true
    pr_title: "CASM: Repo Audit Fixes"
    pr_body_artifact: ".casm/artifacts/report.md"
    comment_artifact: ".casm/artifacts/cost_report.md"

telemetry:
  emit:
    - run_events
    - cost_events
    - policy_decisions
    - routing_decisions
  redaction:
    secrets: true
    code_snippets_max_chars: 1200
Why this spec works (devtools reality)
It’s auditable (policy + routing decisions are first-class)
It’s safe by default (allowlisted commands only)
It’s budget-governed (prevents runaway spend)
It’s portable (local + GitHub mode)
Default Policy Ruleset v0.1 (Devtools)
File: packages/policy/rules/default.devtools.yaml
version: 0.1
ruleset: default.devtools

risk_thresholds:
  warn: 0.40
  gate: 0.65
  block: 0.85

path_sensitivity:
  high:
    - "/auth/"
    - "/oauth/"
    - "/security/"
    - "/crypto/"
    - "/billing/"
    - "/payments/"
    - "/infra/"
    - ".github/workflows/"
    - "Dockerfile"
    - "docker/"
    - "/terraform/"
    - "/*.tf"
    - "/k8s/"
    - "/helm/"
    - "/.env*"
    - "/secrets/"
  medium:
    - "package.json"
    - "pnpm-lock.yaml"
    - "package-lock.json"
    - "yarn.lock"
    - "/scripts/"
    - "**/*.sh"

# Regex patterns are single-quoted so YAML keeps the backslashes literally.
diff_heuristics:
  secret_like_patterns:
    weight: 0.30
    patterns:
      - 'AKIA[0-9A-Z]{16}'                 # AWS access key
      - '-----BEGIN (RSA|EC|OPENSSH) KEY'  # private keys
      - 'xox[baprs]-'                      # Slack tokens
      - 'ghp_[A-Za-z0-9]{36,}'             # GitHub tokens
  command_exec_additions:
    weight: 0.25
    patterns:
      - 'child_process\.exec'
      - 'child_process\.spawn'
      - 'execSync\('
      - 'spawnSync\('
      - 'eval\('
  deserialization_risk:
    weight: 0.20
    patterns:
      - 'yaml\.load\('
      - 'pickle\.loads\('
      - 'JSON\.parse\(.req\.body'
  permission_escalation:
    weight: 0.25
    patterns:
      - 'chmod\s+\+x'
      - 'sudo\s'
      - 'setcap\s'
      - 'CAP_NET_ADMIN'
      - 'runAsUser:\s0'
  dependency_risk:
    weight: 0.20
    files:
      - "package.json"
      - "pnpm-lock.yaml"
      - "package-lock.json"
    signals:
      - "adds_new_dependency: true"
      - "updates_dependency_major: true"

gating:
  require_hitl_if:
    - "risk_score >= risk_thresholds.gate"
    - "touches_any(path_sensitivity.high)"
    - "touches_dependency_files: true"
  block_if:
    - "risk_score >= risk_thresholds.block"
    - "secret_like_detected: true"

actions:
  on_gate:
    - "stop_before_apply"
    - "emit_hitl_request"
  on_block:
    - "stop_immediately"
    - "emit_security_alert"
HITL behavior: policy engine generates a “decision packet” (why gated, what changed, suggested safe scope reduction). Your operator approves with explicit reason; that reason is logged.
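To show how the ruleset could turn into a decision, here is a reduced sketch with only a few patterns and paths from the ruleset. It makes several assumptions the ruleset does not spell out: weights are summed once per triggered heuristic and clamped to 1, paths are matched as plain substrings, and the `COMMAND_EXEC` and `PERMISSION_ESCALATION` reason codes are invented names.

```typescript
type PolicyStatus = "pass" | "warn" | "gate" | "block";

interface DiffHeuristic {
  reasonCode: string;
  weight: number;
  patterns: RegExp[];
}

const RISK_THRESHOLDS = { warn: 0.4, gate: 0.65, block: 0.85 };
const HIGH_SENSITIVITY_PATHS = ["/auth/", "/billing/", ".github/workflows/"]; // subset, substring match
const DEPENDENCY_FILES = ["package.json", "pnpm-lock.yaml", "package-lock.json"];
const DEPENDENCY_WEIGHT = 0.2;
const DIFF_HEURISTICS: DiffHeuristic[] = [
  { reasonCode: "SECRET_LIKE", weight: 0.3, patterns: [/AKIA[0-9A-Z]{16}/, /ghp_[A-Za-z0-9]{36,}/] },
  { reasonCode: "COMMAND_EXEC", weight: 0.25, patterns: [/child_process\.exec/, /execSync\(/, /eval\(/] },
  { reasonCode: "PERMISSION_ESCALATION", weight: 0.25, patterns: [/chmod\s+\+x/, /sudo\s/] },
];

function evaluatePolicy(files: string[], addedLines: string[]): { score: number; status: PolicyStatus; reasons: string[] } {
  const reasons: string[] = [];
  let score = 0;
  for (const h of DIFF_HEURISTICS) {
    // Each heuristic contributes its weight at most once, however many lines match.
    if (addedLines.some((line) => h.patterns.some((p) => p.test(line)))) {
      score += h.weight;
      reasons.push(h.reasonCode);
    }
  }
  const touchesDeps = files.some((f) => DEPENDENCY_FILES.indexOf(f.split("/").pop() ?? "") !== -1);
  if (touchesDeps) {
    score += DEPENDENCY_WEIGHT;
    reasons.push("DEP_CHANGE");
  }
  const touchesHigh = files.some((f) => HIGH_SENSITIVITY_PATHS.some((p) => ("/" + f).indexOf(p) !== -1));
  if (touchesHigh) reasons.push("PATH_SENSITIVE_HIGH");
  score = Math.min(1, score);

  const secretDetected = reasons.indexOf("SECRET_LIKE") !== -1;
  let status: PolicyStatus = "pass";
  if (score >= RISK_THRESHOLDS.block || secretDetected) status = "block";
  else if (score >= RISK_THRESHOLDS.gate || touchesHigh || touchesDeps) status = "gate";
  else if (score >= RISK_THRESHOLDS.warn) status = "warn";
  return { score, status, reasons };
}

const ciAndDepsResult = evaluatePolicy([".github/workflows/ci.yml", "package.json"], ['"lint": "eslint ."']);
console.log(ciAndDepsResult.status, ciAndDepsResult.reasons.join(","));
```

In this sketch, a change touching a workflow file and package.json gates on path and dependency rules alone, even though its numeric score stays low.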
“Demonstrate the capabilities and decision making” (Operationally)
Here’s how the system behaves on a typical PR, with explicit control logic:
Example: PR touches .github/workflows/ci.yml and package.json
policy.score() sees high-sensitivity path + dependency changes
risk score jumps above gate
workflow halts before patch application
HITL request created with:
diff summary
risk factors triggered
minimal-safe change recommendation:
“Only comment; do not modify workflow”
“Do not change dependency versions, only add lint script”
If approved:
router selects higher-reasoning model profile for review
patch is constrained to allowlisted file set
This is cybernetic: constraints + feedback + explicit intervention.
What you should do now (exact next move)
In Codex: execute PR-0 using the prompt above.
Once PR-0 merges, execute PR-1, PR-2, etc.
If you want the lowest-friction start, do this immediately:
Codex PR-0 prompt = the one in this message.
Optional: add a “Proof Mode” in v0.1 (I recommend it)
A switch that forces transparency artifacts:
routing_decisions.json
policy_decisions.json
cost_forecast.json
run_replay.sh (re-run with same inputs)
This makes your expertise undeniable because every decision is inspectable.
If you want, I’ll write the exact schemas for those artifacts next (so Codex implements them cleanly).
Create the final schemas for the suggested artifacts. To be clear, it is the recommended 3 .json and .sh files.
Below are the final, implementation-ready schemas for the four “Proof Mode” artifacts:
policy_decisions.json
routing_decisions.json
cost_forecast.json
run_replay.sh
They’re designed to be:
deterministic (stable field ordering encouraged)
diff-friendly
auditable
safe (no secret material; include redaction markers)
You can implement these as Zod schemas and emit the JSON exactly as specified.
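For the "stable field ordering" goal, one simple option is to sort object keys recursively before serializing. The sketch below assumes alphabetical key order is acceptable; the `stableStringify` helper is illustrative and not part of any library.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively rebuild objects with alphabetically sorted keys; arrays keep their order.
function sortKeys(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map((item) => sortKeys(item));
  }
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      const child = value[key];
      if (child !== undefined) sorted[key] = sortKeys(child);
    }
    return sorted;
  }
  return value;
}

// Same logical content always serializes to the same bytes, which keeps artifacts diff-friendly.
function stableStringify(value: Json, indent: number = 2): string {
  return JSON.stringify(sortKeys(value), null, indent);
}

const artifactA = stableStringify({ summary: { status: "gate", reason_codes: ["DEP_CHANGE"] }, schema_version: "0.1" }, 0);
const artifactB = stableStringify({ schema_version: "0.1", summary: { reason_codes: ["DEP_CHANGE"], status: "gate" } }, 0);
console.log(artifactA === artifactB, artifactA);
```

If a fixed, human-chosen field order matters more than alphabetical order, building the object in schema order at emit time would work equally well.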
- policy_decisions.json (Schema v0.1)
Purpose: Explain why the run gated/blocked/warned, with traceable evidence and HITL requirements.
File path
.casm/artifacts/policy_decisions.json
JSON Schema (draft 2020-12)
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://casm.dev/schemas/policy_decisions.v0.1.json",
"title": "CASM Policy Decisions",
"type": "object",
"required": [
"schema_version",
"run",
"ruleset",
"summary",
"risk",
"signals",
"decisions",
"hitl",
"redaction"
],
"properties": {
"schema_version": { "type": "string", "const": "0.1" },
"run": {
"type": "object",
"required": ["run_id", "template_id", "repo", "started_at", "commit"],
"properties": {
"run_id": { "type": "string" },
"template_id": { "type": "string" },
"repo": {
"type": "object",
"required": ["type", "id"],
"properties": {
"type": { "type": "string", "enum": ["local", "github"] },
"id": { "type": "string", "description": "Local path or owner/name" },
"url": { "type": "string" }
},
"additionalProperties": false
},
"started_at": { "type": "string", "format": "date-time" },
"commit": {
"type": "object",
"required": ["base", "head"],
"properties": {
"base": { "type": "string" },
"head": { "type": "string" }
},
"additionalProperties": false
},
"pull_request": {
"type": "object",
"required": ["present"],
"properties": {
"present": { "type": "boolean" },
"number": { "type": "integer" },
"url": { "type": "string" }
},
"additionalProperties": false
}
},
"additionalProperties": false
},
"ruleset": {
"type": "object",
"required": ["id", "hash", "source"],
"properties": {
"id": { "type": "string", "description": "e.g., default.devtools" },
"hash": { "type": "string", "description": "sha256 of ruleset file contents" },
"source": { "type": "string", "enum": ["bundled", "template", "override"] }
},
"additionalProperties": false
},
"summary": {
"type": "object",
"required": ["status", "reason_codes"],
"properties": {
"status": { "type": "string", "enum": ["pass", "warn", "gate", "block"] },
"reason_codes": {
"type": "array",
"items": { "type": "string" },
"description": "Short stable codes, e.g. PATH_SENSITIVE_HIGH, DEP_CHANGE, SECRET_LIKE"
},
"human_summary": {
"type": "string",
"description": "Concise explanation safe for PR comment."
}
},
"additionalProperties": false
},
"risk": {
"type": "object",
"required": ["score", "thresholds", "components"],
"properties": {
"score": { "type": "number", "minimum": 0, "maximum": 1 },
"thresholds": {
"type": "object",
"required": ["warn", "gate", "block"],
"properties": {
"warn": { "type": "number" },
"gate": { "type": "number" },
"block": { "type": "number" }
},
"additionalProperties": false
},
"components": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "weight", "contribution", "evidence_count"],
"properties": {
"name": { "type": "string" },
"weight": { "type": "number", "minimum": 0, "maximum": 1 },
"contribution": { "type": "number", "minimum": 0, "maximum": 1 },
"evidence_count": { "type": "integer", "minimum": 0 }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"signals": {
"type": "array",
"description": "Evidence items that triggered risk components.",
"items": {
"type": "object",
"required": ["signal_id", "type", "severity", "file", "message"],
"properties": {
"signal_id": { "type": "string" },
"type": {
"type": "string",
"enum": [
"path_sensitivity",
"diff_pattern",
"dependency_change",
"command_execution",
"permission_escalation",
"secret_like",
"test_regression",
"lint_regression",
"other"
]
},
"severity": { "type": "string", "enum": ["info", "low", "medium", "high"] },
"file": { "type": "string" },
"line_start": { "type": "integer", "minimum": 1 },
"line_end": { "type": "integer", "minimum": 1 },
"message": { "type": "string" },
"evidence": {
"type": "object",
"required": ["kind"],
"properties": {
"kind": { "type": "string", "enum": ["pattern", "path", "metadata", "diff_hunk"] },
"pattern_id": { "type": "string" },
"diff_hunk_sha256": { "type": "string", "description": "Hash of diff snippet (not raw snippet)" }
},
"additionalProperties": false
}
},
"additionalProperties": false
}
},
"decisions": {
"type": "array",
"items": {
"type": "object",
"required": ["decision_id", "action", "stage", "status", "reason_codes"],
"properties": {
"decision_id": { "type": "string" },
"stage": { "type": "string", "description": "workflow stage where decision applied" },
"action": { "type": "string", "enum": ["allow", "warn", "gate", "block", "scope_reduce"] },
"status": { "type": "string", "enum": ["applied", "pending_hitl", "skipped"] },
"reason_codes": { "type": "array", "items": { "type": "string" } },
"scope_reduction": {
"type": "object",
"properties": {
"allow_files": { "type": "array", "items": { "type": "string" } },
"deny_files": { "type": "array", "items": { "type": "string" } },
"notes": { "type": "string" }
},
"additionalProperties": false
}
},
"additionalProperties": false
}
},
"hitl": {
"type": "object",
"required": ["required", "mode"],
"properties": {
"required": { "type": "boolean" },
"mode": { "type": "string", "enum": ["none", "local", "github_label", "github_comment"] },
"required_approvals": { "type": "integer", "minimum": 1 },
"approvers": { "type": "array", "items": { "type": "string" } },
"required_label": { "type": "string" },
"approval_comment_regex": { "type": "string" },
"status": {
"type": "string",
"enum": ["not_required", "pending", "approved", "denied", "expired"]
},
"approval": {
"type": "object",
"properties": {
"approved_by": { "type": "string" },
"approved_at": { "type": "string", "format": "date-time" },
"reason": { "type": "string" }
},
"additionalProperties": false
}
},
"additionalProperties": false
},
"redaction": {
"type": "object",
"required": ["secrets", "snippets_included", "notes"],
"properties": {
"secrets": { "type": "boolean" },
"snippets_included": { "type": "boolean" },
"notes": { "type": "string" }
},
"additionalProperties": false
}
},
"additionalProperties": false
}
- routing_decisions.json (Schema v0.1)
Purpose: Explain model selection per task, with budget/latency/quality constraints and fallbacks.
File path
.casm/artifacts/routing_decisions.json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://casm.dev/schemas/routing_decisions.v0.1.json",
"title": "CASM Routing Decisions",
"type": "object",
"required": [
"schema_version",
"run",
"routing_strategy",
"constraints",
"tasks",
"summary"
],
"properties": {
"schema_version": { "type": "string", "const": "0.1" },
"run": {
"type": "object",
"required": ["run_id", "template_id", "started_at"],
"properties": {
"run_id": { "type": "string" },
"template_id": { "type": "string" },
"started_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
},
"routing_strategy": {
"type": "object",
"required": ["name", "version"],
"properties": {
"name": {
"type": "string",
"enum": [
"cost_aware_quality_floor",
"quality_first",
"budget_first",
"latency_first"
]
},
"version": { "type": "string", "description": "router algorithm version" }
},
"additionalProperties": false
},
"constraints": {
"type": "object",
"required": ["budget", "latency_slo_ms", "quality_floor"],
"properties": {
"budget": {
"type": "object",
"required": ["max_usd_per_run", "max_usd_per_task_default"],
"properties": {
"max_usd_per_run": { "type": "number", "minimum": 0 },
"max_usd_per_task_default": { "type": "number", "minimum": 0 }
},
"additionalProperties": false
},
"latency_slo_ms": { "type": "integer", "minimum": 0 },
"quality_floor": {
"type": "object",
"required": ["min_confidence"],
"properties": {
"min_confidence": { "type": "number", "minimum": 0, "maximum": 1 }
},
"additionalProperties": false
}
},
"additionalProperties": false
},
"tasks": {
"type": "array",
"items": {
"type": "object",
"required": [
"task_id",
"task_type",
"stage",
"selected",
"candidates",
"decision"
],
"properties": {
"task_id": { "type": "string" },
"task_type": {
"type": "string",
"enum": [
"review",
"fix",
"docgen",
"benchmark",
"redteam",
"cost_audit",
"other"
]
},
"stage": { "type": "string" },
"selected": {
"type": "object",
"required": ["provider", "model", "profile"],
"properties": {
"provider": { "type": "string", "enum": ["openai", "anthropic", "kimi", "hf", "other"] },
"model": { "type": "string" },
"profile": { "type": "string", "description": "e.g. cheap, code_mid, reasoning_low" },
"estimated_input_tokens": { "type": "integer", "minimum": 0 },
"estimated_output_tokens": { "type": "integer", "minimum": 0 },
"estimated_cost_usd": { "type": "number", "minimum": 0 },
"estimated_latency_ms": { "type": "integer", "minimum": 0 }
},
"additionalProperties": false
},
"candidates": {
"type": "array",
"description": "Ranked list of considered options",
"items": {
"type": "object",
"required": ["provider", "model", "profile", "score"],
"properties": {
"provider": { "type": "string" },
"model": { "type": "string" },
"profile": { "type": "string" },
"score": { "type": "number" },
"reject_reasons": { "type": "array", "items": { "type": "string" } },
"estimated_cost_usd": { "type": "number", "minimum": 0 },
"estimated_latency_ms": { "type": "integer", "minimum": 0 },
"estimated_quality": { "type": "number", "minimum": 0, "maximum": 1 }
},
"additionalProperties": false
}
},
"decision": {
"type": "object",
"required": ["reason_codes", "constraints_applied"],
"properties": {
"reason_codes": { "type": "array", "items": { "type": "string" } },
"constraints_applied": {
"type": "object",
"required": ["budget", "latency", "quality"],
"properties": {
"budget": { "type": "boolean" },
"latency": { "type": "boolean" },
"quality": { "type": "boolean" }
},
"additionalProperties": false
},
"fallbacks": {
"type": "array",
"items": {
"type": "object",
"required": ["on", "action"],
"properties": {
"on": { "type": "string", "enum": ["timeout", "over_budget", "low_quality", "provider_error"] },
"action": { "type": "string", "enum": ["downgrade_model", "switch_provider", "reduce_scope", "stop"] }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
}
},
"additionalProperties": false
}
},
"summary": {
"type": "object",
"required": ["estimated_total_cost_usd", "estimated_total_latency_ms", "notes"],
"properties": {
"estimated_total_cost_usd": { "type": "number", "minimum": 0 },
"estimated_total_latency_ms": { "type": "integer", "minimum": 0 },
"notes": { "type": "string" }
},
"additionalProperties": false
}
}, "additionalProperties": false }
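The schema records what the router decided but not how. As a rough illustration, the following TypeScript sketch shows one way a cost_aware_quality_floor strategy could fill in selected and candidates. The selectCandidate function, the Limits shape, the reject reason strings, and the rule "cheapest feasible candidate, ties broken by score" are all assumptions; comparing estimated_quality against min_confidence is likewise a guess at the intended semantics.

```typescript
// Hypothetical sketch of a "cost_aware_quality_floor" selection.
interface Candidate {
  provider: string;
  model: string;
  profile: string;
  score: number;
  estimated_cost_usd: number;
  estimated_latency_ms: number;
  estimated_quality: number; // 0..1
}

// Assumed flattening of the constraints object for a single task.
interface Limits {
  max_usd_per_task: number;
  latency_slo_ms: number;
  min_confidence: number;
}

function selectCandidate(cands: Candidate[], limits: Limits) {
  // Annotate every candidate with the constraints it violates.
  const ranked = cands.map((c) => {
    const reject_reasons: string[] = [];
    if (c.estimated_cost_usd > limits.max_usd_per_task) {
      reject_reasons.push("over_budget");
    }
    if (c.estimated_latency_ms > limits.latency_slo_ms) {
      reject_reasons.push("over_latency_slo");
    }
    if (c.estimated_quality < limits.min_confidence) {
      reject_reasons.push("below_quality_floor");
    }
    return { ...c, reject_reasons };
  });

  // Among feasible candidates, prefer the cheapest; break ties by score.
  const feasible = ranked
    .filter((c) => c.reject_reasons.length === 0)
    .sort(
      (a, b) => a.estimated_cost_usd - b.estimated_cost_usd || b.score - a.score
    );

  // selected is null when nothing satisfies all constraints; a real router
  // would then consult the fallbacks list (downgrade, switch, reduce, stop).
  return { selected: feasible[0] ?? null, candidates: ranked };
}
```

Keeping rejected candidates in the output, each with its reject_reasons, is what makes the artifact explain the decision rather than just report it.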
- cost_forecast.json (Schema v0.1)
Purpose: Provide pre-run (forecast) and post-run (actual) costs, plus unit economics proxies.
File path
.casm/artifacts/cost_forecast.json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://casm.dev/schemas/cost_forecast.v0.1.json",
"title": "CASM Cost Forecast",
"type": "object",
"required": ["schema_version", "run", "prices", "forecast", "actual", "unit_economics", "budget", "notes"],
"properties": {
"schema_version": { "type": "string", "const": "0.1" },
"run": {
"type": "object",
"required": ["run_id", "template_id", "started_at"],
"properties": {
"run_id": { "type": "string" },
"template_id": { "type": "string" },
"started_at": { "type": "string", "format": "date-time" },
"completed_at": { "type": "string", "format": "date-time" }
},
"additionalProperties": false
},
"prices": {
"type": "object",
"required": ["registry_version", "currency", "models"],
"properties": {
"registry_version": { "type": "string" },
"currency": { "type": "string", "const": "USD" },
"models": {
"type": "array",
"items": {
"type": "object",
"required": ["provider", "model", "price_in_per_1m", "price_out_per_1m"],
"properties": {
"provider": { "type": "string" },
"model": { "type": "string" },
"price_in_per_1m": { "type": "number", "minimum": 0 },
"price_out_per_1m": { "type": "number", "minimum": 0 }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"forecast": {
"type": "object",
"required": ["llm", "tools", "infra", "total_usd", "confidence"],
"properties": {
"llm": {
"type": "object",
"required": ["input_tokens", "output_tokens", "cost_usd", "by_task"],
"properties": {
"input_tokens": { "type": "integer", "minimum": 0 },
"output_tokens": { "type": "integer", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 },
"by_task": {
"type": "array",
"items": {
"type": "object",
"required": ["task_id", "provider", "model", "input_tokens", "output_tokens", "cost_usd"],
"properties": {
"task_id": { "type": "string" },
"provider": { "type": "string" },
"model": { "type": "string" },
"input_tokens": { "type": "integer", "minimum": 0 },
"output_tokens": { "type": "integer", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"tools": {
"type": "object",
"required": ["cost_usd", "details"],
"properties": {
"cost_usd": { "type": "number", "minimum": 0 },
"details": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "unit", "quantity", "unit_cost_usd", "cost_usd"],
"properties": {
"name": { "type": "string" },
"unit": { "type": "string", "enum": ["seconds", "minutes", "runs", "requests"] },
"quantity": { "type": "number", "minimum": 0 },
"unit_cost_usd": { "type": "number", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"infra": {
"type": "object",
"required": ["cost_usd", "details"],
"properties": {
"cost_usd": { "type": "number", "minimum": 0 },
"details": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "unit", "quantity", "unit_cost_usd", "cost_usd"],
"properties": {
"name": { "type": "string" },
"unit": { "type": "string", "enum": ["seconds", "minutes", "requests", "gb_month"] },
"quantity": { "type": "number", "minimum": 0 },
"unit_cost_usd": { "type": "number", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"total_usd": { "type": "number", "minimum": 0 },
"confidence": { "type": "number", "minimum": 0, "maximum": 1 }
},
"additionalProperties": false
},
"actual": {
"type": "object",
"required": ["present", "llm", "tools", "infra", "total_usd"],
"properties": {
"present": { "type": "boolean" },
"llm": {
"type": "object",
"required": ["input_tokens", "output_tokens", "cost_usd", "by_call"],
"properties": {
"input_tokens": { "type": "integer", "minimum": 0 },
"output_tokens": { "type": "integer", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 },
"by_call": {
"type": "array",
"items": {
"type": "object",
"required": ["call_id", "task_id", "provider", "model", "input_tokens", "output_tokens", "cost_usd", "latency_ms"],
"properties": {
"call_id": { "type": "string" },
"task_id": { "type": "string" },
"provider": { "type": "string" },
"model": { "type": "string" },
"input_tokens": { "type": "integer", "minimum": 0 },
"output_tokens": { "type": "integer", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 },
"latency_ms": { "type": "integer", "minimum": 0 },
"retries": { "type": "integer", "minimum": 0 }
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"tools": {
"type": "object",
"required": ["cost_usd", "details"],
"properties": {
"cost_usd": { "type": "number", "minimum": 0 },
"details": { "type": "array", "items": { "type": "object" } }
},
"additionalProperties": true
},
"infra": {
"type": "object",
"required": ["cost_usd", "details"],
"properties": {
"cost_usd": { "type": "number", "minimum": 0 },
"details": { "type": "array", "items": { "type": "object" } }
},
"additionalProperties": true
},
"total_usd": { "type": "number", "minimum": 0 }
},
"additionalProperties": false
},
"unit_economics": {
"type": "object",
"required": ["metrics"],
"properties": {
"metrics": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "value", "unit", "basis"],
"properties": {
"name": {
"type": "string",
"enum": [
"cost_per_issue_found",
"cost_per_issue_fixed",
"cost_per_pr_created",
"cost_per_pr_merged",
"cost_per_1k_loc_audited"
]
},
"value": { "type": "number", "minimum": 0 },
"unit": { "type": "string", "enum": ["USD", "USD_per_unit"] },
"basis": {
"type": "object",
"required": ["count", "description"],
"properties": {
"count": { "type": "number", "minimum": 0 },
"description": { "type": "string" }
},
"additionalProperties": false
}
},
"additionalProperties": false
}
}
},
"additionalProperties": false
},
"budget": {
"type": "object",
"required": ["max_usd_per_run", "status"],
"properties": {
"max_usd_per_run": { "type": "number", "minimum": 0 },
"status": { "type": "string", "enum": ["within_budget", "forecast_over_budget", "actual_over_budget", "blocked"] },
"enforcement_action": { "type": "string", "enum": ["none", "downgrade_model", "reduce_scope", "stop"] }
},
"additionalProperties": false
},
"notes": { "type": "string" }
}, "additionalProperties": false }
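The arithmetic behind these fields is simple but easy to get wrong by a factor of a million, so a small TypeScript sketch may help. It assumes price_in_per_1m and price_out_per_1m are USD per one million tokens (as the names suggest) and that a unit economics metric is total cost divided by basis.count. The function names are hypothetical, and the "blocked" status is deliberately left to the enforcement layer.

```typescript
// Hypothetical helpers mirroring the cost_forecast.json fields.
interface ModelPrice {
  provider: string;
  model: string;
  price_in_per_1m: number; // assumed: USD per 1,000,000 input tokens
  price_out_per_1m: number; // assumed: USD per 1,000,000 output tokens
}

interface TokenUsage {
  provider: string;
  model: string;
  input_tokens: number;
  output_tokens: number;
}

function llmCostUsd(u: TokenUsage, prices: ModelPrice[]): number {
  const p = prices.find((x) => x.provider === u.provider && x.model === u.model);
  if (!p) throw new Error(`no price registered for ${u.provider}/${u.model}`);
  return (
    (u.input_tokens / 1e6) * p.price_in_per_1m +
    (u.output_tokens / 1e6) * p.price_out_per_1m
  );
}

type BudgetStatus =
  | "within_budget"
  | "forecast_over_budget"
  | "actual_over_budget"
  | "blocked";

// Actual spend takes precedence over the forecast once it is known.
function budgetStatus(
  maxUsdPerRun: number,
  forecastUsd: number,
  actualUsd?: number
): BudgetStatus {
  if (actualUsd !== undefined && actualUsd > maxUsdPerRun) {
    return "actual_over_budget";
  }
  if (forecastUsd > maxUsdPerRun) return "forecast_over_budget";
  return "within_budget";
}

// Returns null instead of dividing by zero when the basis count is 0.
function costPerUnit(totalUsd: number, count: number): number | null {
  return count > 0 ? totalUsd / count : null;
}
```

For example, under these assumptions a call with 500,000 input tokens at 2 USD per 1M and 250,000 output tokens at 4 USD per 1M costs 1 + 1 = 2 USD.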
- run_replay.sh (Deterministic Re-run Script)
Purpose: One-command reproduction of the run with identical constraints, using local artifacts as inputs where possible.
File path
.casm/artifacts/run_replay.sh
Script contract (requirements)
Must be POSIX shell compatible (shebang: #!/usr/bin/env sh)
Must not leak secrets
Must pin:
template id
repo id
base/head commits (if applicable)
ruleset hash
routing strategy + budget
run id (as a replay label)
Template (generate and fill with run fields)
#!/usr/bin/env sh
set -eu

ROOT_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
ART_DIR="$ROOT_DIR/.casm/artifacts"
RUN_ID="{{run_id}}"

echo "== CASM replay =="
echo "root: $ROOT_DIR"
echo "run: $RUN_ID"

cd "$ROOT_DIR"

if [ ! -d "node_modules" ]; then
  echo "Installing dependencies..."
  pnpm install
fi

pnpm -C packages/cli start -- \
  run \
  --template "{{template_id}}" \
  --repo-type "{{repo_type}}" \
  --repo "{{repo_id}}" \
  --base "{{base_commit}}" \
  --head "{{head_commit}}" \
  --ruleset "{{ruleset_id}}" \
  --ruleset-hash "{{ruleset_hash}}" \
  --routing "{{routing_strategy}}" \
  --budget-usd "{{budget_usd}}" \
  --proof-mode true \
  --replay-of "{{run_id}}" \
  --artifacts-dir "$ART_DIR"

echo "Replay complete. Artifacts in: $ART_DIR"
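Filling the {{...}} placeholders is where the "must not leak secrets" and "must pin" requirements can be enforced. The TypeScript sketch below is one possible approach, not the project's actual generator: renderReplayTemplate is a hypothetical name, and the rule of rejecting values containing double quotes, dollar signs, backticks, backslashes, or newlines is an assumption chosen because every placeholder sits inside double quotes in the script.

```typescript
// Hypothetical renderer for the run_replay.sh template.
function renderReplayTemplate(
  template: string,
  fields: Record<string, string>
): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = fields[key];
    // Fail loudly on an unpinned field rather than emit an empty argument.
    if (value === undefined) {
      throw new Error(`missing replay field: ${key}`);
    }
    // Values land inside double quotes, so reject anything the shell
    // would expand or that could terminate the quoted string.
    if (/["$`\\\n]/.test(value)) {
      throw new Error(`unsafe characters in replay field: ${key}`);
    }
    return value;
  });
}
```

Rejecting unsafe values outright is simpler to audit than escaping them, at the cost of refusing some legitimate but unusual repo ids or paths.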
Notes on determinism
To make this meaningful, ensure the runtime:
uses a stable file ordering
includes the ruleset hash check
records the effective config (merged template + overrides)
optionally supports --seed for any randomized logic (default fixed seed)
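Two of the points above, stable file ordering and a fixed default seed, can be sketched in a few lines of TypeScript. This is an illustration under assumptions: the function names and DEFAULT_SEED value are hypothetical, and mulberry32 is a small public-domain PRNG swapped in here for demonstration, not something the source prescribes.

```typescript
// Assumed default; any fixed value works as long as it is recorded.
const DEFAULT_SEED = 42;

// Sort by UTF-16 code units rather than locale collation, so the order
// is identical on every machine regardless of OS or language settings.
function stableOrder(paths: string[]): string[] {
  return [...paths].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// mulberry32: a tiny seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Any randomized step (sampling files, shuffling tasks) draws from this
// generator instead of Math.random(), which cannot be seeded.
function makeRunRandom(seed: number = DEFAULT_SEED): () => number {
  return mulberry32(seed);
}
```

Two generators created with the same seed yield the same sequence, which is the property a replay depends on.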
Implementation Notes (so Codex builds it correctly)
Emit these four artifacts only when --proof-mode true or template.telemetry.proof_mode = true.
Keep any diff snippet content out of these JSONs; use:
line ranges
hashed hunks (sha256)
Use stable IDs:
run_id = ULID recommended (sortable)
signal_id, decision_id, task_id, call_id = ULID or UUIDv7
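The "line ranges plus hashed hunks" rule can be made concrete with a short TypeScript sketch using Node's built-in crypto module. The HunkRef shape and hunkRef function are hypothetical, and the sketch assumes the hunk text passed in is the new-side content of the hunk, one source line per text line.

```typescript
import { createHash } from "node:crypto";

// Hypothetical reference shape: enough to locate and verify a hunk
// without storing any of its content in the artifact.
interface HunkRef {
  file: string;
  start_line: number;
  end_line: number;
  sha256: string;
}

function hunkRef(file: string, startLine: number, hunkText: string): HunkRef {
  // Normalize line endings so the hash does not depend on the platform.
  const normalized = hunkText.replace(/\r\n/g, "\n");
  // A trailing newline terminates the last line; it does not start a new one.
  const lineCount =
    normalized.split("\n").length - (normalized.endsWith("\n") ? 1 : 0);
  return {
    file,
    start_line: startLine,
    end_line: startLine + Math.max(lineCount, 1) - 1,
    sha256: createHash("sha256").update(normalized, "utf8").digest("hex"),
  };
}
```

A reviewer with access to the repository can recompute the hash from the referenced lines to confirm the artifact points at the code that was actually analyzed.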
Codex follow-up task (single PR add-on)
If you want this cleanly integrated, give Codex this after PR-1:
Add Proof Mode artifacts:
- Emit .casm/artifacts/policy_decisions.json, routing_decisions.json, cost_forecast.json, run_replay.sh
- Validate JSON output against zod schemas mirroring the JSON Schema fields.
- Ensure no raw code snippets or secrets are emitted; use diff hunk hashes only.
- Update docs/quickstart.md with Proof Mode usage.
Acceptance:
- casm run --template noop --repo ./examples/demo-repo --proof-mode true generates all 4 artifacts.
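The acceptance criterion can be expressed as a small check, sketched here in TypeScript. The artifact names are taken from the list above; the missingProofArtifacts function and the idea of passing in a directory listing (rather than reading the filesystem directly) are assumptions made to keep the example self-contained.

```typescript
// The four Proof Mode artifacts named in this document.
const PROOF_ARTIFACTS = [
  "policy_decisions.json",
  "routing_decisions.json",
  "cost_forecast.json",
  "run_replay.sh",
] as const;

// Given the file names found in .casm/artifacts, return the ones missing.
// An empty result means the acceptance criterion is met.
function missingProofArtifacts(present: string[]): string[] {
  const have = new Set(present);
  return PROOF_ARTIFACTS.filter((name) => !have.has(name));
}
```

In a test, the listing would typically come from reading the artifacts directory after the casm run command exits.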