This guide explains how to interpret analyze-artifacts output and tune the repo-aware autonomy loop over time.
Operations Center provides two complementary tools for understanding and adjusting autonomy behavior.
```bash
./scripts/operations-center.sh analyze-artifacts
./scripts/operations-center.sh analyze-artifacts --repo OperationsCenter
./scripts/operations-center.sh analyze-artifacts --repo OperationsCenter --limit 20
```

Reads retained decision and proposer artifacts and prints a per-family table with recommendations. Best for quick human inspection.
```text
family                emitted  suppressed  created  guardrail_skipped  suppress_rate
observation_coverage        4           2        3                  0            33%
test_visibility             1           3        1                  0            75%
dependency_drift            2           1        2                  0            33%
```
Flags:

- `suppress_rate >= 90%` → consider loosening the threshold
- `emitted > 0` but `created == 0` → check guardrails or proposer dedup
- `guardrail_skipped > 0` → proposals blocked by budget or cooldown
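These flag heuristics are mechanical enough to script against the table's columns. A minimal sketch, assuming one row has been parsed into a dict keyed by the column headers:

```python
def triage(family: dict) -> list[str]:
    """Apply the flag heuristics above to one analyze-artifacts row."""
    notes = []
    total = family["emitted"] + family["suppressed"]
    if total and family["suppressed"] / total >= 0.9:
        notes.append("consider loosening threshold")
    if family["emitted"] > 0 and family["created"] == 0:
        notes.append("check guardrails or proposer dedup")
    if family["guardrail_skipped"] > 0:
        notes.append("proposals blocked by budget or cooldown")
    return notes

# e.g. the test_visibility row above: suppressed 3 of 4 -> 75%, below the
# 90% bar, and created == emitted, so no flags fire.
```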
```bash
# Recommendation-only (default, safe; writes artifacts but no config changes)
./scripts/operations-center.sh tune-autonomy

# With a wider window
./scripts/operations-center.sh tune-autonomy --window 30

# Auto-apply mode (opt-in; requires the env var as a second gate)
OPERATIONS_CENTER_TUNING_AUTO_APPLY_ENABLED=1 ./scripts/operations-center.sh tune-autonomy --apply
```

The regulation loop:
- Aggregates per-family metrics from retained decision + proposer artifacts.
- Applies explicit recommendation rules (over-suppressed → loosen; noisy/low-value → tighten; healthy → keep).
- In auto-apply mode, applies conservative bounded changes to `config/autonomy_tuning.json`.
- Retains a full audit trail under `tools/report/operations_center/tuning/<run_id>/`.
The DecisionEngineService reads config/autonomy_tuning.json at startup if it exists, applying overrides to rule thresholds. To revert a change, delete or edit the file.
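The schema of `config/autonomy_tuning.json` isn't specified here, so the sketch below assumes a flat per-family mapping of threshold overrides; confirm the real shape against a `tuning_changes.json` from an auto-applied run before writing one by hand:

```python
import json
from pathlib import Path

# Hypothetical override shape -- the authoritative schema is whatever
# tune-autonomy --apply writes; this is only an assumed example.
overrides = {
    "test_visibility": {"min_consecutive_runs": 4},
    "observation_coverage": {"min_consecutive_runs": 3},
}
Path("config/autonomy_tuning.json").write_text(json.dumps(overrides, indent=2))
```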
Retained artifacts per run:

- `family_tuning_summary.json` — per-family metrics
- `tuning_recommendations.json` — one recommendation per family with evidence
- `tuning_changes.json` — applied and skipped changes with before/after values
- `tuning_run.json` — combined artifact used by cooldown/quota checks
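To skim the latest run's recommendations without re-running the tool, something like the sketch below works; the field names are assumptions, so check a real artifact for the exact keys:

```python
import json
from pathlib import Path

# Pick the most recent run directory (lexicographic sort of run_ids is
# an assumption about how they are named).
runs = Path("tools/report/operations_center/tuning")
latest = sorted(runs.iterdir())[-1]

recs = json.loads((latest / "tuning_recommendations.json").read_text())
# Keys below are guesses from "one recommendation per family with evidence".
for rec in recs:
    print(rec.get("family"), rec.get("recommendation"), rec.get("evidence"))
```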
Run `tune-autonomy` as a periodic maintenance step, not on every autonomy cycle:
- Weekly during the first month of deployment
- Monthly once behavior stabilizes
- After any significant threshold change to validate the change had the intended effect
- After promoting a new candidate family to confirm it's behaving well
For hands-on adjustments (or for families not in the auto-apply allowlist):
```text
observe-repo (daily) -> generate-insights -> decide-proposals -> propose-from-candidates
        ↓
tune-autonomy (weekly) <- review recommendations
        ↓
manually edit thresholds or update tuning config
        ↓
autonomy-cycle --dry-run <- verify output looks right
        ↓
autonomy-cycle --execute <- go live
```
Families in `_DEFAULT_ALLOWED_FAMILIES` fire automatically on every cycle; the remaining families must be enabled explicitly:
| Family | Active by default | Default tier | Risk class |
|---|---|---|---|
| `observation_coverage` | yes | 1 | logic |
| `test_visibility` | yes | 1 | logic |
| `dependency_drift_followup` | yes | 1 | logic |
| `execution_health_followup` | yes | 1 | logic |
| `lint_fix` | yes | 2 | style |
| `type_fix` | yes | 1 | logic |
| `validation_pattern_followup` | yes | 1 | logic |
| `ci_pattern` | no — requires `--all-families` | 1 | logic |
| `hotspot_concentration` | no — requires `--all-families` | 1 | structural |
| `todo_accumulation` | no — requires `--all-families` | 1 | style |
| `backlog_promotion` | no — requires `--all-families` | 1 | logic |
| `arch_promotion` | no — requires `--all-families` + health gates | 0 | arch |
Autonomy tiers control the initial Plane task state for created tasks. Tier 2 tasks auto-execute; tier 1 tasks land in Backlog and require human promotion; tier 0 tasks are never created.
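The tier semantics amount to a three-way mapping. A minimal sketch (a hypothetical helper, not the decision engine's actual code):

```python
def initial_task_state(tier: int) -> str | None:
    """Map an autonomy tier to the initial Plane task handling.

    Sketch of the semantics described above; the real mapping lives
    inside the decision engine.
    """
    if tier >= 2:
        return "auto-execute"  # task runs without human promotion
    if tier == 1:
        return "Backlog"       # created, but a human must promote it
    return None                # tier 0: task is never created
```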
```bash
# View current tiers
./scripts/operations-center.sh autonomy-tiers show

# Promote a family to auto-execute after confirming track record
./scripts/operations-center.sh autonomy-tiers set --family lint_fix --tier 2

# Demote a family after a bad run
./scripts/operations-center.sh autonomy-tiers set --family type_fix --tier 0
```

When to promote a family to tier 2:
- `tune-autonomy` shows `acceptance_rate >= 80%` with ≥ 5 feedback records
- No runaway board spam from this family in the last 30 days
- Human review of 3-5 created tasks confirms the scope is consistently bounded
When to demote a family:

- `acceptance_rate < 30%` across 5+ feedback records — proposals are not landing
- Tasks are consistently escalated rather than merged
- `tune-autonomy` recommends `tighten_threshold` with `autonomy_tier: decrease`
The self-tuning regulator now tracks `proposals_merged` and `proposals_escalated` per family by joining feedback records to proposer artifacts. The resulting acceptance rate is `acceptance_rate = merged / (merged + escalated)`.
Two new recommendation rules fire based on acceptance rate:
| Pattern | Condition | Action |
|---|---|---|
| Low acceptance | acceptance_rate < 30% AND ≥ 5 feedback records | tighten_threshold — suggests decreasing autonomy tier |
| High acceptance | acceptance_rate ≥ 80% AND ≥ 5 feedback records | keep — suggests increasing autonomy tier |
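The two rules reduce to simple arithmetic over the feedback counts. A minimal sketch of the conditions in the table above (the constant name is illustrative):

```python
MIN_FEEDBACK_RECORDS = 5  # both rules require at least 5 feedback records

def acceptance_recommendation(merged: int, escalated: int) -> str | None:
    """Apply the two acceptance-rate rules from the table above."""
    total = merged + escalated
    if total < MIN_FEEDBACK_RECORDS:
        return None  # not enough evidence; neither rule fires
    rate = merged / total
    if rate < 0.30:
        return "tighten_threshold"  # suggests decreasing the autonomy tier
    if rate >= 0.80:
        return "keep"               # suggests increasing the autonomy tier
    return None
```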
These suggestions appear in `tuning_recommendations.json` as `suggested_change: {"autonomy_tier": {"direction": "increase|decrease", "step": 1}}`. They are advisory — the operator applies them manually via `autonomy-tiers set`.
The acceptance rate metrics appear in the tune-autonomy output per family. To collect meaningful data, ensure the reviewer watcher is writing feedback records (it does so automatically on merge/escalate), and use the feedback entrypoint for tasks handled manually.
Fires when retained execution artifacts show systemic execution quality problems.
Two patterns:
`high_no_op_rate`

- Condition: `no_op_count / total_runs >= 0.5` and `total_runs >= 5`
- When to loosen: too many spurious proposals for repos that legitimately have many no-op test runs; raise the rate threshold to 0.65 or raise `_MIN_RUNS_FOR_RATE`.
- When to tighten: lower the threshold if you want earlier warning, e.g. 0.4 on repos where no-ops are reliably a signal of bad task quality.
- Where to change: `src/operations_center/insights/derivers/execution_health.py` constants `_HIGH_NO_OP_RATE_THRESHOLD`, `_MIN_RUNS_FOR_RATE`.
`persistent_validation_failures`

- Condition: `validation_failed_count >= 3`
- When to loosen: repos under active development naturally have transient failures; raise the threshold to 5.
- When to tighten: lower to 2 for repos with strict quality gates where even 2 failures warrant a task.
- Where to change: `_VALIDATION_FAILURE_THRESHOLD` in the same file.
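A minimal sketch of the two conditions, using the constants named above with their default values (the deriver's artifact-loading plumbing is omitted):

```python
# Constants mirror those in execution_health.py; defaults per the text above.
_HIGH_NO_OP_RATE_THRESHOLD = 0.5
_MIN_RUNS_FOR_RATE = 5
_VALIDATION_FAILURE_THRESHOLD = 3

def high_no_op_rate(no_op_count: int, total_runs: int) -> bool:
    return (
        total_runs >= _MIN_RUNS_FOR_RATE
        and no_op_count / total_runs >= _HIGH_NO_OP_RATE_THRESHOLD
    )

def persistent_validation_failures(validation_failed_count: int) -> bool:
    return validation_failed_count >= _VALIDATION_FAILURE_THRESHOLD
```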
Fires when a repo signal (e.g. test signal) has been persistently unavailable across snapshots.
- Rule: `ObservationCoverageRule(min_consecutive_runs=2)`
- When to loosen: high `suppress_rate` from `cooldown_active`; the signal clears and re-appears frequently.
- When to tighten: tasks are created but the signal resolves without action — raise `min_consecutive_runs` to 3 or 4.
- Conservative default: 2 consecutive runs. Appropriate for early deployment.
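The `min_consecutive_runs` gate used here (and by `test_visibility` and `dependency_drift` below) requires an unbroken streak of recent snapshots with the signal present. A sketch of that check, assuming the signal history is reduced to a list of booleans, newest last:

```python
def streak_met(signal_present: list[bool], min_consecutive_runs: int) -> bool:
    """True if the most recent snapshots form an unbroken run of the signal.

    Illustrative only: the real rules operate on retained snapshots,
    not a bare list of booleans.
    """
    streak = 0
    for present in reversed(signal_present):
        if not present:
            break
        streak += 1
    return streak >= min_consecutive_runs

# e.g. observation_coverage with min_consecutive_runs=2:
assert streak_met([False, True, True], 2)       # two-run streak -> fires
assert not streak_met([True, False, True], 2)   # streak broken -> suppressed
```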
Fires when test status has been persistently unknown.
- Rule: `TestVisibilityRule(min_consecutive_runs=3)`
- When to loosen: `suppress_rate` is high because the signal flickers; lower to 2.
- When to tighten: tasks are created too frequently for transient test signal loss; raise to 4.
- Conservative default: 3 consecutive runs. Intentionally a higher bar than `observation_coverage`.
Fires when dependency drift is persistently detected.
- Rule: `DependencyDriftRule(min_consecutive_runs=2)`
- When to loosen: drift is always present (expected in active repos); raise the threshold or add a pattern exclusion.
- When to tighten: not usually needed; drift is a slow-moving signal.
- Conservative default: 2 consecutive runs.
Fires when ruff detects lint violations.
- Rule: `LintDriftRule` — fires on `lint_drift/present` or `lint_drift/worsened` insights
- Default tier: 2 (auto-executes) — style risk class, bounded scope
- When to demote to tier 1: repos where lint fixes have historically caused unintended refactors; demote so a human reviews before execution
- Where to change: `src/operations_center/insights/derivers/lint_drift.py`
Fires when ty or mypy reports type errors.
- Rule: `TypeImprovementRule(min_errors=3)` — requires ≥3 errors before firing
- Default tier: 1 — logic risk class; requires human review before execution
- When to loosen: raise the tier to 2 after confirming that auto-generated type fixes are consistently bounded and safe in your codebase
- When to tighten: lower the `min_errors` threshold to 1 if you want earlier warning; raise to 10 if noise is high
- Where to change: `src/operations_center/decision/rules/type_improvement.py` constant `min_errors`
Fires when the same Plane task has ≥2 runs and ≥2 validation failures across retained execution artifacts.
- Rule: `ValidationPatternRule` — high confidence if ≥3 affected tasks; medium confidence otherwise
- Default tier: 1 — logic risk class; investigation required before executing
- When to loosen: lower `_MIN_FAILURES_FOR_PATTERN` to 1 if you want earlier warning
- When to tighten: raise `_MIN_RUNS_FOR_PATTERN` to 3 if transient failures are common
- Where to change: `src/operations_center/observer/collectors/validation_history.py`
Fires when GitHub check-run history shows failing or flaky checks.
- Rule: `CIPatternRule` — `checks_failing` (confidence=high) or `checks_flaky` (confidence=medium)
- Default tier: 1 — logic risk class; root cause investigation required
- Promotion criteria: enable once you have ≥2 weeks of CI history baseline and have confirmed that the failing/flaky classification is reliable for your repo
- Thresholds: `FAILING_THRESHOLD=0.7` (≥70% fail rate), `FLAKY_THRESHOLD=0.2` (≥20%)
- Where to change: `src/operations_center/observer/collectors/ci_history.py`
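A minimal sketch of how the two thresholds classify a check, assuming the fail rate has already been computed from check-run history:

```python
FAILING_THRESHOLD = 0.7  # >=70% of runs failed -> checks_failing (high confidence)
FLAKY_THRESHOLD = 0.2    # >=20% failed, below failing -> checks_flaky (medium)

def classify_check(fail_rate: float) -> str | None:
    if fail_rate >= FAILING_THRESHOLD:
        return "checks_failing"
    if fail_rate >= FLAKY_THRESHOLD:
        return "checks_flaky"
    return None  # healthy; no insight emitted
```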
Fires when a file appears repeatedly in file-hotspot snapshots.
- Status: in `ALL_FAMILIES` but not in `_DEFAULT_ALLOWED_FAMILIES`. Enable via `--all-families` in `autonomy-cycle` or by adding to `allowed_families` in `DecisionContext`.
- Promotion criteria: enable only when you've confirmed that hotspot signals reliably identify files worth decomposing (not just frequently edited files that are intentionally central).
Fires when TODO concentration is high.
- Status: same gating as `hotspot_concentration`.
- Promotion criteria: enable when you've reviewed several TODO signals and confirmed they represent real technical debt, not intentional markers.
The policy-level controls (applied across all families) are:

- `cooldown_minutes` — how long after a candidate was last emitted before it can be emitted again for the same dedup key. Default: 120 minutes.
- `max_candidates` — max candidates per decision run. Default: 3.
- `max_candidates_per_family` — max 1 per family per run (enforced in policy).
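A sketch of how these three controls gate a candidate, assuming simple in-memory bookkeeping (the real checks run inside the decision policy, and the suppression reasons named in the comments are the ones tabulated later in this guide):

```python
from datetime import datetime, timedelta

def admit(candidate_key: str, family: str,
          last_emitted: dict[str, datetime],
          family_counts: dict[str, int],
          admitted: int,
          *, cooldown_minutes: int = 120,
          max_candidates: int = 3,
          max_candidates_per_family: int = 1) -> bool:
    """Return True if a candidate passes the cooldown and quota checks."""
    emitted_at = last_emitted.get(candidate_key)
    if emitted_at and datetime.now() - emitted_at < timedelta(minutes=cooldown_minutes):
        return False  # suppressed: cooldown_active
    if admitted >= max_candidates:
        return False  # suppressed: quota_exceeded (run-level)
    if family_counts.get(family, 0) >= max_candidates_per_family:
        return False  # suppressed: quota_exceeded (family-level)
    return True
```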
To adjust these for a specific run:
```bash
# Watcher uses these defaults; override in DecisionContext when calling directly
./scripts/operations-center.sh decide-proposals --max-candidates 5 --cooldown-minutes 60
```

Or update the watcher defaults in the worker `main.py` constants if you want permanent changes:
```python
PROPOSAL_COOLDOWN_SECONDS = 20 * 60  # 20 minutes
MAX_PROPOSALS_PER_CYCLE = 4
MAX_PROPOSALS_PER_DAY = 30
```

When to re-run `analyze-artifacts`:

- Weekly during the first month of deployment.
- After any watcher restart that caused rate-limited runs or budget exhaustion — check whether `remaining_exec_capacity` suppression dominated the output.
- After promoting a new family — confirm the new family's emit/create ratio is healthy before leaving it enabled permanently.
- After a burst of autonomy tasks — confirm the burst was from real signals, not threshold drift.
| Reason | Meaning | Action |
|---|---|---|
| `cooldown_active` | Same dedup key was emitted recently | Normal; wait for cooldown to clear |
| `quota_exceeded` | `max_candidates` or `max_candidates_per_family` hit | Raise limits if signal quality is high |
| `family_deferred_initial_gating` | Family not in `allowed_families` | Promote the family when ready |
| `proposal_budget_too_low` | Execution budget too low for proposals | Check `usage.json`; budget resets hourly/daily |
| `existing_open_equivalent_task` | Board already has an open task with this dedup key | Expected; no action needed |
| `velocity_cap_exceeded` | ≥10 proposals created in the last 24 hours | Wait for the window to pass; or raise `max_proposals_per_24h` |
| `proposal_stale_open` | Prior unresolved proposal exceeded `expires_after_runs` without feedback | Close or record feedback for the old task; or increase `expires_after_runs` |
The ConfidenceCalibrationStore filters events by recency when computing acceptance rates. The default window is 90 days.
```bash
# View calibration with the default 90-day window
./scripts/operations-center.sh tune-autonomy

# Widen the window to see all historical data
./scripts/operations-center.sh tune-autonomy --window 180
```

The `window_days` parameter is passed to `calibration_for()` and `report()` internally. Events older than the window are excluded from acceptance-rate calculations.
Cleaning up stale events:

```python
from operations_center.tuning.calibration import ConfidenceCalibrationStore

store = ConfidenceCalibrationStore()
store.cleanup_old_events(window_days=90)  # removes events older than 90 days
```

This is safe to run periodically. It does not affect the recommendation output — recommendations already exclude old events via the window — but it keeps `state/calibration_store.json` compact.
Why time decay matters: Early feedback records may reflect a different codebase state or an earlier version of Kodo. Excluding stale events prevents historical over-confidence from blocking proposals that are now reliably accepted.
Before the per-cycle proposal cap is applied, proposals are ranked by a utility score:
```text
score = confidence_weight + calibration_bonus + state_bonus - scope_penalty
```

- `confidence_weight`: 1.0 for high, 0.6 for medium, 0.2 for low
- `calibration_bonus`: up to +0.3 based on the family's recent acceptance rate (0.0 when no calibration data)
- `state_bonus`: +0.1 if the proposal targets a task already in Backlog
- `scope_penalty`: -0.2 for medium complexity (3–7 files), -0.5 for high complexity (≥8 files)
High-complexity proposals (≥8 files affected) are automatically placed in Backlog regardless of score. This prevents unachievable-scope proposals from consuming the execution budget.
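A sketch of the scoring arithmetic, folding in the high-complexity scope penalty; the linear scaling of `calibration_bonus` is an assumption, since the text only bounds it at +0.3:

```python
CONFIDENCE_WEIGHT = {"high": 1.0, "medium": 0.6, "low": 0.2}

def utility_score(confidence: str, acceptance_rate: float | None,
                  in_backlog: bool, files_affected: int) -> float:
    # Assumption: bonus scales linearly with acceptance rate up to +0.3.
    calibration_bonus = 0.3 * acceptance_rate if acceptance_rate is not None else 0.0
    state_bonus = 0.1 if in_backlog else 0.0
    if files_affected >= 8:
        scope_penalty = 0.5   # high complexity; also forced into Backlog
    elif files_affected >= 3:
        scope_penalty = 0.2   # medium complexity
    else:
        scope_penalty = 0.0
    return CONFIDENCE_WEIGHT[confidence] + calibration_bonus + state_bonus - scope_penalty
```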
To inspect current proposal scores for a cycle, add `--dry-run` and check the retained `proposal_candidates.json` artifact — the `utility_score` field is written alongside each candidate.
When you change a threshold, add a comment inline or update this file with the before/after:
```python
# 2026-04-04: raised TestVisibilityRule min_consecutive_runs from 3 to 4
# Reason: test signal flickered on ExternalRepo 3 times in 2 days,
# creating tasks that resolved themselves before kodo could run them.
```
This audit trail is how you distinguish "threshold correctly tuned" from "threshold silently drifted."