
Phase 1 · SAO feedback sensors — output quality evaluation beyond exit code #45

@Luis85

Description


Meta

type: DesignDecision
stage: draft
maturity: L1
created: 2026-05-10
inputs:
  - "Luis85/specorator specs/specorator-agent-orchestrator/design.md — success criteria"
  - "OpenAI harness engineering — feedback controls as sensors"
  - "Martin Fowler — computational vs. inferential sensors"
related: ["#43", "#44", "#46", "#21"]

Purpose. Design the feedback harness layer that validates agent output quality after execution: the "sensors" that prevent low-quality work from advancing to the next workflow stage.


Context

Harness engineering distinguishes:

  • Computational sensors: deterministic, fast (milliseconds to seconds), CPU-based. Examples: exit codes, file presence, schema checks, structural validation. Highly reliable.
  • Inferential sensors: semantic analysis via LLMs / AI judges. Slower and non-deterministic, but they enable rich quality judgments. A sketch of both kinds follows below.
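
To make the distinction concrete, here is a minimal TypeScript sketch of a shared sensor contract; the interface and field names are assumptions for illustration, not part of the SAO codebase.

```typescript
// Hypothetical common contract; identifiers are illustrative, not from the SAO codebase.
type SensorKind = "computational" | "inferential";

interface SensorResult {
  pass: boolean;
  detail?: string; // e.g. a missing file path, or a judge's rationale
}

interface Sensor {
  kind: SensorKind;
  name: string;
  // Computational sensors should resolve in milliseconds to seconds;
  // inferential sensors may take seconds to minutes and can be non-deterministic.
  evaluate(input: {
    exitCode: number;
    artifactPath: string;
    artifactText: string | null;
  }): Promise<SensorResult>;
}
```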

The SAO design doc currently defines success as: exit code 0 AND stage artifact present.

This is a minimal structural check (two computational sensors). A key finding from OpenAI's harness engineering post: agents are systematically bad at evaluating their own output, especially as the context window fills. External evaluation is required for production-grade harness reliability.
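
For reference, the current criterion reduces to two trivially small computational checks. A sketch, assuming Node.js and an artifact path already known to the runner:

```typescript
import { existsSync } from "node:fs";

// L1: exit code 0 (taken from the finished agent process).
function exitCodeOk(exitCode: number): boolean {
  return exitCode === 0;
}

// L2: stage artifact present on disk (path is stage-specific; assumed known to the runner).
function stageArtifactPresent(artifactPath: string): boolean {
  return existsSync(artifactPath);
}

// Current success criterion: both computational checks must pass.
function currentSuccessCriterion(exitCode: number, artifactPath: string): boolean {
  return exitCodeOk(exitCode) && stageArtifactPresent(artifactPath);
}
```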


Sensor hierarchy

| Level | Sensor | Type | Failure action |
| --- | --- | --- | --- |
| L1 | Exit code 0 | Computational | Retry with backoff |
| L2 | Stage artifact present | Computational | Retry with backoff |
| L3 | Artifact schema valid (required sections, notation) | Computational | Retry with backoff |
| L4 | Context-window guard (min content thresholds, sentinel sections) | Computational | Retry or CONTEXT_EXHAUSTION (→ #46) |
| L5 | LLM judge quality evaluation | Inferential | Retry or human review |
| L6 | Human review gate | Manual | Hold in pending-review until approved |

L1–L4 are in scope for V1. L5–L6 are design decisions for this issue.
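
One way to keep the hierarchy declarative is a table the runner consults whenever a sensor fails. The sketch below encodes the rows above with simplified failure-action names; the identifiers are illustrative, not from the SAO codebase.

```typescript
// Sensor hierarchy as a declarative lookup; failure-action names are simplified.
type SensorType = "computational" | "inferential" | "manual";

type FailureAction =
  | "retry-with-backoff"
  | "context-exhaustion"    // L4 (may also retry; see #46)
  | "retry-or-human-review" // L5
  | "hold-pending-review";  // L6

const sensorHierarchy: Record<string, { type: SensorType; onFail: FailureAction }> = {
  L1: { type: "computational", onFail: "retry-with-backoff" },
  L2: { type: "computational", onFail: "retry-with-backoff" },
  L3: { type: "computational", onFail: "retry-with-backoff" },
  L4: { type: "computational", onFail: "context-exhaustion" },
  L5: { type: "inferential", onFail: "retry-or-human-review" },
  L6: { type: "manual", onFail: "hold-pending-review" },
};
```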


Sensor integration with state machine

AgentRunner exits
    └─ L1 check (exit code)      ──fail──→ retry-queued
        └─ L2 check (artifact)   ──fail──→ retry-queued
            └─ L3 check (schema) ──fail──→ retry-queued
                └─ L4 check (guard) ─fail─→ CONTEXT_EXHAUSTION → released
                    └─ [L5 if enabled] ─fail─→ retry-queued or pending-review
                        └─ success → merge + stage advance (or L6 review gate)
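
The chain reads as an ordered short-circuit evaluation: sensors run in hierarchy order and the first failure determines the state transition. A simplified sketch, with the L4 and L5 retry branches collapsed for brevity; all names are assumptions, not the SAO data model.

```typescript
// Minimal shapes for the sketch; names are assumptions, not the SAO data model.
interface RunOutput {
  exitCode: number;
  artifactPath: string;
}

interface SensorCheck {
  level: "L1" | "L2" | "L3" | "L4" | "L5";
  evaluate(run: RunOutput): Promise<boolean>;
}

type NextState =
  | "retry-queued"
  | "released"          // via CONTEXT_EXHAUSTION (#46)
  | "pending-review"
  | "merge-and-advance";

// Ordered short-circuit evaluation: the first failing sensor decides the transition.
async function runSensorChain(
  sensors: SensorCheck[],     // L1..L4, plus L5 when enabled for the stage
  run: RunOutput,
  humanGateEnabled: boolean,  // L6
): Promise<NextState> {
  for (const sensor of sensors) {
    if (await sensor.evaluate(run)) continue;
    if (sensor.level === "L4") return "released";        // simplified: retry path omitted
    if (sensor.level === "L5") return "pending-review";  // or retry-queued, per open question 2
    return "retry-queued";                               // L1–L3: retry with backoff
  }
  return humanGateEnabled ? "pending-review" : "merge-and-advance";
}
```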

Open questions for this issue

  1. L5 in V1? Which stages, if any, warrant LLM judge evaluation in V1 (cost and latency are real)?
  2. Quality threshold: what constitutes "good enough" for automatic advancement via L5?
  3. L6 integration: how does the human review gate surface in the StatusSurface and fleet dashboard (specorator#168)?
  4. Retry vs. release decision: at what point does repeated sensor failure trigger released instead of retry-queued? (A retry-count limit is already specified; does the failing sensor's type affect this?)
  5. Sensor configurability: should stages declare their required sensor level in the template frontmatter (→ #44)? One possible shape is sketched below.
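
For question 5, one possible and purely hypothetical shape of such a per-stage declaration, expressed here as a TypeScript schema rather than a settled frontmatter format (field names are not decided; see #44):

```typescript
// Hypothetical stage-template frontmatter schema (open question 5; see #44).
interface StageTemplateFrontmatter {
  stage: string;
  requiredSensorLevel: "L1" | "L2" | "L3" | "L4" | "L5" | "L6"; // minimum gate to advance
  judgeRubric?: string; // only meaningful when L5 is required
}

// Example: a design stage that wants an LLM judge pass before advancing.
const designStage: StageTemplateFrontmatter = {
  stage: "design",
  requiredSensorLevel: "L5",
  judgeRubric: "rubrics/design-quality.md",
};
```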

Acceptance
