Purpose. Design the feedback harness layer that validates agent output quality after execution — the "sensors" that prevent low-quality work from advancing the workflow stage.
Context
Harness engineering distinguishes:
Computational sensors: deterministic, fast (milliseconds–seconds), CPU-based. Include exit codes, file presence, schema checks, structural validation. Highly reliable.
Inferential sensors: semantic analysis via LLMs / AI judges. Slower, non-deterministic, but enable rich quality judgments.
The SAO design doc currently defines success as: exit code 0 AND stage artifact present.
This is a minimal structural check (two computational sensors). Research finding from OpenAI's harness engineering post: agents are systematically bad at evaluating their own output, especially as context fills. External evaluation is required for production-grade harness reliability.
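The current success rule (exit code 0 AND stage artifact present) can be sketched as two independent computational sensors whose results are combined. This is a minimal illustration, not the SAO implementation: the names `SensorResult`, `exit_code_sensor`, `artifact_sensor`, and `stage_succeeded` are hypothetical, and the artifact is assumed to be a single file on disk.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SensorResult:
    """Outcome of one computational sensor (hypothetical type)."""
    name: str
    passed: bool
    detail: str


def exit_code_sensor(exit_code: int) -> SensorResult:
    # Passes only when the agent process exited with code 0.
    return SensorResult("exit_code", exit_code == 0, f"exit code {exit_code}")


def artifact_sensor(artifact: Path) -> SensorResult:
    # Passes only when the expected stage artifact exists as a regular file.
    present = artifact.is_file()
    status = "found" if present else "missing"
    return SensorResult("artifact_present", present, f"{artifact} {status}")


def stage_succeeded(exit_code: int, artifact: Path) -> tuple[bool, list[SensorResult]]:
    # Success requires every sensor to pass (logical AND).
    results = [exit_code_sensor(exit_code), artifact_sensor(artifact)]
    return all(r.passed for r in results), results
```

Returning the individual results alongside the verdict keeps the failing sensor identifiable, which matters once retry and escalation rules depend on which sensor failed.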
L5 in V1? Which stages, if any, warrant LLM judge evaluation in V1 (cost and latency are real)?
Quality threshold: what constitutes "good enough" for automatic advancement via L5?
L6 integration: how does the human review gate surface in the StatusSurface and fleet dashboard (specorator#168)?
Retry vs. release decision: at what point does repeated sensor failure trigger released instead of retry-queued? (Already specified for retry count; does sensor type affect this?)
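One possible shape for the retry vs. release decision, shown only to make the last question concrete. The state names `retry-queued` and `released` come from this issue. The `SensorKind` enum, the `RETRY_BUDGET` table, and the budget values are assumptions for illustration. Both budgets are set equal here, which corresponds to sensor type not affecting the decision.

```python
from enum import Enum


class SensorKind(Enum):
    COMPUTATIONAL = "computational"
    INFERENTIAL = "inferential"


# Hypothetical per-kind retry budgets; the values are placeholders, not a proposal.
RETRY_BUDGET = {
    SensorKind.COMPUTATIONAL: 3,
    SensorKind.INFERENTIAL: 3,
}


def next_state(kind: SensorKind, failures: int, budget=RETRY_BUDGET) -> str:
    """Return the state to enter after `failures` consecutive sensor failures."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    # Below the budget the work item is retried; at or above it, it is released.
    return "retry-queued" if failures < budget[kind] else "released"
```

If sensor type should affect escalation, only the budget table changes; the decision function stays the same.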
Sensor hierarchy
CONTEXT_EXHAUSTION (→ #46)
pending-review until approved
L1–L4 are in scope for V1. L5–L6 are design decisions for this issue.
Sensor integration with state machine
Open questions for this issue
Retry vs. release decision: at what point does repeated sensor failure trigger released instead of retry-queued? (Already specified for retry count; does sensor type affect this?)
Acceptance
released escalation rules per sensor level specified
… RetryEntry.reason?)