Purpose. Detect and recover from the known failure mode where agents wrap up tasks prematurely as their context window approaches its limit — before the stage artifact is genuinely complete.
The failure mode
OpenAI's harness engineering research exposed a systematic failure: agents wrap up tasks prematurely as the context window approaches its limit — not because the work is done, but because they sense the constraint. Exit code 0 does not mean success.
The SAO already requires artifact presence (L2 sensor). But artifact presence alone can still be fooled: an agent that senses context pressure may generate a placeholder artifact to satisfy the structural check while producing a stub.
Manifestation patterns
| Pattern | Description |
| --- | --- |
| Graceful stub | Agent creates the artifact file but populates it with a minimal placeholder ("I've started the requirements...") |
| Context-cut summary | Agent produces a document that looks valid but is truncated reasoning with missing sections |
| False completion signal | Agent exits 0 with the artifact present, but the content doesn't satisfy the stage criteria |
| Hedged handoff | Agent appends "continuing in next session..." — a strong signal of context exhaustion |
Detection approach (L4 sensor — see #45)

```ts
interface ContextWindowGuardConfig {
  minLines: number;                    // from template frontmatter (→ #44)
  requiredSections: string[];          // from template frontmatter
  forbiddenPhrases?: string[];         // e.g. "continuing in next session", "to be continued"
  tokenUsageWarningThreshold?: number; // fraction of model limit, e.g. 0.85
}
```

Structural checks (fast, deterministic)
- Minimum line count: the artifact must meet `minLines` from the template's `successCriteria`
- Required sections: all `requiredSections` must be present (heading-level check)
- Forbidden phrases: detect hedged-handoff language that signals premature wrap-up
- Token usage monitoring: if total tokens consumed exceed `tokenUsageWarningThreshold × modelLimit`, flag the run regardless of the structural checks
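The structural checks above can be sketched as a single pass over the artifact text. This is illustrative only: `runStructuralChecks` and `GuardResult` are not part of the SAO API, and the heading check assumes markdown artifacts.

```ts
// Sketch only — names and heading heuristic are assumptions, not the SAO API.
interface GuardConfig {
  minLines: number;
  requiredSections: string[];
  forbiddenPhrases?: string[];
}

interface GuardResult {
  pass: boolean;
  failures: string[]; // human-readable reasons, surfaced on retry/release
}

function runStructuralChecks(artifact: string, cfg: GuardConfig): GuardResult {
  const failures: string[] = [];
  const lines = artifact.split("\n");

  // Minimum line count
  if (lines.length < cfg.minLines) {
    failures.push(`line count ${lines.length} < minLines ${cfg.minLines}`);
  }

  // Heading-level check: a required section counts as present if any
  // markdown heading line contains its name (case-insensitive).
  const headings = lines
    .filter((l) => /^#{1,6}\s/.test(l))
    .map((l) => l.replace(/^#{1,6}\s+/, "").trim().toLowerCase());
  for (const section of cfg.requiredSections) {
    if (!headings.some((h) => h.includes(section.toLowerCase()))) {
      failures.push(`missing required section: ${section}`);
    }
  }

  // Hedged-handoff language anywhere in the artifact body
  for (const phrase of cfg.forbiddenPhrases ?? []) {
    if (artifact.toLowerCase().includes(phrase.toLowerCase())) {
      failures.push(`forbidden phrase detected: "${phrase}"`);
    }
  }

  return { pass: failures.length === 0, failures };
}
```

Collecting all failures (rather than short-circuiting on the first) lets the retry prompt tell the agent everything that was wrong with the previous attempt.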
Failure behaviour
| Trigger | Action |
| --- | --- |
| Structural check fails, retries remain | `retry-queued` with `CONTEXT_GUARD_FAIL` reason |
| Structural check fails, retries exhausted | `released` with `CONTEXT_EXHAUSTION` reason code |
| Token threshold exceeded (warning) | Log a warning; surface it in the StatusSurface; proceed to the other sensors |
| Token threshold exceeded + structural fail | Skip further retries; `released` with `CONTEXT_EXHAUSTION` |
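The trigger → action mapping above can be sketched as a pure decision function. The `Action` shape and field names here are assumptions for illustration, not the SAO state schema.

```ts
// Sketch of the failure-behaviour table; names are illustrative.
type Action =
  | { kind: "retry-queued"; reason: "CONTEXT_GUARD_FAIL" }
  | { kind: "released"; reason: "CONTEXT_EXHAUSTION" }
  | { kind: "proceed"; warning?: "TOKEN_THRESHOLD" };

function decide(
  structuralPass: boolean,
  tokenThresholdExceeded: boolean,
  retriesRemaining: number,
): Action {
  if (!structuralPass) {
    // Token pressure plus a structural failure: retrying is unlikely to
    // help, so skip further retries and release immediately. The same
    // outcome applies when retries are simply exhausted.
    if (tokenThresholdExceeded || retriesRemaining === 0) {
      return { kind: "released", reason: "CONTEXT_EXHAUSTION" };
    }
    return { kind: "retry-queued", reason: "CONTEXT_GUARD_FAIL" };
  }
  // Structural checks passed: a token warning alone is logged and
  // surfaced, but the run proceeds to the other sensors.
  return tokenThresholdExceeded
    ? { kind: "proceed", warning: "TOKEN_THRESHOLD" }
    : { kind: "proceed" };
}
```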
Token usage surface
The SAO design doc already notes that AgentRun captures token consumption. This issue requires:
- `AgentRun.tokenUsage: { input: number; output: number; total: number }` — populated from `--output-format stream-json`
- `AgentRun.contextExhaustionRisk: boolean` — set when `total > warningThreshold × modelLimit`
- Fleet dashboard (specorator#168) surfaces a token-risk indicator per active run
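A minimal sketch of deriving the two fields from a stream of usage events. The event shape (`usage.input_tokens` / `usage.output_tokens`) is an assumption about the stream-json format, not a confirmed schema.

```ts
// Sketch only — the event shape is an assumed reading of stream-json output.
interface TokenUsage { input: number; output: number; total: number }

interface AgentRunTokens {
  tokenUsage: TokenUsage;
  contextExhaustionRisk: boolean;
}

function summarizeTokens(
  events: Array<{ usage?: { input_tokens?: number; output_tokens?: number } }>,
  modelLimit: number,
  warningThreshold = 0.85,
): AgentRunTokens {
  let input = 0;
  let output = 0;
  for (const e of events) {
    input += e.usage?.input_tokens ?? 0;
    output += e.usage?.output_tokens ?? 0;
  }
  const total = input + output;
  return {
    tokenUsage: { input, output, total },
    // Risk flag per the rule above: total > warningThreshold × modelLimit
    contextExhaustionRisk: total > warningThreshold * modelLimit,
  };
}
```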
Open questions
- Should `forbiddenPhrases` be global defaults or stage-configurable?
- Should the guard emit an `agent_run.context_exhausted` event (→ Phase 1 · Failure-event taxonomy — which `*.failed` events ship in V1 #21)?
- How does the retry prompt for a `CONTEXT_GUARD_FAIL` run differ from a plain failure? Should the retry template include a context summary of what the previous attempt produced?
Acceptance
- `CONTEXT_EXHAUSTION` reason code added to the `released` state taxonomy
- `AgentRun.tokenUsage` and `contextExhaustionRisk` fields specified
- Retry strategy for `CONTEXT_GUARD_FAIL` decided (standard retry vs. context-summary retry)