Merged
37 changes: 37 additions & 0 deletions .changeset/prompt-processing-indicators.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
---
"@ai-sdk-tool/harness": patch
"@ai-sdk-tool/tui": patch
"@ai-sdk-tool/headless": patch
"plugsuits": patch
---

Surface the "prompt processing" state that previously looked frozen, and fix follow-up correctness gaps found during post-implementation review.

- Harness: new `LoopHooks.onStreamStart` / `onFirstStreamPart` hooks wrap the `agent.stream()` call site so consumers driving turns through `runAgentLoop` can react to the prompt-processing latency gap. `onFirstStreamPart` receives the current stream part as its first argument (`TextStreamPart<ToolSet>`) so consumers can inspect `part.type` to filter framing chunks (`start`, `text-start`, …) from visible content. `TextStreamPart` is re-exported from the harness root for convenience. Docstring clarifies that the TUI has its own independent `onStreamStart` on `AgentTUIConfig`.
- TUI: shows a `Processing...` loader during turn preparation and transitions to `Working...` once the LLM request is in flight. The startup token probe is now non-blocking (fire-and-forget) so the editor accepts input immediately; the context-usage footer starts from the estimated count and quietly upgrades to the real value. During a blocking compaction the foreground loader temporarily switches to `Compacting...` and restores the previous label when the block ends, so users see the actual reason for a long wait. `text-start` stream parts are now treated as visible, clearing the streaming loader as soon as the assistant view mounts (no more empty-view flicker).
- Headless: emits a `turn-start` lifecycle annotation and a matching `onStreamStart` callback before each LLM request; the event is dropped from `trajectory.json` (transient UX signal, no `step_id`) so persisted consumers see identical output. The event fires exactly once per logical turn — overflow and no-output retries no longer re-emit it. New tests cover normal ordering, `new-turn` vs `intermediate-step` phases, retry single-emission, and non-persistence in `trajectory.json`.
- Headless: the persisted `schema_version` is corrected from the internal `ATIF-v1.6` label to the actual current Harbor spec version `ATIF-v1.4` (<https://www.harborframework.com/docs/agents/trajectory-format>). Documentation across `packages/headless/AGENTS.md`, `packages/headless/README.md`, and `packages/cea/benchmark/AGENTS.md` now separates the internal JSONL streaming protocol (which carries lifecycle annotations such as `approval`, `compaction`, `interrupt`, `turn-start`) from the ATIF-v1.4 trajectory that `TrajectoryCollector` writes to disk.
- Headless: `StepMetrics` gains the remaining ATIF-v1.4 optional fields (`logprobs`, `prompt_token_ids`, `completion_token_ids`) and `TrajectoryJson.final_metrics` now aggregates `total_cost_usd`. `TrajectoryJson.extra` is typed as a closed record of exactly the three ATIF persistence buckets (`approval_events`, `compaction_events`, `interrupt_events`); new lifecycle types must extend the interface explicitly so the Harbor persistence contract stays type-enforced.
- CEA: the `--atif` CLI help text and the benchmark pipeline now reference ATIF-v1.4 (matching the corrected `schema_version`). The bundled `packages/cea/benchmark/test_trajectory.py` validator now calls Harbor's official `TrajectoryValidator` when `harbor` is importable and falls back to a stricter local shape check otherwise; it enforces per-step metric shapes and rejects `bool` values where ATIF requires a real number.
- Addressed PR review feedback:
  - `turn-start` and `onStreamStart` now fire strictly after `agent.stream()` successfully returns, so stream-creation failures no longer produce a false "stream started" signal (reported by Gemini, Codex, and Cubic reviewers).
  - The background startup usage probe is serialized against per-turn probes by a generation token; a stale startup probe can no longer overwrite newer usage data and skew context-pressure metrics.
  - The blocking-compaction spinner swap only stashes the original foreground label on first entry and only restores it when the foreground loader is still live, eliminating both the "Compacting..." wording sticking after unblock and the "Processing..." spinner resurrecting after the first stream part arrived.
  - Restored the post-`onSetup` `updateHeader()` call that was accidentally dropped when the startup probe became non-blocking, so any header/footer state that `onSetup` initialises renders immediately instead of waiting for the first probe to resolve.
  - The bundled Python ATIF validator (`test_trajectory.py`) no longer accepts `bool` values where ATIF v1.4 requires a real number. `isinstance(True, int)` is `True` in Python, so the old check let invalid metric payloads slip through. Added `_is_real_number` / `_is_real_int` helpers that exclude `bool`.
  - Observer hooks (`onStreamStart`, `onFirstStreamPart`) no longer abort a valid stream when the callback throws. Errors are logged via `console.error` and swallowed in the harness loop, headless runner, and TUI session loop, with the contract documented on `LoopHooks`.
  - Repaired a regression where `LoopHooks.onToolCall` had silently dropped out of the public `LoopHooks` interface while still being destructured inside `runAgentLoop`. The field is restored to its original signature; consumers that already relied on it are unaffected, and the destructuring now type-checks again.
  - Corrected the `LoopHooks.onFirstStreamPart` signature as a pre-adoption fix (Cubic P2): the previous `(context) => void` shape promised in its docstring that consumers could filter on part type, but the callback never received the part. The signature now passes `(part: TextStreamPart<ToolSet>, context)` so consumers can actually inspect `part.type`. Zero existing consumers were found across the monorepo (the hook was introduced earlier in this PR), so this is a type-only correction with no runtime migration. New regression tests in `loop.test.ts` cover single-fire semantics, per-iteration firing, empty-stream skip, and observer-error isolation.
  - Pinned the ATIF v1.4 compliance contract in-source: `trajectory-collector.ts`, `TrajectoryJson`, `AtifStep`, `TrajectoryEvent`, `collectTrajectoryEvent`, and `runHeadless` now carry module/interface-level JSDoc spelling out the Harbor spec version, the allowed `steps[*].source` values, the `extra.*` persistence rule, and the stream-vs-snapshot boundary. `packages/headless/AGENTS.md` gains an "ATIF v1.4 COMPLIANCE" section listing the same invariants, and the `atif-events.test.ts` suite now declares itself as the executable compliance contract. These are docs-only, but they turn future spec drift into an obvious code-review red flag instead of a silent regression.
- Review cycle 1 follow-ups (Oracle + Gemini + Codex + Cubic + CodeRabbit):
  - Guarded `TrajectoryCollector.writeTo` against persisting an invalid zero-step trajectory (Harbor's own validator rejects `steps: []`). The method now returns `boolean`: `true` when a file was written, `false` when the write was intentionally skipped to keep `trajectory.json` ATIF-v1.4 compliant.
  - Moved the TUI `showLoader("Processing...")` call inside the stream-turn `try/finally` so a thrown `prepareMessages` (or `onBeforeTurn`/usage probe/compaction check) no longer leaves the spinner stuck on screen.
  - Tightened the startup usage-probe guard: in addition to the generation token, `measureUsageIfAvailable` now captures `messageHistory.getRevision()` at call time and drops its result when the history has mutated mid-probe, preventing stale empty-message usage from overwriting per-turn measurements.
  - Narrowed `TrajectoryJson.extra` to the three canonical lifecycle buckets (`approval_events`, `compaction_events`, `interrupt_events`) by dropping the `Record<string, unknown>` intersection. New lifecycle types must now extend the interface explicitly, keeping the ATIF persistence contract type-enforced.
  - Hardened the Python validator: `_is_real_number` now rejects `NaN`, `Infinity`, and `-Infinity` (all of which `json.loads` will happily produce from non-strict JSON) via an explicit `math.isfinite` check.
  - Corrected documentation drift across `packages/headless/AGENTS.md`, `packages/headless/README.md`, `packages/headless/src/types.ts`, `packages/headless/src/trajectory-collector.ts`, and the root `AGENTS.md`: `approval`/`compaction`/`interrupt` are persisted under `trajectory.extra.*`, not JSONL-only; only `turn-start` and `error` are transient.
  - Regression test added for the `writeTo` zero-step guard: `does not write an invalid zero-step trajectory when the stream fails before any step`.
- Review cycle 2 follow-ups (Oracle re-audit):
  - Headless `measureUsageIfAvailable` now carries the same generation + revision guards the TUI already had. A slow background probe that resolves after a compaction or a newer per-turn probe no longer overwrites fresh usage data.
  - ATIF v1.4 step source contract aligned across code, Python validator, and benchmark docs: `user`, `agent`, and `system` are all permitted (Harbor v1.2+). Previous divergence between `AtifStep.source` and `test_trajectory.py`'s `valid_sources = {user, agent}` is resolved.
  - Root `README.md` headless event list now includes `turn-start` and points at Harbor's ATIF-v1.4 schema for the persisted trajectory.
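
The `bool` and non-finite pitfalls mentioned in the validator bullets can be reproduced in a few lines. This is only a minimal sketch: `is_real_number` mirrors the shape of the `_is_real_number` helper described above, and the JSON payload is a made-up example, not output from the headless runner.

```python
import json
import math


def is_real_number(value: object) -> bool:
    """Accept finite ints and floats; reject bool, NaN, and infinities."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(float(value))


# bool is a subclass of int, so a plain isinstance check accepts True.
naive_accepts_bool = isinstance(True, int)

# By default json.loads accepts the non-standard NaN / Infinity literals.
# Hypothetical payload, for illustration only.
parsed = json.loads('{"cost": NaN, "tokens": Infinity, "steps": true}')

# Every value in the payload should be rejected by the stricter helper.
results = {key: is_real_number(val) for key, val in parsed.items()}
print(naive_accepts_bool, results)
```

Checking `bool` first matters because `isinstance(True, (int, float))` would otherwise succeed before the finiteness check is ever reached.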
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -182,7 +182,7 @@ try {

- File edits in CEA favor hashline-aware operations (`LINE#HASH` + `expected_file_hash`) for stale-safe modifications.
- Manual tool-loop continuation is intentionally constrained to normalized `tool-calls` finish reasons.
- Headless mode emits structured ATIF JSONL lifecycle types (`metadata`, `step`, `approval`, `compaction`, `error`, `interrupt`) consumed by benchmark tooling.
- Headless mode emits a JSONL event stream with lifecycle types `metadata`, `step`, `approval`, `compaction`, `error`, `interrupt`, and `turn-start`. The persisted `trajectory.json` produced by `TrajectoryCollector` follows Harbor's ATIF-v1.4 schema (<https://www.harborframework.com/docs/agents/trajectory-format>): `approval`, `compaction`, and `interrupt` are bundled into `extra.*` buckets; `turn-start` and `error` are JSONL-only.
- `SkillsEngine` discovers skills from up to five directories: bundled, global skills, global commands, project skills, project commands.

## COMMANDS
2 changes: 1 addition & 1 deletion README.md
@@ -104,7 +104,7 @@ Available commands:
pnpm run headless -- "Fix the type error in src/index.ts"
```

Outputs structured ATIF JSONL events (`metadata`, `step`, `approval`, `compaction`, `error`, `interrupt`) for programmatic consumption.
Outputs a JSONL event stream (`metadata`, `step`, `approval`, `compaction`, `error`, `interrupt`, `turn-start`) for programmatic consumption. The persisted `trajectory.json` conforms to Harbor's ATIF-v1.4 schema.

## Architecture

12 changes: 8 additions & 4 deletions packages/cea/benchmark/AGENTS.md
@@ -36,7 +36,7 @@ Control agent behavior via environment variables:
| `AGENT_ENABLE_THINKING` | `1`, `true`, `yes` | Enable `--think` flag (captures reasoning content) |
| `AGENT_ENABLE_TOOL_FALLBACK` | `1`, `true`, `yes` | Enable `--tool-fallback` flag (XML-based tool calling for non-native models) |

## Event Flow (ATIF-v1.6)
## Event Flow

```
headless.ts (Docker) output.jsonl harbor_agent.py
@@ -51,9 +51,11 @@ headless.ts (Docker) output.jsonl harbor_agent.py
│ │ │
├─► emit InterruptEvent ───► interrupt ─────────────► lifecycle annotation
│ │ │
├─► emit TurnStartEvent ───► turn-start ────────────► lifecycle annotation (not persisted)
│ │ │
└─► emit StepEvent(agent) ─► step (agent) ──────────► Step(source="agent")
│ │
└───────────────────────► trajectory.json (ATIF-v1.6, written by headless)
└───────────────────────► trajectory.json (ATIF-v1.4, written by headless)
```

## Event Types (output.jsonl)
@@ -66,13 +68,15 @@ headless.ts (Docker) output.jsonl harbor_agent.py
| `compaction` | `event`, `tokensBefore`, `tokensAfter?`, `durationMs?` | History compaction events |
| `error` | `error`, `timestamp` | Fatal errors |
| `interrupt` | `reason`, `timestamp` | Intentional caller interruption |
| `turn-start` | `phase`, `timestamp` | Lifecycle annotation emitted once per logical turn right after `agent.stream()` dispatch; dropped by `TrajectoryCollector` and absent from `trajectory.json` |

## Verification

### 1. Event Type Distribution
```bash
cat jobs/<job_id>/*/agent/output.jsonl | jq -r '.type' | sort | uniq -c
# Expected output like: 1 metadata N step M compaction K approval optional interrupt (no unexpected 'error' lines)
# Expected output like: 1 metadata N step N turn-start M compaction K approval optional interrupt (no unexpected 'error' lines)
# Note: turn-start count should match the number of logical turns (== agent step count for linear conversations).
```
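
Where `jq` is not available, the same distribution check can be approximated in Python. This is a rough sketch that relies only on the `type` field used by the `jq` command above; the sample lines are hypothetical stand-ins for a real `output.jsonl`.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count the `type` field of each non-empty JSONL line."""
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines, e.g. a trailing newline
        counts[json.loads(raw)["type"]] += 1
    return counts


# Hypothetical sample standing in for jobs/<job_id>/*/agent/output.jsonl.
sample_lines = [
    '{"type": "metadata"}',
    '{"type": "turn-start", "phase": "new-turn"}',
    '{"type": "step"}',
    '{"type": "step"}',
    "",
]

counts = count_event_types(sample_lines)
# Any `error` line is worth inspecting by hand.
unexpected_errors = counts.get("error", 0)
print(dict(counts), unexpected_errors)
```

For a real run, the lines would come from opening the `output.jsonl` file and iterating over it instead of using `sample_lines`.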

### 2. Step ID Sequence
@@ -88,7 +92,7 @@ python -m harbor.utils.trajectory_validator jobs/<job_id>/*/agent/trajectory.jso
```

Validator expectations:
- `steps[*].source` is currently `user` or `agent`
- `steps[*].source` is `user`, `agent`, or `system` (ATIF v1.4 permits all three; system steps support observations since v1.2)
- bundled tool observations live in `steps[*].observation.results`
- persisted lifecycle annotations, when present, live under `extra.approval_events`, `extra.compaction_events`, and `extra.interrupt_events`
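
The three expectations above can be spot-checked with a small standalone function. This is an illustrative sketch, not Harbor's validator: `check_expectations` is a hypothetical name, and treating `observation.results` and each `extra.*` bucket as a list is an assumption made for the example.

```python
VALID_SOURCES = {"user", "agent", "system"}
EXTRA_BUCKETS = ("approval_events", "compaction_events", "interrupt_events")


def check_expectations(trajectory: dict) -> list[str]:
    """Return readable violations of the expectations listed above."""
    errors: list[str] = []
    for i, step in enumerate(trajectory.get("steps", [])):
        source = step.get("source")
        if source not in VALID_SOURCES:
            errors.append(f"steps[{i}].source: unexpected value {source!r}")
        observation = step.get("observation")
        if observation is not None:
            # Assumption: a present observation is a dict whose `results` is a list.
            results = observation.get("results") if isinstance(observation, dict) else None
            if not isinstance(results, list):
                errors.append(f"steps[{i}].observation.results: must be a list")
    extra = trajectory.get("extra") or {}
    for bucket in EXTRA_BUCKETS:
        # Assumption: each persisted bucket, when present, is a list of events.
        if bucket in extra and not isinstance(extra[bucket], list):
            errors.append(f"extra.{bucket}: must be a list when present")
    return errors


# Hypothetical trajectories, for illustration only.
good = {
    "steps": [
        {"source": "user"},
        {"source": "agent", "observation": {"results": []}},
        {"source": "system"},
    ],
    "extra": {"compaction_events": []},
}
bad = {"steps": [{"source": "tool"}], "extra": {"interrupt_events": {}}}

print(check_expectations(good))
print(check_expectations(bad))
```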

2 changes: 1 addition & 1 deletion packages/cea/benchmark/scorer.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ def parse_timestamp(ts: str | None) -> datetime | None:


def score_trajectory(trajectory: dict) -> dict:
    """Compute performance metrics from an ATIF-v1.6 trajectory dict."""
    """Compute performance metrics from an ATIF-v1.4 trajectory dict."""
    steps = trajectory.get("steps", [])
    fm = trajectory.get("final_metrics", {}) or {}
    compaction_events = trajectory.get("extra", {}).get("compaction_events", [])
95 changes: 87 additions & 8 deletions packages/cea/benchmark/test_trajectory.py
@@ -1,17 +1,28 @@
#!/usr/bin/env python3
"""ATIF-v1.6 trajectory validation test.
"""ATIF-v1.4 trajectory validation test.

Usage: python3 test_trajectory.py <trajectory.json>
"""

from __future__ import annotations
import json
import math
import sys
from pathlib import Path


def _is_real_number(value: object) -> bool:
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(float(value))


def _is_real_int(value: object) -> bool:
    return isinstance(value, int) and not isinstance(value, bool)


def validate_trajectory(path: str) -> list[str]:
    """Validate trajectory.json against ATIF-v1.6 spec. Returns list of errors."""
    """Validate trajectory.json against ATIF-v1.4 spec. Returns list of errors."""
    errors = []

    try:
@@ -21,9 +32,9 @@ def validate_trajectory(path: str) -> list[str]:
        return [f"Cannot read/parse file: {e}"]

    # 1. schema_version
    if t.get("schema_version") != "ATIF-v1.6":
    if t.get("schema_version") != "ATIF-v1.4":
        errors.append(
            f"schema_version: expected 'ATIF-v1.6', got {t.get('schema_version')!r}"
            f"schema_version: expected 'ATIF-v1.4', got {t.get('schema_version')!r}"
        )

    # 2. session_id present
@@ -53,8 +64,9 @@ def validate_trajectory(path: str) -> list[str]:
    if step_ids != expected:
        errors.append(f"step_ids: expected {expected}, got {step_ids}")

    # 6. each step has required fields
    valid_sources = {"user", "agent"}
    # 6. each step has required fields. ATIF v1.4 permits "user", "agent",
    # and "system" as step sources (system steps support observations since v1.2).
    valid_sources = {"user", "agent", "system"}
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            continue
@@ -84,8 +96,51 @@ def validate_trajectory(path: str) -> list[str]:
    if not isinstance(fm, dict):
        errors.append("final_metrics must be a dictionary")
        return errors
    if not isinstance(fm.get("total_steps"), int):
    if not _is_real_int(fm.get("total_steps")):
        errors.append("final_metrics.total_steps: must be an integer")
    for token_field in (
        "total_prompt_tokens",
        "total_completion_tokens",
        "total_cached_tokens",
        "total_cost_usd",
    ):
        value = fm.get(token_field)
        if value is not None and not _is_real_number(value):
            errors.append(
                f"final_metrics.{token_field}: must be a number or null, got {type(value).__name__}"
            )

    # 8b. per-step metrics shape
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            continue
        metrics = step.get("metrics")
        if metrics is None:
            continue
        if not isinstance(metrics, dict):
            errors.append(f"steps[{i}].metrics: must be a dictionary when present")
            continue
        for num_field in (
            "prompt_tokens",
            "completion_tokens",
            "cached_tokens",
            "cost_usd",
        ):
            value = metrics.get(num_field)
            if value is not None and not _is_real_number(value):
                errors.append(
                    f"steps[{i}].metrics.{num_field}: must be a number when present"
                )
        for list_field in (
            "logprobs",
            "prompt_token_ids",
            "completion_token_ids",
        ):
            value = metrics.get(list_field)
            if value is not None and not isinstance(value, list):
                errors.append(
                    f"steps[{i}].metrics.{list_field}: must be a list when present"
                )

    # 9. persisted lifecycle annotations under extra
    extra = t.get("extra")
@@ -104,6 +159,22 @@ def validate_trajectory(path: str) -> list[str]:
    return errors


def run_harbor_validator(path: str) -> list[str] | None:
    """Run Harbor's official trajectory_validator when the harbor package is
    importable. Returns None when Harbor isn't installed so the caller can
    fall back to the bundled validator."""
    try:
        from harbor.utils.trajectory_validator import TrajectoryValidator
    except ImportError:
        return None

    validator = TrajectoryValidator()
    is_valid = validator.validate(path)
    if is_valid:
        return []
    return [f"harbor: {err}" for err in validator.get_errors()]


def main() -> None:
    if len(sys.argv) < 2:
        print("Usage: python3 test_trajectory.py <trajectory.json>")
@@ -112,6 +183,11 @@ def main() -> None:
    path = sys.argv[1]
    errors = validate_trajectory(path)

    harbor_errors = run_harbor_validator(path)
    harbor_used = harbor_errors is not None
    if harbor_errors:
        errors.extend(harbor_errors)

    if errors:
        print(f"VALIDATION FAILED: {path}")
        for e in errors:
@@ -128,7 +204,7 @@ def main() -> None:
    print(f" session_id: {t.get('session_id')}")
    print(f" steps: {len(steps)}")
    print(
        f" final_metrics: total_prompt={fm.get('total_prompt_tokens')}, total_completion={fm.get('total_completion_tokens')}"
        f" final_metrics: total_prompt={fm.get('total_prompt_tokens')}, total_completion={fm.get('total_completion_tokens')}, total_cost={fm.get('total_cost_usd')}"
    )
    extra = t.get("extra", {}) or {}
    print(
@@ -137,6 +213,9 @@ def main() -> None:
        f"compaction={len(extra.get('compaction_events', []))}, "
        f"interrupt={len(extra.get('interrupt_events', []))}"
    )
    print(
        f" harbor_validator: {'passed' if harbor_used else 'skipped (harbor package not installed)'}"
    )
    sys.exit(0)

