feat: surface prompt-processing state and unblock TUI startup by minpeter · Pull Request #117 · minpeter/plugsuits

minpeter · 2026-04-21T07:18:16Z

Summary

Users were seeing a frozen UI during two separate silent gaps in the agent lifecycle. This change surfaces both with a loading indicator and makes the startup probe non-blocking.

Gap 1 — per-turn prompt processing (user hits Enter → LLM stream starts): Local prep (`onBeforeTurn`, `measureUsage`, compaction) had no spinner at all. Now shows `Processing...` during prep and transitions to `Working...` once the LLM request is in flight.
Gap 2 — startup blocking probe (first launch blocks input for 200 ms–3 s): The initial `measureUsageIfAvailable([])` call that measures system-prompt + tool-schema token overhead now runs fire-and-forget. The context-usage footer starts on the estimated value and silently upgrades to the real one once the probe resolves.

Changes by package

`@ai-sdk-tool/harness` (patch)

New `LoopHooks.onStreamStart` — fires after `agent.stream()` is dispatched, before `fullStream` iteration begins. Closest hook to "request sent, waiting for first chunk".
New `LoopHooks.onFirstStreamPart` — fires exactly once per iteration when any part arrives. Useful for clearing prompt-processing indicators.
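
The ordering of the two hooks can be sketched as follows. This is a minimal illustration, not the harness implementation: `SketchHooks`, `safeHook`, `runOneIteration`, and `fakeStream` are hypothetical names, and the hook signatures are simplified (the real hooks also receive a context argument).

```typescript
// Hypothetical, simplified shapes; the real LoopHooks types live in @ai-sdk-tool/harness.
type StreamPart = { type: string; text?: string };

interface SketchHooks {
  onStreamStart?: () => void | Promise<void>;
  onFirstStreamPart?: (part: StreamPart) => void | Promise<void>;
}

// Run an observer hook without letting its failure abort the loop.
async function safeHook(
  name: string,
  fn: () => void | Promise<void>
): Promise<void> {
  try {
    await fn();
  } catch (error) {
    console.error(`${name} hook failed:`, error);
  }
}

async function runOneIteration(
  createStream: () => AsyncIterable<StreamPart>,
  hooks: SketchHooks
): Promise<StreamPart[]> {
  // If stream creation throws, neither hook fires.
  const stream = createStream();
  await safeHook("onStreamStart", () => hooks.onStreamStart?.());

  const parts: StreamPart[] = [];
  let sawFirstPart = false;
  for await (const part of stream) {
    if (!sawFirstPart) {
      // Fires exactly once per iteration, on whichever part arrives first.
      sawFirstPart = true;
      await safeHook("onFirstStreamPart", () => hooks.onFirstStreamPart?.(part));
    }
    parts.push(part);
  }
  return parts;
}

// Stand-in for an agent stream, for illustration only.
async function* fakeStream(): AsyncGenerator<StreamPart> {
  yield { type: "text-start" };
  yield { type: "text-delta", text: "hi" };
}
```

Wrapping each hook in `safeHook` reflects the stated behavior that hook errors are logged and swallowed, so an observer cannot break the turn.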

`@ai-sdk-tool/tui` (patch)

Two-stage foreground loader: `Processing...` → `Working...` → cleared on first visible stream part.
`AgentTUIConfig.onStreamStart` callback exposed so consumers can hook the boundary.
Startup `measureUsage([])` probe is now fire-and-forget; editor accepts input immediately on launch.
New top-level `ignore` helper + `runBackgroundStartupProbe` to keep the fire-and-forget semantics self-documenting (no empty catch blocks).
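
A rough sketch of the fire-and-forget pattern described above. The `.then(ignore, ignore)` call matches the snippet quoted later in this thread; the `footer` object and the body of `measureUsageIfAvailable` are hypothetical stand-ins for the TUI's real state.

```typescript
// Named no-op: makes "result intentionally discarded" explicit at the call site.
const ignore = (): void => {};

// Hypothetical footer state; starts on the estimate.
const footer: { tokens: number; source: "estimated" | "measured" } = {
  tokens: 1000,
  source: "estimated",
};

// Hypothetical stand-in for the real probe, which calls the provider.
async function measureUsageIfAvailable(messages: string[]): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  footer.tokens = 1200 + messages.length;
  footer.source = "measured";
}

function runBackgroundStartupProbe(): void {
  // Both fulfilment and rejection go to `ignore`, so a failed probe
  // cannot surface as an unhandled rejection and nothing is awaited.
  measureUsageIfAvailable([]).then(ignore, ignore);
}
```

Calling `runBackgroundStartupProbe()` returns synchronously; the footer keeps the estimated value until the probe resolves.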

`@ai-sdk-tool/headless` (patch)

New `TurnStartEvent` (`type: "turn-start"`) lifecycle annotation emitted between `agent.stream()` dispatch and the first chunk. Carries no `step_id` (consistent with other lifecycle events).
New `HeadlessRunnerConfig.onStreamStart` callback mirrors the TUI hook.
ATIF consumers unaffected: `TrajectoryCollector` routes unknown event types to `default` and drops `turn-start` from `trajectory.json` (transient UX signal).
`packages/headless/AGENTS.md` and `README.md` updated with the new event type, shape, and TypeScript export.
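
To illustrate how a transient event can stream on JSONL without being persisted, here is a minimal sketch. `SketchEvent`, `SketchCollector`, and this `emitAndCollect` signature are hypothetical and heavily trimmed; only the `turn-start` fields (`type`, `phase`, `timestamp`) come from the snippets in this PR, and the `phase` type is assumed to be a string.

```typescript
// Hypothetical, trimmed-down event union; the real TrajectoryEvent has more members.
type StepEvent = { type: "step"; step_id: number; message: string };
type TurnStartEvent = { type: "turn-start"; phase: string; timestamp: string };
type SketchEvent = StepEvent | TurnStartEvent;

class SketchCollector {
  private readonly steps: StepEvent[] = [];

  collect(event: SketchEvent): void {
    switch (event.type) {
      case "step":
        this.steps.push(event);
        break;
      default:
        // turn-start is a transient UX signal: drop it.
        break;
    }
  }

  // What would be written to trajectory.json in this sketch.
  finalize(): { steps: StepEvent[] } {
    return { steps: [...this.steps] };
  }
}

// Every event goes to the JSONL output; the collector decides what to keep.
function emitAndCollect(
  event: SketchEvent,
  collector: SketchCollector,
  jsonlOut: string[]
): void {
  jsonlOut.push(JSON.stringify(event));
  collector.collect(event);
}
```

With this shape, a consumer reading stdout sees `turn-start`, while the finalized trajectory contains only steps.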

UX before / after

Before (per-turn prep):
```
[Enter] → [0–500 ms silent freeze] → 'Working...' → first token
└── GAP
```

After (per-turn prep):
```
[Enter] → 'Processing...' → 'Working...' → cleared on first visible part
└── prep (onBeforeTurn, measureUsage, compaction)
└── LLM request in flight (TTFB)
```

Before (first launch):
```
tui.start() → onSetup() → await measureUsage([]) ★ 200ms–3s blocking ★ → editor accepts input
```

After (first launch):
```
tui.start() → onSetup() → runBackgroundStartupProbe() ★ returns immediately ★ → editor accepts input
└── background: footer updates when probe resolves
```

Compatibility

All new hooks/fields are optional; existing consumers compile and run unchanged.
`TrajectoryEvent` is a union extension — existing `switch` consumers hit their `default` branch and ignore `turn-start`.
`trajectory.json` output is identical (`turn-start` is JSONL-only, not persisted).
`session-tui` / `session-headless` adapters pass the new `onStreamStart` through automatically via their existing `Omit` spread pattern.

Verification

`pnpm run typecheck` — 6/6 packages pass
`pnpm run check` (ultracite) — clean
`pnpm run test` — all 566 tests pass; headless `runner.test.ts` output now includes `turn-start` events
`pnpm run build` — 6/6 packages succeed

Changeset

`prompt-processing-indicators.md` — patch bump for `@ai-sdk-tool/harness`, `@ai-sdk-tool/tui`, `@ai-sdk-tool/headless`. No consumer code breaks; purely additive public API.

Summary by cubic

Adds clear “Processing…” → “Working…” indicators and makes TUI startup input non‑blocking. Aligns persisted trajectories with Harbor ATIF‑v1.4 and tightens guards, event ordering, and validation.

New Features
- @ai-sdk-tool/tui: Two‑stage loader; non‑blocking startup usage probe; new onStreamStart; treats text-start as visible.
- @ai-sdk-tool/harness: New LoopHooks.onStreamStart and onFirstStreamPart(part, context); re‑exports TextStreamPart.
- @ai-sdk-tool/headless: Emits transient turn-start and onStreamStart; persists ATIF‑v1.4 with logprobs, prompt_token_ids, completion_token_ids, and final_metrics.total_cost_usd; docs clarify JSONL vs ATIF; CEA CLI/validator updated to ATIF‑v1.4 and calls Harbor when available; allows steps[*].source = user | agent | system; TrajectoryJson.extra is a closed record of approval_events/compaction_events/interrupt_events.
Bug Fixes
- Headless: turn-start fires once per logical turn and only after agent.stream() succeeds; never persisted; zero‑step trajectories are not written; usage probe guarded by generation + history revision.
- TUI: showLoader("Processing...") moved into the turn try block; blocking compaction shows “Compacting…” and restores correctly; startup probe is fire‑and‑forget with generation + revision guards; updateHeader() restored after onSetup.
- Hooks: onStreamStart/onFirstStreamPart errors are logged and swallowed; restored LoopHooks.onToolCall type; onFirstStreamPart now receives (part, context).
- Validator/docs: Rejects bool and non‑finite numbers; root docs list turn-start and link Harbor ATIF‑v1.4; changeset text corrected to reflect the closed TrajectoryJson.extra contract.

Written for commit 8df785b. Summary will update on new commits.

Summary by CodeRabbit

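Release notes

New features
- Added a prompt-processing status indicator (Processing... → Working... transition)
- Added support for a callback at stream start
Improvements
- Improved cost tracking and token-metric aggregation
- Added loader state restoration during blocking operations
- Tightened trajectory validation and improved metadata accuracy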

Previously users saw a frozen UI during two separate silent gaps:

1. Between hitting Enter and the LLM stream starting: during local prep (onBeforeTurn, measureUsage, compaction) there was no spinner at all.
2. On first launch the startup token probe blocked the input loop for 200ms–3s while it measured system-prompt + tool-schema token overhead.

This change:

- Adds onStreamStart and onFirstStreamPart hooks to the harness LoopHooks so any agent runtime can signal the exact prompt-processing window.
- In the TUI, shows a 'Processing...' loader during turn preparation and transitions to 'Working...' once the LLM request is in flight (cleared on the first visible stream part).
- In headless, emits a 'turn-start' lifecycle annotation and a matching onStreamStart callback; extends the TrajectoryEvent union accordingly.
- Runs the TUI startup measureUsage probe fire-and-forget so the editor accepts input immediately; the context-usage footer starts on the estimated value and quietly upgrades to the real one when ready.

Documents the new turn-start event in the headless AGENTS.md / README.

coderabbitai · 2026-04-21T07:18:26Z

Important

Review skipped

Auto reviews are disabled on this repository. To trigger a review, include @crb review in the PR description. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


📝 Walkthrough

Walkthrough

This PR introduces a coordinated patch across several packages that surfaces and fixes prompt-processing and turn-lifecycle behavior that previously looked frozen. It adds stream-start signaling through the onStreamStart and onFirstStreamPart callbacks, emits a turn-start lifecycle annotation, migrates the ATIF schema from v1.6 to v1.4, improves processing-state display in the TUI, and extends the optional metric fields.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| ATIF schema migration: `packages/headless/src/trajectory-collector.ts`, `packages/cea/src/entrypoints/main.ts`, `packages/cea/benchmark/scorer.py`, `packages/cea/benchmark/test_trajectory.py` | Updates the schema version from ATIF v1.6 to v1.4 (Harbor-compatible), adds the `total_cost_usd` metric field, and tightens validation to reject `bool` values in numeric fields. |
| Lifecycle event system: `packages/headless/src/types.ts`, `packages/headless/src/runner.ts`, `packages/headless/src/__tests__/atif-events.test.ts`, `packages/headless/src/runner.test.ts` | Introduces the `turn-start` event type, emits it exactly once after the `agent.stream()` call, and keeps the event in the JSONL stream only, excluding it from the persisted trajectory.json. |
| Harness observer hooks: `packages/harness/src/types.ts`, `packages/harness/src/loop.ts`, `packages/harness/src/index.ts`, `packages/harness/src/loop.test.ts` | Adds the `onStreamStart` and `onFirstStreamPart` callbacks with error isolation, and re-exports the `TextStreamPart` type. |
| TUI processing display improvements: `packages/tui/src/agent-tui.ts`, `packages/tui/src/stream-handlers.ts` | Switches the loader from "Processing..." to "Working..." around stream start, makes usage probing non-blocking, manages loader state during compaction, and treats `text-start` as visible. |
| Documentation and validation improvements: `packages/headless/AGENTS.md`, `packages/headless/README.md`, `packages/cea/benchmark/AGENTS.md` | Updates the documentation to clarify ATIF v1.4 compliance requirements, the path `turn-start` events take, and the stream protocol. |

Sequence diagrams

sequenceDiagram
    participant Client
    participant Harness as Harness<br/>(runAgentLoop)
    participant Stream as Agent<br/>Stream
    participant Observer as Observer<br/>(onStreamStart,<br/>onFirstStreamPart)
    participant ProcessLoop as Stream<br/>Processing Loop

    Client->>Harness: runAgentLoop(...)
    Harness->>Stream: agent.stream()
    Stream-->>Harness: stream created
    Harness->>Observer: onStreamStart(context)
    Observer-->>Harness: [hook completes]
    Harness->>ProcessLoop: start fullStream iteration
    ProcessLoop->>Stream: await first chunk
    Stream-->>ProcessLoop: part[0]
    ProcessLoop->>Observer: onFirstStreamPart(part[0], context)
    Observer-->>ProcessLoop: [hook completes]
    ProcessLoop->>ProcessLoop: continue iteration
    ProcessLoop->>Stream: await next chunk
    Stream-->>ProcessLoop: part[1..n]
    ProcessLoop->>ProcessLoop: process all parts
    ProcessLoop-->>Harness: iteration complete
    Harness-->>Client: loop result

sequenceDiagram
    participant Client
    participant Headless as Headless<br/>(runHeadless)
    participant Collector as Trajectory<br/>Collector
    participant Stream as Agent<br/>Stream
    participant JSONL as JSONL<br/>Emitter
    participant File as trajectory.json<br/>(ATIF v1.4)

    Client->>Headless: runHeadless(config)
    Headless->>Stream: agent.stream()
    Stream-->>Headless: stream created
    Headless->>JSONL: emit TurnStartEvent
    JSONL->>Collector: collect turn-start
    Collector->>Collector: [DROP - transient]
    JSONL-->>Client: stdout JSONL
    Headless->>Headless: processStream()
    Headless->>JSONL: emit step events
    JSONL->>Collector: collect steps
    Collector->>Collector: [PERSIST]
    Headless->>Collector: finalize()
    Collector->>Collector: aggregate metrics<br/>(total_cost_usd)
    Collector->>File: write trajectory.json<br/>(schema_version: v1.4)
    File-->>Client: persisted ATIF v1.4

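Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Catching the moment a turn begins,
sending a signal before the stream flows,
costs are summed and it dances to v1.4,
the observers weather errors well,
our agent mid-processing, what a lovely trip! ✨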

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main changes in the PR. Its two core goals, surfacing prompt-processing state and unblocking TUI startup, are expressed concisely and clearly. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


gemini-code-assist

Code Review

This pull request introduces "prompt processing" indicators across the harness, TUI, and headless packages to improve user feedback during LLM request latency. It adds onStreamStart and onFirstStreamPart hooks to the agent loop and a new turn-start event for headless mode. In the TUI, the startup token probe is now non-blocking. Feedback includes addressing a race condition in the TUI's background startup probe and ensuring that the onStreamStart hook and related UI states are triggered after the agent stream is dispatched for consistency and accurate latency signaling.

gemini-code-assist · 2026-04-21T07:20:10Z

  };

+  const runBackgroundStartupProbe = (): void => {
+    measureUsageIfAvailable([]).then(ignore, ignore);


Running measureUsageIfAvailable([]) as a fire-and-forget background task at startup introduces a race condition. If the user quickly starts a turn and that turn's usage probe (which is awaited) finishes before this background probe, the background probe will eventually resolve and overwrite the messageHistory with stale usage data for an empty message list. This would result in an incorrect token count being displayed in the UI footer.

Consider adding a tracking mechanism (e.g., a sequential ID or a timestamp) to measureUsageIfAvailable to ensure that only the result of the most recently requested probe is applied to the state.
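
One way to realize the sequential-ID idea is a generation counter, sketched below under stated assumptions: `usageState` and `measureUsageGuarded` are hypothetical names, and the real code would write through `messageHistory.updateActualUsage(...)` rather than a plain object.

```typescript
// Hypothetical state for illustration only.
const usageState: { latestGeneration: number; appliedTokens: number | null } = {
  latestGeneration: 0,
  appliedTokens: null,
};

async function measureUsageGuarded(probe: () => Promise<number>): Promise<void> {
  // Claim a ticket before awaiting, so later requests supersede this one.
  usageState.latestGeneration += 1;
  const generation = usageState.latestGeneration;

  const tokens = await probe();

  if (generation !== usageState.latestGeneration) {
    // A newer probe was requested while this one was in flight:
    // drop the stale result instead of overwriting fresher data.
    return;
  }
  usageState.appliedTokens = tokens;
}
```

With this guard, a slow startup probe that resolves after a faster per-turn probe simply discards its result. A trade-off of this simple scheme is that only the most recently requested probe can ever apply, even if it fails.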

gemini-code-assist · 2026-04-21T07:20:10Z

+        emitAndCollect({
+          type: "turn-start",
+          phase,
+          timestamp: new Date().toISOString(),
+        });
+        await config.onStreamStart?.(phase);
+
        const streamPromise = Promise.resolve(
          config.agent.stream(streamOptions)
        );


The onStreamStart hook and turn-start event are currently emitted before the agent stream is dispatched. According to the LoopHooks documentation and the implementation in runAgentLoop, this hook should fire immediately after agent.stream() is invoked. This ensures that the "Working..." or "turn-start" state accurately reflects the period where the request is in flight, and avoids triggering the hook if agent.stream() itself throws an error during initialization.

Suggested change

-        emitAndCollect({
-          type: "turn-start",
-          phase,
-          timestamp: new Date().toISOString(),
-        });
-        await config.onStreamStart?.(phase);
-
-        const streamPromise = Promise.resolve(
-          config.agent.stream(streamOptions)
-        );
+        const streamPromise = Promise.resolve(
+          config.agent.stream(streamOptions)
+        );
+        emitAndCollect({
+          type: "turn-start",
+          phase,
+          timestamp: new Date().toISOString(),
+        });
+        await config.onStreamStart?.(phase);

gemini-code-assist · 2026-04-21T07:20:10Z

+      showLoader("Working...");
+      await config.onStreamStart?.(phase);


The onStreamStart hook and the "Working..." loader state are triggered before the agent stream is dispatched. For consistency with the harness loop and to accurately signal the start of the LLM request latency gap (TTFB), these should be moved after the config.agent.stream call. This also prevents the "Working..." state from being entered if the stream creation itself fails.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94712f580d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-21T07:21:45Z

  };

+  const runBackgroundStartupProbe = (): void => {
+    measureUsageIfAvailable([]).then(ignore, ignore);


Prevent stale startup probe from clobbering usage state

Launching measureUsageIfAvailable([]) in the background creates a race with the first real turn’s measureUsageIfAvailable(messages) call, and both paths write to messageHistory.updateActualUsage(...) without any revision guard. If the startup probe resolves last, it can overwrite newer token usage with the empty-message baseline, which underestimates context pressure and can mis-size maxOutputTokens/compaction decisions until another measurement occurs.


chatgpt-codex-connector · 2026-04-21T07:21:45Z

+          type: "turn-start",
+          phase,
+          timestamp: new Date().toISOString(),
+        });
+        await config.onStreamStart?.(phase);


Emit turn-start after dispatching the stream request

turn-start is emitted (and onStreamStart awaited) before config.agent.stream(streamOptions) is invoked, so failures during stream creation still produce a false "request started" lifecycle event. This breaks the event contract documented as "after dispatch" and can skew downstream latency/phase metrics for consumers that treat turn-start as the network dispatch boundary.
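
The requested ordering can be sketched like this. It is an illustration only: `dispatchTurn` and `LifecycleEvent` are hypothetical, and the sketch awaits stream creation before emitting so that a rejected or throwing `createStream` produces no event.

```typescript
type LifecycleEvent = { type: "turn-start"; phase: string; timestamp: string };

async function dispatchTurn<T>(
  createStream: () => T | Promise<T>,
  emit: (event: LifecycleEvent) => void,
  phase: string
): Promise<T> {
  // Create the stream first. If this throws or rejects, the function
  // exits here and no "request started" signal is ever emitted.
  const stream = await createStream();

  emit({ type: "turn-start", phase, timestamp: new Date().toISOString() });
  return stream;
}
```

Consumers that treat `turn-start` as the dispatch boundary then never see it for a request that was not actually sent.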


cubic-dev-ai

2 issues found across 8 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/headless/src/runner.ts">

<violation number="1" location="packages/headless/src/runner.ts:673">
P2: `turn-start` is emitted before `agent.stream()` dispatch, so lifecycle events can claim a stream started when it hasn't.</violation>
</file>

<file name="packages/headless/README.md">

<violation number="1" location="packages/headless/README.md:185">
P3: Document the actual sequencing: `turn-start` is emitted before `agent.stream()` is called, not after it is dispatched.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Post-implementation audit surfaced four issues and two gaps; all are addressed here to keep the PR self-contained.

- BUG A (headless): turn-start was re-emitted on every retry, so a single logical turn could produce 2-4 turn-start events (overflow retry + up to 3 no-output retries). Added a hasEmittedTurnStart flag in runSingleTurn scope so the event fires at most once per logical turn; flag resets naturally on the next turn.
- BUG B (TUI): during a blocking compaction the foreground loader stayed on 'Processing...' while the real wait was the compaction LLM call. The onBlockingChange callback now swaps the foreground label to 'Compacting...' on block entry and restores the previous label on block exit, so users see the real reason for the wait.
- BUG C (TUI): text-start stream parts were not counted as visible, leaving the loader spinning after the empty AssistantStreamView had already mounted. text-start is now visible; reasoning-start follows the existing showReasoning flag.
- BUG D (harness): documented that LoopHooks.onStreamStart fires only under runAgentLoop; the TUI has its own independent hook on AgentTUIConfig with a different signature.
- GAP 1 (tests): added retry-path assertions that turn-start fires exactly once, plus a new test that confirms normal-path ordering (metadata -> user step -> turn-start -> agent step) and an intermediate-step phase test for tool-continuation turns.
- GAP 1b (tests): added a test that turn-start events never appear in the persisted trajectory.json even though they stream on JSONL.
- GAP 2 (docs): updated packages/cea/benchmark/AGENTS.md event table and flow diagram to include turn-start and note that it is not persisted.
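
The once-per-logical-turn guard from BUG A can be sketched as below. Only the `hasEmittedTurnStart` flag name comes from the commit message; `Attempt`, the `onDispatched` callback, and this `runSingleTurn` signature are hypothetical simplifications of the retry loop.

```typescript
// Hypothetical attempt shape: it calls onDispatched once its stream is
// dispatched and resolves to null when the attempt produced no output.
type Attempt = (onDispatched: () => void) => Promise<string | null>;

async function runSingleTurn(
  attempt: Attempt,
  emitTurnStart: () => void,
  maxAttempts = 4
): Promise<string | null> {
  // Scoped to this call, so the flag resets naturally on the next turn.
  let hasEmittedTurnStart = false;

  const onDispatched = (): void => {
    if (hasEmittedTurnStart) {
      return; // a retry of the same logical turn: stay silent
    }
    hasEmittedTurnStart = true;
    emitTurnStart();
  };

  for (let i = 0; i < maxAttempts; i += 1) {
    const output = await attempt(onDispatched);
    if (output !== null) {
      return output;
    }
  }
  return null;
}
```

Because the flag lives in the function scope rather than on a shared object, no explicit reset is needed between turns.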

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87450f1082


chatgpt-codex-connector · 2026-04-21T07:39:58Z

    await config.onSetup?.();
-    await measureUsageIfAvailable([]);
-    updateHeader();
+
+    runBackgroundStartupProbe();



Refresh header after onSetup completes

updateHeader() now runs only before onSetup, and this block immediately starts the background probe without a post-setup refresh. If onSetup asynchronously initializes header/footer state (for example, setting a resolved model subtitle), the rendered header stays stale until an unrelated later render (such as a usage probe update or first turn), so startup metadata can be wrong for the entire idle session.


cubic-dev-ai

1 issue found across 7 files (changes from recent commits).


<file name="packages/tui/src/agent-tui.ts">

<violation number="1" location="packages/tui/src/agent-tui.ts:931">
P2: When blocking compaction ends you always restore the stashed foreground loader, even if `clearStatus()` already removed it (e.g., once the first stream part arrived). That resurrects a stale “Processing…” spinner and makes the UI look frozen again. Guard the restoration (and clear the stash) when the loader was cleared during the block.</violation>
</file>


The persisted trajectory was tagged 'ATIF-v1.6', which does not exist on Harbor's public spec. The current Harbor ATIF version is v1.4 (see https://www.harborframework.com/docs/agents/trajectory-format): v1.0, v1.1, v1.2, v1.3, v1.4; v1.5 and v1.6 were never released.

The 'v1.6' label was an internal bump accompanying a redesign of the JSONL event types on 2026-04-03; it conflated two distinct surfaces:

- The ATIF trajectory that TrajectoryCollector writes to disk, which must follow Harbor's published schema (v1.4).
- The internal stdout JSONL protocol used by the headless runner, which carries lifecycle annotations (approval, compaction, interrupt, turn-start) that ATIF does not define.

This commit:

- Sets schema_version to 'ATIF-v1.4' in TrajectoryCollector (both the TypeScript type literal and the runtime value).
- Updates the Python validator (test_trajectory.py), Python scorer, CLI help text, and test assertions to expect 'ATIF-v1.4'.
- Rewrites packages/headless/AGENTS.md, README.md, and packages/cea/benchmark/AGENTS.md to separate the ATIF persisted format from the internal JSONL streaming protocol, with a pointer to Harbor's spec page.
- Adds a JSDoc header on TrajectoryEvent types explaining the split.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49cd7261de


chatgpt-codex-connector · 2026-04-21T08:02:58Z

+            if (foregroundStatusMessage !== null) {
+              foregroundStatusBeforeBlocking = foregroundStatusMessage;
+              showLoader("Compacting...");


Capture prior spinner text only once per blocking compaction

onBlockingChange can emit multiple blocking: true stages for a single compaction (starting → pruning → compacting from CompactionOrchestrator.blockAtHardLimit/handleOverflow), but this block overwrites foregroundStatusBeforeBlocking every time. After the first stage, the foreground text is already "Compacting...", so later stages replace the original "Processing..."/"Working..." snapshot, and the unblock path restores the wrong label. This breaks the intended status restoration and leaves users with stale compaction wording after blocking ends.
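
A minimal sketch of a stash-and-restore scheme that handles multi-stage blocking, and also keeps a cleared loader cleared, might look like this. `LoaderSketch` is a hypothetical class; only the field names `foregroundStatusMessage` and `foregroundStatusBeforeBlocking` and the `showLoader` / `clearStatus` calls are taken from this thread.

```typescript
class LoaderSketch {
  foregroundStatusMessage: string | null = null;
  private foregroundStatusBeforeBlocking: string | null = null;

  showLoader(label: string): void {
    this.foregroundStatusMessage = label;
  }

  clearStatus(): void {
    this.foregroundStatusMessage = null;
  }

  onBlockingChange(blocking: boolean): void {
    if (blocking) {
      if (this.foregroundStatusMessage === null) {
        return; // nothing in the foreground to relabel
      }
      // Stash only on the first blocking stage; later stages
      // (pruning, compacting) already show "Compacting...".
      if (this.foregroundStatusBeforeBlocking === null) {
        this.foregroundStatusBeforeBlocking = this.foregroundStatusMessage;
      }
      this.showLoader("Compacting...");
      return;
    }

    const stashed = this.foregroundStatusBeforeBlocking;
    this.foregroundStatusBeforeBlocking = null;
    // Restore only while a loader is still visible, so a loader that was
    // cleared during the block is not resurrected.
    if (stashed !== null && this.foregroundStatusMessage !== null) {
      this.showLoader(stashed);
    }
  }
}
```

Clearing the stash on every unblock also prevents a label from one compaction leaking into the next.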


…validator

After correcting the schema_version label to 'ATIF-v1.4', audit against the official spec (https://www.harborframework.com/docs/agents/trajectory-format) surfaced three remaining gaps. This commit closes them.

- StepMetrics now exposes the full v1.4 optional surface:
  - logprobs: number[]
  - prompt_token_ids: number[] (added in v1.4)
  - completion_token_ids: number[] (added in v1.3)

  These remain undefined unless the provider populates them, so existing output is bit-for-bit identical for callers that don't set the new fields.
- TrajectoryJson.final_metrics now aggregates total_cost_usd across step metrics with the same null-when-absent semantics as the other token totals. The ATIF compliance test suite already expected this field in ATIF_FINAL_METRICS_ALLOWED_FIELDS; the emission side just caught up to the test.
- TrajectoryJson.extra is typed as the approval/compaction/interrupt container intersected with Record<string, unknown> so downstream additions to extra.* do not require breaking type changes.
- packages/cea/benchmark/test_trajectory.py now:
  - Calls harbor.utils.trajectory_validator.TrajectoryValidator when the harbor package is importable (auto-skips when Harbor is not installed so local developer workflows keep working).
  - Enforces per-step metric shapes: numeric fields (prompt_tokens, completion_tokens, cached_tokens, cost_usd) must be numbers; id fields (logprobs, prompt_token_ids, completion_token_ids) must be arrays.
  - Prints total_cost_usd in the summary block and reports whether the Harbor validator was used or skipped.
- New tests:
  - finalize() aggregates total_cost_usd across step metrics
  - finalize() returns null total_cost_usd when no step reports cost
  - finalize() preserves all v1.4 optional metric fields (logprobs, prompt_token_ids, completion_token_ids)

Verified: typecheck, ultracite check, full vitest (569 tests), and turbo build all pass.
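
The null-when-absent aggregation for `total_cost_usd` can be sketched as follows. `StepMetricsSketch`, `sumOrNull`, and `aggregateTotalCost` are hypothetical names; the real `finalize()` in TrajectoryCollector may be structured differently.

```typescript
// Hypothetical subset of the per-step metrics; all fields are optional.
interface StepMetricsSketch {
  prompt_tokens?: number;
  completion_tokens?: number;
  cost_usd?: number;
}

// Sum the values that are present; return null when no step reported any.
function sumOrNull(values: Array<number | undefined>): number | null {
  const present = values.filter((v): v is number => typeof v === "number");
  if (present.length === 0) {
    return null;
  }
  return present.reduce((total, value) => total + value, 0);
}

function aggregateTotalCost(steps: StepMetricsSketch[]): number | null {
  return sumOrNull(steps.map((step) => step.cost_usd));
}
```

Returning `null` rather than `0` keeps "no provider reported cost" distinguishable from "the run was free".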

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52acdd4be9


chatgpt-codex-connector · 2026-04-21T08:15:09Z

+      await config.onStreamStart?.(phase);
+
      const stream = await config.agent.stream(


Invoke onStreamStart after dispatching agent.stream

onStreamStart is awaited before config.agent.stream(...) is called, so consumers can observe a false "stream started" signal when stream creation fails or stalls. This is especially problematic for telemetry/hooks that treat this callback as the request-dispatch boundary, because it shifts latency measurements earlier and can report an in-flight state even when no request was actually sent.


cubic-dev-ai

1 issue found across 5 files (changes from recent commits).


<file name="packages/cea/benchmark/test_trajectory.py">

<violation number="1" location="packages/cea/benchmark/test_trajectory.py:96">
P2: Per-step numeric metric validation also accepts booleans as numbers, allowing invalid metric payloads to pass.</violation>
</file>


Six issues raised across three AI review bots on PR #117:

BUG (Gemini/Codex/Cubic - Medium): turn-start and onStreamStart fired before agent.stream() was invoked, breaking the documented contract 'fires immediately after agent.stream is invoked'. A stream-creation failure could still emit a false 'stream started' lifecycle event and skew telemetry for consumers that treat turn-start as the network dispatch boundary. Moved the emission/hook in both the headless runner and the TUI's runSingleStreamTurn to run after agent.stream() successfully returns.

BUG (Gemini/Codex - High): Background startup usage probe raced with the first turn's per-turn probe; both paths wrote to messageHistory.updateActualUsage with no revision guard. A stale startup probe (based on an empty message list) could clobber newer usage data and underestimate context pressure, mis-sizing maxOutputTokens and compaction decisions. Added a generation token on measureUsageIfAvailable: probes whose generation is no longer current silently drop their result.

BUG (Codex - Medium): onBlockingChange captures prior spinner text to restore after blocking compaction, but the orchestrator can emit multiple blocking:true stages for a single compaction (starting -> pruning -> compacting). The stash was being overwritten every stage, so by unblock time the stashed value was 'Compacting' itself. Only stash when foregroundStatusBeforeBlocking is null.

BUG (Cubic - Medium): When the first stream part arrived during a blocking compaction, clearStatus() cleared foregroundStatusMessage, but the unblock path still called showLoader on the stashed value, resurrecting a stale 'Processing...' spinner on what should be a clean UI. Guard the restoration with a foregroundStatusMessage null check so a cleared loader stays cleared.

BUG (Codex - Low): updateHeader() used to run right after onSetup to reflect any async header/footer state the consumer initialised there. When the startup probe became non-blocking, the updateHeader call was accidentally dropped. Restored it so header metadata renders before the idle session begins.

BUG (Cubic - Low): Python validator accepted booleans as numbers for per-step and final metric values because isinstance(True, int) is True in Python. Added _is_real_number and _is_real_int helpers that explicitly exclude bool, matching ATIF-v1.4 expectations.

New tests:

- 'does not emit turn-start when agent.stream() rejects before dispatch'
- 'emits turn-start after agent.stream() succeeds (before first chunk)'

Verified: typecheck, ultracite check, 574 tests (+5 headless), build all pass. Every review-flagged issue has been fixed or explicitly addressed in the commit above or earlier commits on this branch.

minpeter · 2026-04-21T08:28:47Z

Addressed all automated reviewer feedback (commit `245214f`)

Thanks to @gemini-code-assist, @chatgpt-codex-connector, and @cubic-dev-ai for catching these. Every flagged issue has been fixed.

Resolved

Issue	Severity	Reviewers	Fix
`turn-start` / `onStreamStart` fires before `agent.stream()` dispatch	Medium	Gemini, Codex, Cubic	Moved emission and hook invocation to strictly after `agent.stream()` resolves, in both `runner.ts` and `runSingleStreamTurn`. Stream-creation failures no longer produce a false lifecycle signal.
Background startup probe races with per-turn probe	High	Gemini, Codex	Added a generation token on `measureUsageIfAvailable`; a stale probe drops its result when its generation is no longer current.
Stale "Processing…" spinner resurrecting after blocking compaction	Medium	Cubic	Guard the unblock-path `showLoader` with a `foregroundStatusMessage !== null` check so a cleared loader stays cleared.
Multi-stage blocking compaction overwrites the stashed label	Medium	Codex	Only stash the original label on first `blocking: true` entry (`foregroundStatusBeforeBlocking === null`), never on subsequent stages.
`updateHeader()` dropped after `onSetup` when probe became non-blocking	Low	Codex	Restored the call right after `await config.onSetup?.()`.
Python validator accepted `bool` as number	Low	Cubic	Added `_is_real_number` / `_is_real_int` helpers that exclude `bool`.
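
The two spinner fixes in the table above (stash only on the first `blocking: true` stage, and restore only when the foreground status is still set) can be pictured with a small model. This is a minimal sketch, not the code in `agent-tui.ts`: the `LoaderModel` class, the separate `spinnerText` field, and the `stageLabel` parameter are assumptions made for illustration; only the names `foregroundStatusMessage`, `foregroundStatusBeforeBlocking`, `showLoader`, `clearStatus`, and `onBlockingChange` come from the PR text.

```typescript
// Hypothetical model of the loader state; the real TUI state shape is not shown in this PR thread.
class LoaderModel {
  spinnerText: string | null = null; // what is currently drawn (assumed field)
  foregroundStatusMessage: string | null = null; // label the turn asked for
  foregroundStatusBeforeBlocking: string | null = null; // stash for restore
  private blocking = false;

  showLoader(label: string): void {
    this.foregroundStatusMessage = label;
    this.spinnerText = label;
  }

  clearStatus(): void {
    this.foregroundStatusMessage = null;
    // While a blocking stage owns the spinner, leave its label on screen.
    if (!this.blocking) {
      this.spinnerText = null;
    }
  }

  onBlockingChange(blocking: boolean, stageLabel = "Compacting"): void {
    this.blocking = blocking;
    if (blocking) {
      // Fix 1: stash only on the first blocking stage, never on later ones.
      if (this.foregroundStatusBeforeBlocking === null) {
        this.foregroundStatusBeforeBlocking = this.spinnerText;
      }
      this.spinnerText = stageLabel;
      return;
    }
    const stashed = this.foregroundStatusBeforeBlocking;
    this.foregroundStatusBeforeBlocking = null;
    // Fix 2: restore only if the foreground status was not cleared meanwhile.
    this.spinnerText =
      stashed !== null && this.foregroundStatusMessage !== null ? stashed : null;
  }
}
```

With both guards in place, a starting -> pruning -> compacting sequence restores the original label, and a `clearStatus()` that lands mid-compaction leaves the UI clean after unblock.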

Tests added

does not emit turn-start when agent.stream() rejects before dispatch
emits turn-start after agent.stream() succeeds (before first chunk)
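
The ordering that these two tests pin down can be sketched as follows. This is an illustrative reduction under stated assumptions, not the code in `runner.ts`: `dispatchTurnSketch`, `startStream`, and `emit` are hypothetical names standing in for the runner's internals.

```typescript
// Hypothetical event shape; the PR only states that the event has type "turn-start".
type TurnStartEvent = { type: "turn-start" };

// Emit the lifecycle event only after stream creation has succeeded.
async function dispatchTurnSketch<S>(
  startStream: () => Promise<S>, // stands in for agent.stream()
  emit: (event: TurnStartEvent) => void,
): Promise<S> {
  // If this rejects, control never reaches emit(), so no false signal is sent.
  const stream = await startStream();
  emit({ type: "turn-start" });
  return stream;
}
```

A rejected `startStream` propagates to the caller with zero events emitted; a resolved one yields exactly one `turn-start` before any chunk is read.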

Verification

pnpm run typecheck — 6/6 packages pass
pnpm run check (ultracite) — clean
pnpm run test — 574 tests pass (headless +5 new)
pnpm run build — 6/6 packages succeed

Re-request review when ready.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 245214f470

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-21T08:34:40Z

+    showLoader("Processing...");
+
    const preparedTurn = await prepareMessages(phase);


Guard processing spinner with finally cleanup

showLoader("Processing...") now runs before prepareMessages(phase), but the try/finally that calls clearStatus() starts later. If prepareMessages throws (for example from onBeforeTurn, measureUsage, or compaction checks), the function exits without hitting cleanup, so the foreground spinner remains stuck even though the turn failed and an error message is shown.
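
One way to picture the cleanup scope this comment asks for is to start the loader inside the same `try` whose `finally` clears it. The sketch below is an assumption-laden reduction: `runTurnSketch`, the `StatusSketch` object, and the `stream` parameter are hypothetical and only mirror the shape of `runSingleStreamTurn` as described in this thread.

```typescript
// Hypothetical stand-in for the TUI status line.
type StatusSketch = { current: string | null };

async function runTurnSketch(
  status: StatusSketch,
  prepareMessages: () => Promise<string[]>, // may throw (hooks, usage probe, compaction)
  stream: (messages: string[]) => Promise<void>,
): Promise<void> {
  try {
    // The loader is shown inside the guarded scope, so every exit path clears it.
    status.current = "Processing...";
    const prepared = await prepareMessages();
    status.current = "Working...";
    await stream(prepared);
  } finally {
    status.current = null; // plays the role of clearStatus()
  }
}
```

If `prepareMessages` throws, the error still reaches the caller, but the status is already reset by the time it does.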


minpeter · 2026-04-21T08:35:12Z

@cubic review

cubic-dev-ai · 2026-04-21T08:35:23Z

@cubic review

@minpeter I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

1 issue found across 16 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/tui/src/agent-tui.ts">

<violation number="1" location="packages/tui/src/agent-tui.ts:1561">
P1: `onStreamStart` callback failures currently break the turn flow; guard this hook so observer errors don’t abort a valid stream.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

An observer that throws from onStreamStart (or onFirstStreamPart) used to propagate the error and abort an otherwise-valid stream. This violates the 'observer-only' intent of the hooks and makes a buggy telemetry/logging callback strong enough to break a production turn.

- harness loop.ts: introduce invokeObserverHook(hook, hookName, ctx). Errors are console.error-logged and swallowed so the stream iteration always proceeds. Applied to both onStreamStart and onFirstStreamPart.
- headless runner.ts: wrap config.onStreamStart in try/catch with the same logging pattern. The turn-start emission itself stays in place, so lifecycle consumers are unaffected.
- tui agent-tui.ts: wrap config.onStreamStart in try/catch; the 'Working...' loader and the rest of runSingleStreamTurn proceed even when the consumer's callback throws.
- LoopHooks docstrings now document the 'observer-only contract': errors are swallowed after being logged. This is a public API guarantee so consumers can safely rely on it (and don't need to defensively wrap their callbacks).
- New headless test: 'continues streaming when onStreamStart throws (observer errors are isolated)' confirms the stream completes, the turn-start event is still emitted, and the thrown error is logged via console.error.

Also repairs a drop of LoopHooks.onToolCall from types.ts that had slipped through an earlier edit; loop.ts still destructured it, so typecheck was surfacing the regression now.

Verified: typecheck, ultracite check, 575 tests (+1 headless), build all pass.

minpeter · 2026-04-21T09:13:42Z

Addressed Cubic P1: observer-hook error isolation (commit `d855e36`)

Thanks @cubic-dev-ai — a callback that throws from onStreamStart (or its sibling onFirstStreamPart) was indeed strong enough to abort an otherwise-valid stream. That violates the observer-only intent of these hooks.

Fix

packages/harness/src/loop.ts: added an invokeObserverHook(hook, hookName, ctx) helper that try/catch-es the call, logs via console.error with a [harness] prefix, and proceeds. Applied to both onStreamStart and onFirstStreamPart in runAgentLoop.
packages/headless/src/runner.ts: wrapped the onStreamStart call in try/catch with the same log-and-continue pattern; the turn-start lifecycle event still fires.
packages/tui/src/agent-tui.ts: wrapped the onStreamStart call in runSingleStreamTurn; the Working... loader and the subsequent stream render proceed.
LoopHooks docstrings: now explicitly document the "observer-only contract" so consumers can rely on it without defensively wrapping their callbacks.
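
The log-and-continue pattern described in this list can be sketched roughly as below. It is a minimal illustration, not the helper in `loop.ts`: the `log` parameter is an assumption added here so the behaviour can be observed without touching `console.error`, and whether the real helper awaits the hook is not stated in this thread.

```typescript
// Hypothetical hook type; the real LoopHooks signatures live in the harness package.
type ObserverHookSketch<Ctx> = (ctx: Ctx) => void | Promise<void>;

async function invokeObserverHookSketch<Ctx>(
  hook: ObserverHookSketch<Ctx> | undefined,
  hookName: string,
  ctx: Ctx,
  log: (...args: unknown[]) => void = console.error, // injectable for the example only
): Promise<void> {
  if (!hook) {
    return;
  }
  try {
    // Awaiting covers both synchronous throws and rejected promises.
    await hook(ctx);
  } catch (error) {
    // Observer errors are reported and swallowed so the stream keeps going.
    log(`[harness] ${hookName} threw; continuing`, error);
  }
}
```

Because the helper never rethrows, a caller can place it directly on the stream path without its own try/catch.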

New test

continues streaming when onStreamStart throws (observer errors are isolated) — confirms stream completion, turn-start emission, and the console.error log path.

Also repaired in the same commit

An earlier edit had accidentally dropped LoopHooks.onToolCall from types.ts while deduplicating an onFirstStreamPart docstring. loop.ts still destructured onToolCall, so tsc correctly flagged the regression after this edit exposed it. Restored the field with its original signature.

Verification

pnpm run typecheck — 6/6 packages
pnpm run check — clean
pnpm run test — 575 tests (headless +1)
pnpm run build — 6/6 packages

cubic-dev-ai · 2026-04-21T09:14:03Z

Addressed Cubic P1: observer-hook error isolation (commit d855e36)

Thanks @cubic-dev-ai — a callback that throws from onStreamStart (or its sibling onFirstStreamPart) was indeed strong enough to abort an otherwise-valid stream. That violates the observer-only intent of these hooks.

Fix

...

@minpeter The implementation in d855e36 correctly isolates observer hooks and resolves the P1 issue. By wrapping these calls in try/catch blocks and logging errors via console.error instead of propagating them, you've ensured that a failure in a telemetry or UI callback won't crash the agent loop or TUI session.

Key highlights from the fix:

Harness Abstraction: The invokeObserverHook helper in packages/harness/src/loop.ts provides a clean, reusable pattern for safely invoking onStreamStart and onFirstStreamPart.
Contract Documentation: Updating the LoopHooks docstrings to explicitly state the "observer-only contract" is a great touch for future-proofing the API.
Robust Testing: The new headless test case successfully verifies that the stream completes and the turn-start event is still emitted even when a hook throws.
Regression Fix: Restoring the accidentally dropped onToolCall field in types.ts ensures the public API remains intact.

This approach perfectly aligns with the intent of these hooks as non-blocking observers. For more details on managing these types of learnings or custom rules in the future, you can check the Cubic documentation.

minpeter · 2026-04-21T09:20:50Z

@cubic review

cubic-dev-ai · 2026-04-21T09:21:01Z

@cubic review

@minpeter I have started the AI code review. It will take a few minutes to complete.

Future edits to the headless trajectory pipeline must preserve ATIF v1.4 compliance. Until now the contract lived only in the changeset and the public README, where it is easy to drift on in a hurry. This commit plants the contract directly on the load-bearing code and tests so a future maintainer cannot relax it by accident.

Contract points now documented on-source:
- trajectory-collector.ts module header: enumerates the six invariants (schema_version literal, step_id sequence, steps[*].source set, extra.* persistence rule, final_metrics null-when-absent, SDK-only metrics) and bounds the scope (persisted trajectory only, not the JSONL stream).
- TrajectoryJson / AtifStep interface JSDocs: explain the spec-version bump discipline and that step-level fields require an ATIF v1.4 definition (or belong under extra).
- runner.ts collectTrajectoryEvent: clarifies that the default-case drop is INTENTIONAL (non-ATIF types stay stream-only) and that new cases must ship with a matching extra.* path in finalize().
- runner.ts runHeadless: inline comment pinpoints where the two output surfaces (JSONL stdout vs ATIF trajectory.json) diverge, so the 'headless has two outputs' realisation does not need to be reconstructed every time.
- types.ts TrajectoryEvent union: rewrites the one-line 'complete union' docstring into a contract that forces every new event type to pick 'extra.* persistence' or 'drop' at design time, and forbids promotion to a top-level ATIF field or steps[*].source value.
- packages/headless/AGENTS.md: new 'ATIF v1.4 COMPLIANCE (persisted trajectory.json)' section listing the same invariants as the authoritative engineering contract.
- atif-events.test.ts module header: declares the suite as the executable compliance contract with a pointer to the Python validator that must be updated in lock-step.

No runtime behaviour changes.

Verified: typecheck, ultracite check, 575 tests, build all pass.

minpeter · 2026-04-21T09:28:56Z

@cubic review

cubic-dev-ai · 2026-04-21T09:29:05Z

@cubic review

@minpeter I have started the AI code review. It will take a few minutes to complete.

Second pass through the changeset against the full commit list surfaced three small holes and one missing package header:
- 'plugsuits' (packages/cea) was missing from the changeset header. The PR actually touches its user-facing CLI help text ('--atif' now mentions ATIF-v1.4) and the benchmark validator pipeline, so cea needs a patch bump in lock-step with the other packages.
- Added a dedicated CEA bullet describing the --atif wording change and the Python validator pipeline improvements (Harbor validator auto-call + stricter shape check + bool rejection).
- Credited the Cubic-reported bool-in-number bug in the 'PR review feedback' section. The fix was already described under the ATIF v1.4 validator improvements bullet, but the review-attribution list had dropped the sixth item.
- Noted that the fifth-commit fix quietly restored a dropped LoopHooks.onToolCall declaration. Consumers that rely on this hook had been running against a destructure with no matching type; this is now type-safe again and the changeset mentions it so anyone diffing public types sees why that line moved.

No code changes and no behaviour changes; the changeset content now matches the actual commit range on this branch.

cubic-dev-ai

1 issue found across 16 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/harness/src/types.ts">

<violation number="1" location="packages/harness/src/types.ts:173">
P2: Expose the current stream part in `onFirstStreamPart`, or remove the advice about filtering by part type. The callback only receives `LoopContinueContext`, so consumers cannot distinguish framing chunks from visible output as the docs promise.</violation>
</file>

The docstring for LoopHooks.onFirstStreamPart told consumers to filter on part type (to distinguish framing chunks like 'start' and 'text-start' from visible output), but the callback only received LoopContinueContext, with no part to filter on. The contract was broken at the type level: the advice was physically impossible to follow.

This commit fixes the hook before adoption. Since onFirstStreamPart was added earlier in the same PR and has zero consumers anywhere in the monorepo (verified by exhaustive grep across packages/harness, packages/tui, packages/headless, packages/cea, packages/minimal-agent, packages/tgbot, scripts/, and the runtime/session adapter layer), the signature change is type-only and requires no migration.

Changes:
- types.ts: onFirstStreamPart signature is now (part: TextStreamPart<ToolSet>, context: LoopContinueContext) => void | Promise<void>. Docstring rewritten to describe what the hook actually does: fires once per iteration on the very first part regardless of visibility; the consumer decides what to do based on part.type but the hook does not re-fire on a later visible part.
- types.ts: imports TextStreamPart from 'ai' (alongside the existing ToolSet / ToolCallPart imports).
- index.ts: re-exports TextStreamPart from '@ai-sdk-tool/harness' so consumers don't have to import it from 'ai' separately. This matches how the TUI package already uses the exact type (TextStreamPart<ToolSet>) for its isVisibleStreamPart filter.
- loop.ts: generalises invokeObserverHook to accept variadic args (<Args extends readonly unknown[]>), so both (context)-only hooks (onStreamStart) and (part, context) hooks (onFirstStreamPart) share the same error-isolation helper. Call site at the for-await entry now passes the current 'part' alongside the context.
- loop.test.ts: four new regression tests covering:
  * first-part single-fire semantics (tool-call in iteration 0)
  * per-iteration firing (two iterations, fires twice with correct iteration numbers)
  * empty-stream skip (no parts → hook never fires)
  * observer-error isolation (thrown error is logged and swallowed, stream iteration continues normally)

Oracle consulted before changing the public signature: recommended Option A ('pre-adoption correction') over Option B (relax docstring) given the zero-consumer inventory. The justification is that the hook is one commit old and has not been released, so fixing it now is materially different from breaking an established API in a patch release.

Verified: typecheck (6/6), ultracite check (277 files clean), full test suite (1235 tests, harness +4 new), build (6/6).

minpeter · 2026-04-21T09:45:36Z

Addressed Cubic P2: `onFirstStreamPart` signature corrected (commit `91008d0`)

Thanks @cubic-dev-ai — the docstring on LoopHooks.onFirstStreamPart told consumers to filter on part type to distinguish framing chunks from visible output, but the callback never received the part. The contract was broken at the type level.

Decision

Consulted Oracle on whether to (A) fix the signature or (B) relax the docstring. Oracle recommended Option A: pre-adoption correction based on:

The hook was introduced earlier in this same PR — one commit old, not yet released.
Exhaustive consumer inventory found zero callers across packages/harness, packages/tui, packages/headless, packages/cea, packages/minimal-agent, packages/tgbot, scripts/, and all runtime/session adapter layers. The hook type-only existed in its declaration and one internal call site.
Option B would leave the hook's primary purpose (clearing a prompt-processing indicator when real output starts) measurably weaker.

Fix

New signature:

onFirstStreamPart?: (
  part: TextStreamPart<ToolSet>,
  context: LoopContinueContext
) => void | Promise<void>;

Uses the exact type that flows through stream.fullStream — TextStreamPart<ToolSet> from the Vercel AI SDK. The TUI package already uses the same type for isVisibleStreamPart.
TextStreamPart is now re-exported from @ai-sdk-tool/harness root so consumers don't need to pull it from ai directly.
invokeObserverHook generalised to <Args extends readonly unknown[]> so both (context) and (part, context) hooks share the same error-isolation helper.
Docstring rewritten: clarifies that the hook fires on the very first part regardless of visibility; filtering inside the callback lets consumers decide what to do, it does NOT cause the hook to re-fire on a later visible part. This was an explicit Oracle recommendation to avoid a subtle semantic trap.
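
The fire-once semantics and the variadic helper described in this list can be sketched together. This is an illustration under assumptions, not the harness code: `PartSketch` and `CtxSketch` are local stand-ins for `TextStreamPart<ToolSet>` and `LoopContinueContext`, and `consumeStreamSketch` is a hypothetical reduction of the for-await loop.

```typescript
// Local stand-ins so the example does not depend on the 'ai' package.
type PartSketch = { type: string };
type CtxSketch = { iteration: number };

// Variadic form: one helper serves (context) hooks and (part, context) hooks.
async function invokeObserverHookVariadic<Args extends readonly unknown[]>(
  hook: ((...args: Args) => void | Promise<void>) | undefined,
  hookName: string,
  ...args: Args
): Promise<void> {
  if (!hook) {
    return;
  }
  try {
    await hook(...args);
  } catch (error) {
    console.error(`[harness] ${hookName} threw; continuing`, error);
  }
}

async function consumeStreamSketch(
  parts: AsyncIterable<PartSketch>,
  ctx: CtxSketch,
  onFirstStreamPart?: (part: PartSketch, ctx: CtxSketch) => void | Promise<void>,
): Promise<number> {
  let sawFirstPart = false;
  let count = 0;
  for await (const part of parts) {
    if (!sawFirstPart) {
      // Fires on the very first part, visible or not, and never again this iteration.
      sawFirstPart = true;
      await invokeObserverHookVariadic(onFirstStreamPart, "onFirstStreamPart", part, ctx);
    }
    count += 1;
  }
  return count;
}
```

A consumer that only cares about visible output can inspect `part.type` inside the callback, but, as the docstring warns, returning early on a framing part does not make the hook fire again later.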

Tests added (4 new)

In loop.test.ts:

invokes onFirstStreamPart exactly once with the first emitted part
invokes onFirstStreamPart on each iteration when the stream has content
skips onFirstStreamPart when the stream yields no parts
isolates onFirstStreamPart observer errors from the stream flow

Verification

pnpm run typecheck — 6/6 packages (confirms TextStreamPart<ToolSet> propagates correctly)
pnpm run check (ultracite) — 277 files clean
pnpm run test — 1235 tests pass (harness +4 new for a total of 641)
pnpm run build — 6/6 packages

cubic-dev-ai · 2026-04-21T09:46:02Z

Addressed Cubic P2: onFirstStreamPart signature corrected (commit 91008d0)

Thanks @cubic-dev-ai — the docstring on LoopHooks.onFirstStreamPart told consumers to filter on part type to distinguish framing chunks from visible output, but the callback never received the part. The contract was broken at the type level.

Decision

...

@minpeter The update in 91008d0 effectively resolves the P2 issue by aligning the onFirstStreamPart signature with its documented purpose. Fixing this now as a "pre-adoption correction" was the right call—providing the part directly is essential for consumers to distinguish between framing events (like start) and the actual content that should clear a "Processing..." indicator.

Key improvements in this fix:

Type Safety: Using TextStreamPart<ToolSet> and re-exporting it from @ai-sdk-tool/harness makes the hook easy to use without requiring consumers to manage extra dependencies on ai.
Variadic Helper: Generalizing invokeObserverHook in loop.ts to support variadic arguments is a clean way to maintain error isolation across different hook signatures (onStreamStart vs onFirstStreamPart).
Semantic Clarity: The updated docstring correctly warns about the "first part" behavior, ensuring developers understand it won't re-fire on the first visible part if a framing part arrived first.
Test Coverage: The new cases in loop.test.ts (especially for empty streams and per-iteration firing) provide solid verification for the loop logic.

This change completes the "prompt processing" visibility feature by making the underlying hooks actually usable for their intended purpose. For more on Cubic's review standards, you can refer to the Cubic documentation.

minpeter · 2026-04-21T09:50:44Z

@coderabbitai review

minpeter · 2026-04-21T09:50:45Z

@codex review

minpeter · 2026-04-21T09:50:47Z

@cubic-dev-ai review

coderabbitai · 2026-04-21T09:50:50Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cubic-dev-ai · 2026-04-21T09:51:17Z

@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91008d0191

chatgpt-codex-connector · 2026-04-21T09:55:09Z

+    showLoader("Processing...");
+
    const preparedTurn = await prepareMessages(phase);


Move processing spinner init inside cleanup scope

showLoader("Processing...") is invoked before prepareMessages, but clearStatus() only runs in the later try/finally. If prepareMessages throws (for example from onBeforeTurn, usage measurement, or compaction checks), execution exits before the finally, leaving the spinner stuck on screen even though an error is reported in processInput. This creates a persistent false "in progress" UI state after turn-prep failures.


coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (1)

packages/headless/src/__tests__/atif-events.test.ts (1)
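586-687: The new v1.4 compliance tests look appropriate.

The total_cost_usd summation test guards against floating-point error with toBeCloseTo(0.2, 10), which is good.

The total_cost_usd null case makes the "cost not reported ≠ 0" semantics explicit.

The logprobs/prompt_token_ids/completion_token_ids preservation test accurately verifies the { ...rest, metrics } spread behavior of toAtifStep (trajectory-collector.ts:187-198).

If new optional fields are added to metrics in the future, note that the ATIF_STEP_ALLOWED_FIELDS whitelist test does not inspect keys inside metrics. If whitelist-level validation of v1.4 extension fields is needed, a separate Set could be added (optional).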
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/headless/src/__tests__/atif-events.test.ts` around lines 586 - 687,
The tests for TrajectoryCollector (finalize()) correctly validate total_cost_usd
aggregation and null behavior, and preservation of optional ATIF-v1.4 fields; no
code change required—leave TrajectoryCollector.finalize, toAtifStep, and the
ATIF_STEP_ALLOWED_FIELDS logic in trajectory-collector.ts as-is, but if you
later need stricter whitelist checks for nested metrics keys add a separate Set
for allowed metric keys and update toAtifStep to validate metrics against that
Set.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/cea/benchmark/test_trajectory.py`:
- Around line 13-18: The helper _is_real_number currently only excludes bools so
float('nan') and float('inf') pass; update _is_real_number to also reject
non-finite floats by using math.isfinite (i.e., return isinstance(value, (int,
float)) and not isinstance(value, bool) and (math.isfinite(value) if
isinstance(value, float) else True)) and add the required import for math; leave
_is_real_int unchanged.

In `@packages/headless/AGENTS.md`:
- Around line 16-23: Update the documentation in AGENTS.md to correct the
lifecycle-annotation persistence behavior: change the paragraph so it states
that only `turn-start` and `error` are JSONL-only and dropped from the persisted
trajectory, while `approval`, `compaction`, and `interrupt` annotations are
preserved by the runner and recorded under `extra.approval_events`,
`extra.compaction_events`, and `extra.interrupt_events` (as implemented by
`collectTrajectoryEvent`); remove the blanket claim that those three annotation
types are dropped and add brief note pointing readers to
`collectTrajectoryEvent` for exact field names.

In `@packages/headless/src/trajectory-collector.ts`:
- Around line 75-79: The TrajectoryJson.extra type is too permissive because the
Record<string, unknown> lets arbitrary lifecycle buckets bypass the ATIF
persistence contract; remove the open record and restrict extra to only the
known persisted buckets (approval_events?: ApprovalEvent[]; compaction_events?:
CompactionEvent[]; interrupt_events?: InterruptEvent[]), then update/verify
TrajectoryCollector.finalize() and the TrajectoryCollector implementation to
only write those known buckets and require adding a new collector method
whenever a new lifecycle bucket is introduced.
- Around line 16-20: The doc comment is inconsistent about which lifecycle
events are persisted; update the wording in trajectory-collector.ts to clearly
state that approval/compaction/interrupt lifecycle annotations are persisted
under extra.approval_events, extra.compaction_events and extra.interrupt_events
on the trajectory, while events like turn-start and error are not persisted as
step sources and live only in the JSONL stream; also add the guideline note that
adding a new persisted lifecycle bucket requires implementing a corresponding
collector method.

In `@packages/headless/src/types.ts`:
- Around line 4-13: The comment says the docblock incorrectly states that
lifecycle annotations (`approval`, `compaction`, `interrupt`, `turn-start`) are
JSONL-only; in reality `approval`, `compaction`, and `interrupt` are persisted
to ATIF via TrajectoryCollector as extra.approval_events,
extra.compaction_events, and extra.interrupt_events. Update the comment in
packages/headless/src/types.ts to restrict "JSONL-only" to only transient
signals (e.g., `turn-start`) and explicitly note that `approval`, `compaction`,
and `interrupt` are forwarded by TrajectoryCollector into ATIF `extra.*` fields
while `turn-start` remains transient.

In `@packages/tui/src/agent-tui.ts`:
- Around line 1041-1064: Capture the current history revision at the start of
measureUsageIfAvailable (e.g., const startRev =
config.messageHistory.getRevision()) and after awaiting
config.measureUsage(messages) verify the revision is unchanged before calling
config.messageHistory.updateActualUsage; if startRev !==
config.messageHistory.getRevision() treat the probe as stale and return false.
Do this in the measureUsageIfAvailable path that updates usage (the block using
usageProbeGeneration and calling config.messageHistory.updateActualUsage) and
apply the same revision-capture-and-check to the other similar spot (the second
occurrence around where updateActualUsage is called) so background probes cannot
overwrite newer history.

---

Nitpick comments:
In `@packages/headless/src/__tests__/atif-events.test.ts`:
- Around line 586-687: The tests for TrajectoryCollector (finalize()) correctly
validate total_cost_usd aggregation and null behavior, and preservation of
optional ATIF-v1.4 fields; no code change required—leave
TrajectoryCollector.finalize, toAtifStep, and the ATIF_STEP_ALLOWED_FIELDS logic
in trajectory-collector.ts as-is, but if you later need stricter whitelist
checks for nested metrics keys add a separate Set for allowed metric keys and
update toAtifStep to validate metrics against that Set.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 983e34d5-0965-43c7-81a5-3cfa57925765

📥 Commits

Reviewing files that changed from the base of the PR and between 88b7197 and 91008d0.

📒 Files selected for processing (18)

.changeset/prompt-processing-indicators.md
packages/cea/benchmark/AGENTS.md
packages/cea/benchmark/scorer.py
packages/cea/benchmark/test_trajectory.py
packages/cea/src/entrypoints/main.ts
packages/harness/src/index.ts
packages/harness/src/loop.test.ts
packages/harness/src/loop.ts
packages/harness/src/types.ts
packages/headless/AGENTS.md
packages/headless/README.md
packages/headless/src/__tests__/atif-events.test.ts
packages/headless/src/runner.test.ts
packages/headless/src/runner.ts
packages/headless/src/trajectory-collector.ts
packages/headless/src/types.ts
packages/tui/src/agent-tui.ts
packages/tui/src/stream-handlers.ts

coderabbitai · 2026-04-21T09:58:07Z

+  let usageProbeGeneration = 0;
+
  const measureUsageIfAvailable = async (
    messages: ModelMessage[]
  ): Promise<boolean> => {
    if (!config.measureUsage) {
      return false;
    }

+    usageProbeGeneration += 1;
+    const thisGeneration = usageProbeGeneration;
+
    const measured = normalizeUsageMeasurement(
      await config.measureUsage(messages)
    );
    if (!measured) {
      return false;
    }

+    if (thisGeneration !== usageProbeGeneration) {
+      return false;
+    }
+
    config.messageHistory.updateActualUsage({


⚠️ Potential issue | 🟠 Major

Please also invalidate the startup probe result by history revision.

usageProbeGeneration only guards against the case where a newer probe has started. If the background measureUsageIfAvailable([]) resolves right after user input has changed messageHistory, but before the next measureUsageIfAvailable(messagesForLLM) starts, usage measured against an empty message list can be applied to the current history. Capture getRevision() so that a history change between probe start and completion also marks the result as stale.

🛡️ Suggested fix

     usageProbeGeneration += 1;
     const thisGeneration = usageProbeGeneration;
+    const revisionAtStart = config.messageHistory.getRevision?.();

     const measured = normalizeUsageMeasurement(
       await config.measureUsage(messages)
     );
@@
-    if (thisGeneration !== usageProbeGeneration) {
+    if (
+      thisGeneration !== usageProbeGeneration ||
+      (revisionAtStart !== undefined &&
+        config.messageHistory.getRevision?.() !== revisionAtStart)
+    ) {
       return false;
     }

Also applies to: 1081-1083

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/tui/src/agent-tui.ts` around lines 1041 - 1064, Capture the current history revision at the start of measureUsageIfAvailable (e.g., const startRev = config.messageHistory.getRevision()) and after awaiting config.measureUsage(messages) verify the revision is unchanged before calling config.messageHistory.updateActualUsage; if startRev !== config.messageHistory.getRevision() treat the probe as stale and return false. Do this in the measureUsageIfAvailable path that updates usage (the block using usageProbeGeneration and calling config.messageHistory.updateActualUsage) and apply the same revision-capture-and-check to the other similar spot (the second occurrence around where updateActualUsage is called) so background probes cannot overwrite newer history.
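
The combined guard discussed in this comment (generation token plus history revision) can be exercised in isolation with a fake history. The sketch below is illustrative only: `HistorySketch`, `createGuardedProbe`, and the numeric token count are hypothetical simplifications, and the real `updateActualUsage` takes an object whose shape is not shown in full in this thread.

```typescript
// Hypothetical, reduced view of the message history used by the probe.
interface HistorySketch {
  getRevision(): number;
  updateActualUsage(tokens: number): void;
}

function createGuardedProbe(
  history: HistorySketch,
  measure: (messages: string[]) => Promise<number>,
): (messages: string[]) => Promise<boolean> {
  let generation = 0;
  return async (messages) => {
    generation += 1;
    const thisGeneration = generation;
    const revisionAtStart = history.getRevision();

    const tokens = await measure(messages);

    // Stale if a newer probe started, or if the history changed while measuring.
    if (thisGeneration !== generation || history.getRevision() !== revisionAtStart) {
      return false;
    }
    history.updateActualUsage(tokens);
    return true;
  };
}
```

A background probe started with an empty message list therefore cannot overwrite usage once the user's first message has bumped the revision, even if no newer probe has begun yet.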

cubic-dev-ai

3 issues found across 18 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/tui/src/agent-tui.ts">

<violation number="1" location="packages/tui/src/agent-tui.ts:1537">
P2: `Processing...` loader is shown before the guarded `try/finally`, so setup errors can leave a stuck spinner.</violation>
</file>

<file name="packages/headless/README.md">

<violation number="1" location="packages/headless/README.md:172">
P3: Clarify that only transient annotations like `turn-start` are dropped; approval/compaction/interrupt are still persisted under `extra.*` in `trajectory.json`.</violation>
</file>

<file name="packages/headless/AGENTS.md">

<violation number="1" location="packages/headless/AGENTS.md:21">
P3: Clarify that only `turn-start` is dropped; `approval`, `compaction`, and `interrupt` are persisted under `trajectory.extra.*`.</violation>
</file>

Cycle 1 triggered a fresh round of AI reviews. Consolidating every actionable finding into a single commit:

P1 (Oracle): `runHeadless` could persist an invalid zero-step ATIF trajectory when the stream failed before any step was emitted. Harbor's own validator rejects `steps: []`. `TrajectoryCollector.writeTo` now returns `boolean` and skips disk writes for zero-step runs instead of producing an invalid file.

P2 (Codex + Cubic): `showLoader('Processing...')` lived outside the `runSingleStreamTurn` try/finally, so a thrown `prepareMessages` / `onBeforeTurn` / usage probe / compaction check would leave the spinner stuck on screen. Moved the call into the `try` block so `clearStatus()` in `finally` cleans it up on every exit path.

P2 (CodeRabbit): the startup usage-probe guard only rejected results from stale generations, not results computed against a different history. Added a `messageHistory.getRevision()` capture-and-compare so a background probe resolving after user input no longer overwrites the new turn's real usage with empty-message baseline.

P2 (CodeRabbit): `TrajectoryJson.extra` was typed as `{approval_events?, compaction_events?, interrupt_events?} & Record<string, unknown>`, letting new lifecycle buckets type-check without a matching `TrajectoryCollector` method. Dropped the open-record intersection so the ATIF persistence contract is enforced structurally.

P2 (CodeRabbit): `test_trajectory.py`'s `_is_real_number` accepted NaN, Infinity, and -Infinity (all producible via `json.loads` on non-strict input). Added an explicit `math.isfinite()` check.

P3 (CodeRabbit + Cubic): documentation drift: several docstrings and markdown files claimed 'approval/compaction/interrupt are JSONL-only and dropped from trajectory.json'. That is wrong: they are persisted under `trajectory.extra.*` by the collector. Only `turn-start` and `error` are truly transient. Corrected in `packages/headless/AGENTS.md`, `packages/headless/README.md`, `packages/headless/src/types.ts`, the `packages/headless/src/trajectory-collector.ts` module JSDoc, and the root `AGENTS.md`.

Regression test added: `runner.test.ts` now asserts that zero-step runs with `atifOutputPath` set do NOT produce a file (does not write an invalid zero-step trajectory when the stream fails before any step).

Verified: typecheck (6/6), ultracite check (277 files clean), tests (1236 pass; headless +1 new = 65), build (6/6).
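The loader fix described above (P2, Codex + Cubic) can be sketched roughly as follows. This is a minimal illustration, not the shipped `runSingleStreamTurn`: the `StatusLine` interface and the `runTurn` function are hypothetical, and only the `showLoader`, `clearStatus`, and `prepareMessages` names come from the thread.

```typescript
// Hypothetical status surface; only showLoader/clearStatus are named in the PR.
interface StatusLine {
  showLoader(label: string): void;
  clearStatus(): void;
}

// Sketch of a single turn. The loader is shown inside `try`, so `finally`
// clears it even when local prep throws before the LLM request is sent.
async function runTurn(
  status: StatusLine,
  prepareMessages: () => Promise<string[]>,
  stream: (messages: string[]) => Promise<void>,
): Promise<void> {
  try {
    status.showLoader("Processing...");
    const messages = await prepareMessages();
    status.showLoader("Working...");
    await stream(messages);
  } finally {
    status.clearStatus();
  }
}
```

With this shape, a rejection from `prepareMessages` still propagates to the caller, but the spinner is cleared on the way out.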

minpeter · 2026-04-21T10:19:53Z

@coderabbitai review

minpeter · 2026-04-21T10:19:55Z

@codex review

minpeter · 2026-04-21T10:19:56Z

@cubic-dev-ai review

coderabbitai · 2026-04-21T10:19:59Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cubic-dev-ai · 2026-04-21T10:20:16Z

@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

No issues found across 19 files

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45a810f4dd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-21T10:26:30Z

+    if (this.steps.length === 0) {
+      return false;


Clear stale trajectory output on zero-step runs

When writeTo returns early on this.steps.length === 0, it leaves any existing outputPath file untouched; with a reused path (e.g., CEA defaults to trajectory.json), a zero-step run (such as an early abort/timeout before any step is emitted) can silently preserve and expose the previous run's trajectory as if it were current output. This is a data-integrity regression introduced by the new skip-write behavior; the zero-step path should delete/truncate the target file or force callers to handle cleanup when false is returned.
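One of the two remedies this comment names (deleting the target on zero-step runs) might look like the sketch below. It is an assumption-laden illustration: `TrajectoryStore`, `InMemoryStore`, and `writeTrajectory` are hypothetical names used so the example runs without touching disk, and the thread does not show which remedy, if any, was adopted.

```typescript
// Hypothetical storage abstraction so the sketch runs without touching disk.
interface TrajectoryStore {
  write(path: string, contents: string): void;
  remove(path: string): void;
}

class InMemoryStore implements TrajectoryStore {
  readonly files = new Map<string, string>();
  write(path: string, contents: string): void {
    this.files.set(path, contents);
  }
  remove(path: string): void {
    this.files.delete(path);
  }
}

// Returns true when a trajectory was written. On zero-step runs it skips the
// write (an empty `steps` array is invalid) and removes any file left behind
// by a previous run, so callers never read stale data from a reused path.
function writeTrajectory(
  store: TrajectoryStore,
  outputPath: string,
  steps: readonly unknown[],
): boolean {
  if (steps.length === 0) {
    store.remove(outputPath);
    return false;
  }
  store.write(outputPath, JSON.stringify({ steps }));
  return true;
}
```

Against the real filesystem, `remove` could plausibly map to Node's `fs.rmSync(outputPath, { force: true })`, which does not throw when the file is absent.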

Cycle 2: Cubic reported 'No issues found across 19 files'. Oracle's second audit surfaced three genuine items the bots missed:

1. Headless probe race (Oracle P2): `measureUsageIfAvailable` in the headless runner lacked the generation + revision guard that the TUI already had. A slow background probe could overwrite fresher post-compaction usage. Mirrored the TUI pattern using `messageHistory.getRevision?.()` with optional chaining (the headless message history interface declares `getRevision` as optional).
2. ATIF step source contract drift (Oracle P2): `trajectory-collector.ts` permits `steps[*].source = 'user' | 'agent' | 'system'` (Harbor ATIF v1.4 allows all three; system steps support observations since v1.2), but the bundled Python validator rejected `'system'` and the benchmark docs documented only `'user | agent'`. Aligned all three surfaces.
3. Root `README.md` drift (Oracle P3): headless event list omitted `turn-start` and did not point at Harbor's ATIF-v1.4 schema. Updated to match the current package docs.

Verified: typecheck (6/6), ultracite check (277 files clean), 1236 tests pass, build (6/6). No new tests required: the headless probe guard is covered by the same invariants the TUI guard already tests.
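A generation + revision guard of the kind described in item 1 could be sketched like this. The class and its method names (`UsageProbeGuard`, `capture`, `isCurrent`, `invalidate`) are hypothetical; the thread only confirms that `getRevision` is optional on the headless history interface and that both a generation and a revision are compared.

```typescript
// Hypothetical history shape; the thread only states that getRevision is optional.
interface MessageHistoryLike {
  getRevision?: () => number;
}

interface ProbeToken {
  generation: number;
  revision: number | undefined;
}

class UsageProbeGuard {
  private generation = 0;
  private readonly history: MessageHistoryLike;

  constructor(history: MessageHistoryLike) {
    this.history = history;
  }

  // Call when a new turn or compaction should invalidate in-flight probes.
  invalidate(): void {
    this.generation += 1;
  }

  // Capture the current state before starting the async probe.
  capture(): ProbeToken {
    return {
      generation: this.generation,
      revision: this.history.getRevision?.(),
    };
  }

  // A probe result is applied only if nothing changed while it was in flight.
  isCurrent(token: ProbeToken): boolean {
    return (
      token.generation === this.generation &&
      token.revision === this.history.getRevision?.()
    );
  }
}
```

A caller would take a token with `capture()` before awaiting the probe and discard the result when `isCurrent(token)` returns false. When `getRevision` is absent, both sides compare as `undefined`, so only the generation check applies.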

minpeter · 2026-04-21T10:30:02Z

@coderabbitai review

minpeter · 2026-04-21T10:30:03Z

@codex review

minpeter · 2026-04-21T10:30:05Z

@cubic-dev-ai review

coderabbitai · 2026-04-21T10:30:27Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cubic-dev-ai · 2026-04-21T10:30:32Z

@cubic-dev-ai review

@minpeter I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

No issues found across 20 files

…ype (cycle 3)

Cycle 3 Oracle audit flagged exactly one remaining issue: the changeset for PR #117 said in its 'features' bullet (line 14) that `TrajectoryJson.extra` was typed as an open record, then said in its 'cycle 1 follow-ups' bullet (line 30) that it was later narrowed back to the three canonical lifecycle buckets. The shipped code matches the NARROWED version, so the earlier bullet was stale.

Fixed by rewriting line 14 to reflect the final, shipped state: `extra` is a closed record of exactly `approval_events` / `compaction_events` / `interrupt_events`, and new lifecycle types must extend the interface explicitly. No runtime changes.

Verified: ultracite check clean, 1236 tests pass, build (6/6 full turbo).
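The difference between the open and the closed typing of `extra` can be illustrated with a small sketch. The event payload type is left opaque because the thread does not show it, and `TrajectoryExtra`, `OpenTrajectoryExtra`, and the `retry_events` key are hypothetical names chosen for the example.

```typescript
// Event payloads are left opaque here; their real shapes are not shown in the thread.
type LifecycleEvent = Record<string, unknown>;

// Closed record: only the three canonical buckets type-check.
interface TrajectoryExtra {
  approval_events?: LifecycleEvent[];
  compaction_events?: LifecycleEvent[];
  interrupt_events?: LifecycleEvent[];
}

// Earlier open form, shown for contrast: the intersection accepts any extra key.
type OpenTrajectoryExtra = TrajectoryExtra & Record<string, unknown>;

// Compiles: the hypothetical `retry_events` bucket slips through the open record.
const openExtra: OpenTrajectoryExtra = { approval_events: [], retry_events: [] };

// Compiles: only known buckets are used.
const closedExtra: TrajectoryExtra = { approval_events: [] };

// Would NOT compile against the closed interface (excess property check):
// const rejected: TrajectoryExtra = { retry_events: [] };
```

With the closed form, adding a new lifecycle bucket requires editing the interface, which is the structural enforcement the commit message describes.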

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8160c46acb

chatgpt-codex-connector · 2026-04-21T10:36:58Z

+    if (this.steps.length === 0) {
+      return false;


Clear stale ATIF output when skipping zero-step writes

Returning early when this.steps.length === 0 leaves any pre-existing file at outputPath untouched. If a run fails before emitting a step (for example, agent.stream() rejects before any user/agent step is recorded) and reuses the same atifOutputPath, downstream tooling will read stale trajectory data from an older run and silently report incorrect metrics. Ensure the path is removed or overwritten when zero-step persistence is skipped.

chatgpt-codex-connector · 2026-04-21T10:36:58Z

+  const runBackgroundStartupProbe = (): void => {
+    measureUsageIfAvailable([]).then(ignore, ignore);


Cancel background startup probe on TUI teardown

The startup usage probe is launched fire-and-forget and never tracked or canceled. In sessions where measureUsage performs network I/O, exiting the TUI before that probe resolves can keep the Node process alive until the probe completes, and the late result can still mutate history/header state after shutdown. Tie this probe to teardown (or an abort signal) so early exits are deterministic.
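One way to tie a fire-and-forget probe to teardown is to hand back a cancel handle, as in the sketch below. `startBackgroundProbe` and `BackgroundProbe` are hypothetical names, not part of the PR. The sketch only prevents a late result from mutating state after shutdown; releasing the process early would additionally require the underlying I/O to honor an abort signal, which is not shown here.

```typescript
// Hypothetical handle returned to the caller so teardown can drop late results.
interface BackgroundProbe {
  cancel(): void;
}

function startBackgroundProbe<T>(
  probe: () => Promise<T>,
  onResult: (value: T) => void,
): BackgroundProbe {
  let cancelled = false;
  probe().then(
    (value) => {
      // Apply the result only if teardown has not happened in the meantime.
      if (!cancelled) {
        onResult(value);
      }
    },
    () => {
      // Probe failures are deliberately ignored (fire-and-forget semantics).
    },
  );
  return {
    cancel: () => {
      cancelled = true;
    },
  };
}
```

A TUI teardown path would call `cancel()` on the handle before exiting, so a probe that resolves afterwards becomes a no-op.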

cubic-dev-ai · 2026-04-21T10:37:06Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

gemini-code-assist Bot reviewed Apr 21, 2026
chatgpt-codex-connector Bot reviewed Apr 21, 2026
cubic-dev-ai Bot reviewed Apr 21, 2026
Comment thread packages/headless/src/runner.ts Outdated
Comment thread packages/headless/README.md
chatgpt-codex-connector Bot reviewed Apr 21, 2026
cubic-dev-ai Bot reviewed Apr 21, 2026
Comment thread packages/tui/src/agent-tui.ts
chatgpt-codex-connector Bot reviewed Apr 21, 2026
cubic-dev-ai Bot reviewed Apr 21, 2026
Comment thread packages/cea/benchmark/test_trajectory.py Outdated
chatgpt-codex-connector Bot reviewed Apr 21, 2026
cubic-dev-ai Bot reviewed Apr 21, 2026
Comment thread packages/tui/src/agent-tui.ts Outdated
cubic-dev-ai Bot reviewed Apr 21, 2026
Comment thread packages/harness/src/types.ts Outdated
chatgpt-codex-connector Bot reviewed Apr 21, 2026
coderabbitai Bot reviewed Apr 21, 2026
cubic-dev-ai Bot reviewed Apr 21, 2026
Comment thread packages/tui/src/agent-tui.ts Outdated
Comment thread packages/headless/README.md Outdated
Comment thread packages/headless/AGENTS.md Outdated
cubic-dev-ai Bot reviewed Apr 21, 2026
chatgpt-codex-connector Bot reviewed Apr 21, 2026
cubic-dev-ai Bot reviewed Apr 21, 2026
chatgpt-codex-connector Bot reviewed Apr 21, 2026

minpeter merged commit 496ffdb into main Apr 21, 2026
6 checks passed

minpeter deleted the feat/prompt-processing-indicators branch April 22, 2026 09:59

		showLoader("Working...");
		await config.onStreamStart?.(phase);

		const stream = await config.agent.stream(

		showLoader("Processing...");

		const preparedTurn = await prepareMessages(phase);

		const runBackgroundStartupProbe = (): void => {
		measureUsageIfAvailable([]).then(ignore, ignore);

Addressed all automated reviewer feedback (commit `245214f`)

Addressed Cubic P1: observer-hook error isolation (commit `d855e36`)

Addressed Cubic P2: `onFirstStreamPart` signature corrected (commit `91008d0`)
