Problem
Neither the Codex nor the OpenCode integration has an e2e test that drives a real agent CLI against a live model and asserts the UI badge lifecycle (waiting → thinking → tool_use → waiting). Our unit tests cover parsers in isolation; our fs-level tests cover watcher plumbing. Nothing covers "the user ran the actual binary and the right thing showed up in Kolu."
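The lifecycle assertion itself can be small. A sketch (names are illustrative, not Kolu's actual API): collapse the stream of observed badge states into its transition sequence, then compare against the expected lifecycle, so that polling noise (repeated samples of the same state) doesn't matter.

```python
def lifecycle(states):
    """Collapse consecutive duplicate badge states into the transition sequence."""
    out = []
    for s in states:
        if not out or out[-1] != s:
            out.append(s)
    return out

# Raw samples from polling the badge; repeats between transitions are expected.
observed = ["waiting", "waiting", "thinking", "thinking", "tool_use", "waiting"]
assert lifecycle(observed) == ["waiting", "thinking", "tool_use", "waiting"]
```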
This is the same gap that claude-code has partially covered via transcript-replay fixtures. It has already bitten us twice on codex-provider: the cumulative-`tokens_used` bug in 944f19d and the cached-token double-count bug in 431edd3 would both have been caught by an e2e assertion on the badge value.
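To illustrate the class of bug involved (a hedged sketch inferred from the bug names, not the actual codex-provider code): if the provider reports a cumulative `tokens_used` counter, the badge must show the per-turn delta, and cached tokens must not be added on top of a total that already includes them.

```python
def turn_tokens(prev_cumulative, curr_cumulative):
    """Per-turn token count from a cumulative counter.
    Displaying curr_cumulative directly is the 944f19d-class mistake."""
    return curr_cumulative - prev_cumulative

def total_tokens(input_tokens, output_tokens, cached_tokens, cached_included_in_input=True):
    """If the API already counts cached tokens inside input_tokens,
    adding them again double-counts (the 431edd3-class mistake)."""
    total = input_tokens + output_tokens
    if not cached_included_in_input:
        total += cached_tokens
    return total

assert turn_tokens(1200, 1500) == 300        # badge shows 300, not 1500
assert total_tokens(1000, 200, 800) == 1200  # cached tokens not re-added
```

An e2e run against a real binary would exercise exactly these code paths, since the badge value is the final observable output.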
Proposed shape
All-Nix, no API keys, deterministic:
- Ollama service (via `services.ollama` in a Nix test shell) with a small model pinned by hash, e.g. `qwen2.5-coder:0.5b` or similar. The model must support tool calls for the `tool_use` branch.
- Scripted agent session: spawn `codex --yolo` (or `opencode run`) in a scratch worktree, point it at the local Ollama endpoint via each tool's standard OpenAI-compatible base-URL env var, and feed it a canned prompt that exercises a pure thinking turn, a tool-using turn, and completion.
- Assertions: tail the agent's state files (SQLite / JSONL) the same way Kolu does, and assert the observed `CodexInfo`/`OpenCodeInfo` sequence matches the expected lifecycle. Runs under the same ambient watcher stack the server uses.
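The driver could look roughly like this (the env var name, JSONL field, and file layout are assumptions for illustration; Ollama's OpenAI-compatible endpoint at `/v1` is real): spawn the agent against local Ollama, then parse its JSONL state file the way Kolu's watcher would and extract the badge-relevant states.

```python
import json
import os
import subprocess

def states_from_jsonl(lines):
    """Extract the badge-relevant state from each JSONL event line.
    The 'state' field name stands in for whatever Kolu actually parses."""
    states = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if "state" in event:
            states.append(event["state"])
    return states

def run_agent(workdir, prompt):
    """Spawn the real agent CLI pointed at the local Ollama endpoint."""
    env = dict(os.environ,
               OPENAI_BASE_URL="http://127.0.0.1:11434/v1")  # Ollama's OpenAI-compat API
    subprocess.run(["codex", "--yolo", prompt],
                   cwd=workdir, env=env, check=True, timeout=300)
```

The test then asserts that the state sequence extracted from the file matches the expected lifecycle, without looking at model output text at all.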
Why Ollama
- No API keys in CI
- Deterministic-enough (we assert on lifecycle transitions, not content)
- Reusable across integrations — codex, opencode, and any future OpenAI-compatible agent
- Already in nixpkgs; zero new dependencies
Non-goals
- Asserting model output text
- Measuring latency
- Covering every agent setting — just the badge-relevant lifecycle
Related