ClawSweeper re-review fails with Codex transient_transport on moderate fork PRs (3/3 retries, then falls back to off-meta tidepool) #282

@wangwllu

Symptom

@clawsweeper re-review on a fork PR consistently fails: Codex reports reason=transient_transport on every retry, and ClawSweeper then falls back to a placeholder review:

  • decision=keep_open confidence=low action=kept_open
  • review_summary: "Review failed before ClawSweeper could summarize the requested change."
  • routed verdict label: 🌊 off-meta tidepool / [P1] Review did not complete (retryable codex transport failure)

The run itself reports conclusion: success (all GitHub Actions steps succeed), so the failure is visible only in the logs of step 13, "Review exact event item".

Repro

PR: openclaw/openclaw#92181 (fork PR, +135 src / +103 tests, 4 files, body 4243 B, diff 21460 B — moderate size).

Three independent @clawsweeper re-review attempts:

Time (UTC)        | Run                                                              | Outcome
2026-06-11 14:45Z | https://github.com/openclaw/clawsweeper/actions/runs/27355240940 | Failed (transient_transport)
2026-06-11 16:16Z | (status comment shows same pattern)                              | Failed (transient_transport)
2026-06-12 02:10Z | https://github.com/openclaw/clawsweeper/actions/runs/27389971002 | Failed (transient_transport)

Run 27389971002 step 13 log excerpt

[review] 2026-06-12T02:11:27Z shard=0/1 selected=1 scanned_pages=0
[review] 2026-06-12T02:11:27Z shard=0/1 start #92181 (1/1)
[review] 2026-06-12T02:11:35Z shard=0/1 start-comment=existing #92181
[review] 2026-06-12T02:12:28Z codex-retry #92181 attempt=2/3 delay_ms=15000 reason=transient_transport
[review] 2026-06-12T02:13:30Z codex-retry #92181 attempt=3/3 delay_ms=30000 reason=transient_transport
[review] 2026-06-12T02:14:10Z shard=0/1 done #92181 (1/1) decision=keep_open confidence=low action=kept_open
[review] 2026-06-12T02:14:10Z shard=0/1 complete reviewed=1
Error: Codex failed for 1 item; review artifacts were written and the workflow recovery lane can requeue the planned set.
ELIFECYCLE  Command failed with exit code 1.

Each attempt completed in well under the configured --codex-timeout-ms 600000 and timeout 12m outer limits — Codex returned the transport error early rather than hitting the time budget.
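The timing claim can be checked directly from the excerpt. The sketch below is only an illustration: it parses four lines copied from the step 13 log above and computes the gaps between them. Each gap is an upper bound on one Codex attempt, since it may also include the backoff delay and the start-comment step. The helper names (parse_ts, events, gaps_seconds) are made up for this example and are not part of ClawSweeper.

```python
import re
from datetime import datetime, timezone

# Four lines copied verbatim from the run 27389971002 step 13 excerpt.
LOG = """\
[review] 2026-06-12T02:11:27Z shard=0/1 start #92181 (1/1)
[review] 2026-06-12T02:12:28Z codex-retry #92181 attempt=2/3 delay_ms=15000 reason=transient_transport
[review] 2026-06-12T02:13:30Z codex-retry #92181 attempt=3/3 delay_ms=30000 reason=transient_transport
[review] 2026-06-12T02:14:10Z shard=0/1 done #92181 (1/1) decision=keep_open confidence=low action=kept_open
"""

CODEX_TIMEOUT_MS = 600_000  # value passed as --codex-timeout-ms

LINE_RE = re.compile(r"^\[review\] (\S+Z) (.*)$")


def parse_ts(text):
    """Parse the UTC timestamp format used in the log lines."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def events(log):
    """Return (timestamp, message) pairs for every [review] line."""
    out = []
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            out.append((parse_ts(match.group(1)), match.group(2)))
    return out


def gaps_seconds(log):
    """Seconds between consecutive log lines (upper bound per attempt)."""
    evs = events(log)
    return [int((b[0] - a[0]).total_seconds()) for a, b in zip(evs, evs[1:])]


if __name__ == "__main__":
    gaps = gaps_seconds(LOG)
    print("gaps (s):", gaps)
    print("longest gap under timeout:", max(gaps) * 1000 < CODEX_TIMEOUT_MS)
```

The longest gap is about a minute, an order of magnitude below the 600000 ms per-call budget, which is consistent with Codex returning the transport error early.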

Codex CLI invocation

From the same job log:

timeout --kill-after=30s 12m pnpm run review -- \
  --target-repo openclaw/openclaw \
  --target-dir openclaw \
  --artifact-dir artifacts/event \
  --batch-size 1 \
  --max-pages 1 \
  --codex-model internal \
  --codex-reasoning-effort high \
  --codex-sandbox danger-full-access \
  --codex-timeout-ms 600000 \
  --item-numbers 92181 \
  --readonly-openclaw \
  --shard-index 0 \
  --shard-count 1

The cache-hit step installed codex-cli 0.139.0.

Why this looks like a backend issue, not a content issue

  • Three independent attempts over two days show an identical retry pattern, so this is not a one-off flake
  • All three exhausted the 3-retry budget without ever returning a non-transport error, which suggests the failure is at the connection / streaming layer rather than in the model output
  • PR diff and body are moderate; this isn't a giant patch
  • Other PRs reviewed by the same workflow during the same window completed normally (e.g. recent runs against unrelated openclaw/openclaw PRs in the same dispatch event window succeeded), so the runner / setup / token plumbing is fine

Possible directions

The failing attempts all use --codex-reasoning-effort high, which produces long thinking streams. If the internal Codex backend or any intermediate proxy has a stream / idle timeout shorter than the long-tail thinking time on certain prompts, the connection would close and surface as transient_transport regardless of the explicit per-call timeout. That would also explain why specifically this PR keeps reproducing while other PRs in the same workflow complete.

Possible mitigations on the ClawSweeper side, in case the upstream Codex fix is not quick:

  • Per-PR fallback path that retries a failed review at lower reasoning effort instead of marking 🌊 off-meta tidepool / Review did not complete
  • Surface transient_transport in the verdict comment (currently it only says the re-review "did not finish cleanly", which reads like a queue / dispatcher problem)
  • Optional additional-prompt knob to let an author request a smaller-context retry without changing the underlying PR
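The first two mitigations could look roughly like the sketch below. It is a sketch under stated assumptions, not ClawSweeper's actual code: TransientTransportError, review_with_fallback, and the run_review callback are hypothetical names, and the "medium" and "low" effort values are assumed to be accepted by --codex-reasoning-effort (only "high" appears in the job log). Backoff delays are omitted to keep the example short.

```python
class TransientTransportError(Exception):
    """Hypothetical stand-in for a Codex transient_transport failure."""


# "high" comes from the job log; the lower levels are assumptions.
EFFORT_LADDER = ("high", "medium", "low")


def review_with_fallback(run_review, efforts=EFFORT_LADDER, attempts_per_effort=3):
    """Try each effort level in turn, recording every transport failure.

    run_review: callable taking an effort string and returning a review,
    or raising TransientTransportError.
    """
    failures = []
    for effort in efforts:
        for attempt in range(1, attempts_per_effort + 1):
            try:
                result = run_review(effort)
            except TransientTransportError:
                failures.append((effort, attempt, "transient_transport"))
                continue
            return {"result": result, "effort": effort, "failures": failures}
    # Every level failed: the caller can put the recorded reasons
    # in the verdict comment instead of a generic message.
    return {"result": None, "effort": None, "failures": failures}


if __name__ == "__main__":
    def fake_review(effort):
        # Simulated backend: only the "high" effort stream gets cut.
        if effort == "high":
            raise TransientTransportError()
        return f"review completed at effort={effort}"

    outcome = review_with_fallback(fake_review)
    print(outcome["result"])
    print(len(outcome["failures"]), "transport failures before success")
```

Returning the failures list alongside the result means the verdict comment can name transient_transport explicitly whether or not a lower-effort retry eventually succeeds.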

Affected user-visible state

  • PR carries 🌊 off-meta tidepool rating with [P1] Review did not complete (retryable codex transport failure)
  • Per-PR clawsweeper-command-progress:end block shows State: Failed / Detail: The targeted re-review did not finish cleanly. Check the workflow run for details.
  • No path forward for the contributor short of pinging a maintainer, since further @clawsweeper re-review calls reproduce the same outcome

Happy to provide more logs, retry traces, or attempt a smaller-diff variant of the same PR if it helps narrow down whether prompt size, reasoning effort, or backend capacity is the dominant factor.

Metadata

Assignees

No one assigned

Labels

  • P1: Urgent regression or broken agent/channel workflow affecting real users now.
  • clawsweeper:current-main-repro: ClawSweeper found a high-confidence current-main issue reproduction.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • impact:other: This issue has meaningful maintainer-visible impact outside the owned taxonomy.
  • issue-rating: 🦀 challenger crab: Exceptional issue quality: high-confidence current-main reproduction and actionable evidence.