Add cost-protection pause for suspicious long-running sessions #2957

@franksong2702

Problem

A real WebUI session on the stable local runtime showed how a long-running agent turn can keep consuming model calls and time after it has likely entered a low-value recovery path. By the time the turn was manually cancelled, after about 6 minutes, it had already made repeated model calls, context-compression attempts, and tool retries. The task had extracted the target article body but had not progressed to writing the file or returning a final answer.

This is not a user-error problem. Users should not need to read live agent logs to tell whether a turn is working productively, retrying, compressing, or stuck in a high-cost recovery path.

Desired behavior

WebUI should not automatically decide that the task failed. Instead, when objective high-risk signals accumulate, WebUI should pause before the next expensive step and ask the user what to do.

Example copy:

This run may be stuck in a high-cost recovery path: 13 model calls, 11 context compressions, 2 tool errors, and no final assistant output yet. Continue?

Actions:

  • Continue
  • Stop
  • Summarize and stop
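
A minimal sketch of how the example copy could be generated from per-turn counters. The names `RunCounters`, `PauseAction`, and `build_pause_prompt` are hypothetical and only illustrate the idea; they are not existing WebUI identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class PauseAction(str, Enum):
    """Hypothetical action identifiers for the confirmation card."""
    CONTINUE = "continue"
    STOP = "stop"
    SUMMARIZE_AND_STOP = "summarize_and_stop"


@dataclass(frozen=True)
class RunCounters:
    """Hypothetical per-turn counters collected by the backend."""
    model_calls: int
    context_compressions: int
    tool_errors: int
    has_final_output: bool


def _plural(count: int, noun: str) -> str:
    # Naive pluralization; sufficient for the nouns used here.
    return f"{count} {noun}" if count == 1 else f"{count} {noun}s"


def build_pause_prompt(counters: RunCounters) -> str:
    """Render the user-facing pause message from objective counters only."""
    summary = ", ".join([
        _plural(counters.model_calls, "model call"),
        _plural(counters.context_compressions, "context compression"),
        _plural(counters.tool_errors, "tool error"),
    ])
    if not counters.has_final_output:
        summary += ", and no final assistant output yet"
    return (
        "This run may be stuck in a high-cost recovery path: "
        f"{summary}. Continue?"
    )
```

Building the message purely from counters keeps the copy factual: it reports what happened and leaves the judgment about whether the task failed to the user.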

Candidate risk signals

  • Long active run age with no final assistant output
  • Repeated context compression in one turn
  • Compression timeout or repeated compression failure
  • Repeated API retry or provider connection failures
  • Repeated tool errors for the same target
  • High model-call or tool-call count for one user turn
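
The candidate signals above could be evaluated roughly as follows. This is a sketch under stated assumptions: every threshold value, field name, and signal label here is illustrative, and the rule "pause when at least two signals fire" is one possible policy rather than a decided one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class RiskThresholds:
    # All limits are illustrative assumptions, not tuned defaults.
    run_age_limit_s: float = 300.0
    compression_limit: int = 5
    compression_failure_limit: int = 1
    api_retry_limit: int = 3
    tool_errors_per_target_limit: int = 3
    model_call_limit: int = 10
    tool_call_limit: int = 25
    min_signals_to_pause: int = 2


@dataclass
class RunStats:
    # Hypothetical counters for a single user turn.
    run_age_s: float = 0.0
    has_final_output: bool = False
    compressions: int = 0
    compression_failures: int = 0  # timeouts are counted as failures
    api_retries: int = 0
    tool_errors_by_target: dict[str, int] = field(default_factory=dict)
    model_calls: int = 0
    tool_calls: int = 0


def triggered_signals(stats: RunStats, limits: RiskThresholds) -> list[str]:
    """Return the labels of every risk signal that has reached its limit."""
    signals: list[str] = []
    if stats.run_age_s >= limits.run_age_limit_s and not stats.has_final_output:
        signals.append("long_run_without_output")
    if stats.compressions >= limits.compression_limit:
        signals.append("repeated_compression")
    if stats.compression_failures >= limits.compression_failure_limit:
        signals.append("compression_failure")
    if stats.api_retries >= limits.api_retry_limit:
        signals.append("repeated_api_retry")
    worst_target = max(stats.tool_errors_by_target.values(), default=0)
    if worst_target >= limits.tool_errors_per_target_limit:
        signals.append("repeated_tool_error_same_target")
    if (stats.model_calls >= limits.model_call_limit
            or stats.tool_calls >= limits.tool_call_limit):
        signals.append("high_call_count")
    return signals


def should_pause(stats: RunStats, limits: RiskThresholds) -> bool:
    """Pause only when several independent signals accumulate."""
    return len(triggered_signals(stats, limits)) >= limits.min_signals_to_pause
```

Requiring more than one signal before pausing is meant to avoid interrupting runs that are merely long; a single slow but healthy turn would trip at most one check.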

Scope for first PR

A first slice should be deliberately narrow:

  • Add a backend cost-protection/risk gate at safe run-loop boundaries, before issuing the next model/tool call where feasible.
  • Emit a structured event to WebUI when the gate pauses a run.
  • Render a confirmation card in the existing chat flow with Continue and Stop actions.
  • Keep auto-stop out of scope unless the user has configured a hard budget.
  • Preserve existing Stop/cancel behavior.
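
The first two scope items can be sketched as a gate checked between steps of the run loop, emitting one structured event when it pauses. The event name `run_paused`, the payload fields, and the `run_turn` signature are assumptions made for illustration; the actual run loop and SSE event schema are defined by the project's contracts, not by this sketch.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class GateDecision:
    pause: bool
    reasons: list[str] = field(default_factory=list)


def format_sse(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events frame: event name plus JSON data."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def run_turn(
    steps: Sequence[Callable[[], None]],
    gate: Callable[[int], GateDecision],
    emit: Callable[[str], None],
    ask_user: Callable[[], str],
) -> str:
    """Run steps in order, consulting the gate only at step boundaries.

    The gate is never evaluated inside a step, so an in-flight model or
    compression call is not interrupted. The loop never stops on its own:
    it stops only if the user answers "stop".
    """
    for index, step in enumerate(steps):
        decision = gate(index)
        if decision.pause:
            emit(format_sse("run_paused", {  # hypothetical event name
                "step": index,
                "reasons": decision.reasons,
                "actions": ["continue", "stop"],
            }))
            if ask_user() == "stop":
                return "stopped_by_user"
        step()  # the next (potentially expensive) model or tool call
    return "completed"
```

Checking the gate before each step, rather than during it, matches the non-goal of not pausing inside an already-blocking call, and leaves any existing Stop/cancel path untouched.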

Non-goals

  • Do not let WebUI silently judge a task as failed.
  • Do not pause inside an already-blocking model/compression call unless that call already supports cancellation.
  • Do not introduce a new runner process or large runtime-adapter migration in this slice.

Evidence from local incident

Observed on a local 8787 stable runtime session while clipping a WeChat article:

  • Active for about 6 minutes before manual cancellation
  • 13 model API calls
  • 11 context-compression attempts
  • 2 tool errors
  • At least one 120s auxiliary compression timeout
  • Article body had been extracted, but no final saved Markdown was produced

Contract routing

Task type: runtime / streaming / user-facing safety UX
Touched areas: run lifecycle, SSE events, chat UI, cancellation/continue controls
Relevant docs:

  • AGENTS.md
  • CONTRIBUTING.md
  • docs/CONTRACTS.md
  • docs/rfcs/webui-run-state-consistency-contract.md
  • docs/rfcs/hermes-run-adapter-contract.md
  • docs/UIUX-GUIDE.md
  • DESIGN.md
