Problem
A model that overthinks or gets into a repetition loop until it hits max_tokens currently scores as a plain verifier_fail — indistinguishable from a genuinely wrong answer. There is no finish_reason == "length" detection in the runner today, so "the model ran out of budget mid-thought" and "the model produced a confidently wrong answer" land in the same bucket.
This surfaced in club-3090 discussions/241: a known failure mode of Qwen3.6-35B-A3B is that it "tends to get into a loop or overthink," and one quant's run got stuck + took ~2× the runtime of its siblings — but the breakdown couldn't tell us whether that was the model looping (truncated at the token cap) or the harness hanging.
Proposal
- Detect finish_reason == "length" (and optionally a cheap repetition heuristic) and classify it as a distinct failure_mode (e.g. token_limit / output_truncated) instead of folding it into verifier_fail.
- Add it to the canonical failure-mode list (benchlocal_cli/types.py), the end-of-run "Failure breakdown:" output, and the saved JSON, so it shows up in inspect --mode token_limit.
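A minimal sketch of the classification step, not the runner's actual code. The function name classify_failure and its arguments are hypothetical. Only the label names and the finish_reason == "length" check come from this issue.

```python
from typing import Optional

# Label names proposed in this issue; the final name is still open
# (token_limit vs. output_truncated).
TOKEN_LIMIT = "token_limit"
VERIFIER_FAIL = "verifier_fail"


def classify_failure(finish_reason: Optional[str], verifier_passed: bool) -> Optional[str]:
    """Return a failure_mode label, or None when the scenario passed.

    Hypothetical helper: assumes the runner can see the completion's
    finish_reason next to the verifier verdict.
    """
    if verifier_passed:
        # A truncated completion that still verifies is treated as a pass here.
        return None
    if finish_reason == "length":
        # The model hit max_tokens before finishing: truncated, not simply wrong.
        return TOKEN_LIMIT
    return VERIFIER_FAIL
```

The truncation check runs before the verifier_fail fallback, so a completion cut off at the cap is never reported as a confidently wrong answer.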
Why it matters
Per-scenario latency_seconds already flags the slow outliers, but a first-class failure mode makes looped vs. hung vs. genuinely wrong legible at a glance:
- token_limit → model output truncated at the cap (raise the budget, or it's looping)
- agent_runner_timeout / agent_runner_crashed → sandboxed-agent path (already exists)
- verifier_fail → ran to completion, answer wrong
Notes
- The agentic packs already separate agent_runner_timeout; this is the gap on the non-agentic completion path.
- Pairs naturally with the per-scenario tokens already captured (tokens_completion): a near-max_tokens completion + finish_reason == "length" is the strong signal.
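A rough sketch of how the two signals and the optional repetition heuristic could be combined. The helper names, the 0.95 slack factor, and the n-gram window and threshold are illustrative assumptions, not values taken from the runner.

```python
def truncation_signal(finish_reason, tokens_completion, max_tokens, slack=0.95):
    """Grade the evidence that a completion was cut off at the token cap.

    Hypothetical helper; slack is an assumed tolerance for "near max_tokens".
    """
    hit_cap = finish_reason == "length"
    near_cap = max_tokens > 0 and tokens_completion >= slack * max_tokens
    if hit_cap and near_cap:
        return "strong"
    if hit_cap or near_cap:
        return "weak"
    return "none"


def looks_repetitive(text, n=8, threshold=0.5):
    """Cheap loop check: share of word n-grams that repeat an earlier n-gram.

    The window size n and the threshold are assumptions chosen for illustration.
    """
    words = text.split()
    if len(words) < 2 * n:
        return False  # too short to judge
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    repeated = len(grams) - len(set(grams))
    return repeated / len(grams) >= threshold
```

A "strong" signal together with looks_repetitive(...) returning True would point at a loop. A "strong" signal without repetition suggests the budget was simply too small.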
Reported by @laurimyllari in club-3090 discussions/241.