Add a length/token-limit failure_mode (finish_reason == "length") to separate overthink-truncation from verifier_fail

## Problem

A model that overthinks or gets into a repetition loop until it hits `max_tokens` currently scores as a plain `verifier_fail` — **indistinguishable from a genuinely wrong answer.** There is no `finish_reason == "length"` detection in the runner today, so "the model ran out of budget mid-thought" and "the model produced a confidently wrong answer" land in the same bucket.

This surfaced in [club-3090 discussions/241](https://github.com/noonghunna/club-3090/discussions/241): a known failure mode of Qwen3.6-35B-A3B is that it "tends to get into a loop or overthink," and one quant's run got stuck and took ~2× the runtime of its siblings. The failure breakdown couldn't tell us whether that was the model looping (truncated at the token cap) or the harness hanging.

## Proposal

- Detect `finish_reason == "length"` (and optionally a cheap repetition heuristic) and classify it as a distinct `failure_mode` — e.g. `token_limit` / `output_truncated` — instead of folding it into `verifier_fail`.
- Add it to the canonical failure-mode list (`benchlocal_cli/types.py`) and the end-of-run `Failure breakdown:` + saved JSON, so it shows up in `inspect --mode token_limit`.
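
A minimal sketch of what the classification could look like, assuming the runner has access to an OpenAI-style `finish_reason` for each completion. The function names `classify_failure` and `looks_repetitive`, the `verifier_passed` parameter, and the heuristic's thresholds are all hypothetical and not taken from the existing `benchlocal_cli` code; `token_limit` is one of the candidate names from the proposal.

```python
from typing import Optional


def looks_repetitive(text: str, window: int = 40, min_repeats: int = 4) -> bool:
    """Cheap repetition heuristic (hypothetical): report True when the
    trailing `window`-character chunk occurs at least `min_repeats` times."""
    if len(text) < window * min_repeats:
        return False
    tail = text[-window:]
    # str.count counts non-overlapping occurrences of the tail chunk.
    return text.count(tail) >= min_repeats


def classify_failure(finish_reason: Optional[str], verifier_passed: bool) -> Optional[str]:
    """Return a failure_mode string, or None if the scenario passed."""
    if verifier_passed:
        return None
    if finish_reason == "length":
        # Output was cut off at the token cap: keep it out of verifier_fail.
        return "token_limit"
    return "verifier_fail"
```

In this sketch a truncated completion that still passes the verifier is not counted as a failure; whether that is the desired behaviour is an open design choice. The repetition heuristic is kept separate so it can be recorded as an extra hint rather than changing the class on its own.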

## Why it matters

Per-scenario `latency_seconds` already flags the slow outliers, but a first-class failure mode makes **looped vs. hung vs. genuinely wrong** legible at a glance:
- `token_limit` → output truncated at `max_tokens` (either the budget needs raising, or the model is looping)
- `agent_runner_timeout` / `agent_runner_crashed` → sandboxed-agent path (already exists)
- `verifier_fail` → ran to completion, answer wrong
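
To illustrate how the three buckets would separate in a breakdown, here is a small sketch that tallies failure modes over per-scenario results. The record shape (a dict with a `failure_mode` key that is `None` on success) and the function name `failure_breakdown` are assumptions for illustration, not the actual saved-JSON schema.

```python
from collections import Counter


def failure_breakdown(results):
    """Count scenarios per failure_mode, skipping scenarios that passed."""
    return Counter(r["failure_mode"] for r in results if r.get("failure_mode"))


# Hypothetical per-scenario results.
results = [
    {"scenario": "a", "failure_mode": None},
    {"scenario": "b", "failure_mode": "token_limit"},
    {"scenario": "c", "failure_mode": "verifier_fail"},
    {"scenario": "d", "failure_mode": "token_limit"},
]
breakdown = failure_breakdown(results)
```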

## Notes

- The agentic packs already separate `agent_runner_timeout`; this is the gap on the **non-agentic completion path**.
- Pairs naturally with the per-scenario tokens already captured (`tokens_completion`) — a near-`max_tokens` completion + `finish_reason == "length"` is the strong signal.
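
The "strong signal" in the last note could be checked roughly as follows. This is a sketch under assumptions: the function name `is_strong_truncation_signal` and the `slack` threshold are hypothetical, and it presumes the configured `max_tokens` is available next to the captured `tokens_completion`.

```python
def is_strong_truncation_signal(finish_reason, tokens_completion, max_tokens, slack=0.98):
    """True when the completion stopped for length AND used (nearly) the
    whole token budget. `slack` tolerates small token-count differences."""
    if finish_reason != "length" or max_tokens <= 0:
        return False
    return tokens_completion >= slack * max_tokens
```

Requiring both conditions guards against cases where `finish_reason == "length"` is reported for a reason other than exhausting the completion budget.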

_Reported by @laurimyllari in club-3090 discussions/241._

