Problem or motivation
agentevals currently focuses on correctness evaluation (tool trajectory matching, response quality). But in production, performance matters just as much — an agent that gets the right answer but burns 1M tokens and takes 5 minutes is not production-ready.
Proposed solution
Add three built-in performance evaluators that score agents automatically from trace data — no human in the loop, no LLM judge needed.
1. Token Efficiency (`token_efficiency`)
Scores how efficiently the agent used tokens relative to a budget.
```yaml
evaluators:
  - name: token_efficiency
    type: builtin
    config:
      max_tokens: 200000   # budget
      weight_input: 0.7    # input tokens weighted more (they're the cost driver)
      weight_output: 0.3
```
Scoring:
- Extracts `gen_ai.usage.input_tokens` + `gen_ai.usage.output_tokens` from trace spans
- Score = `1.0 - (actual_tokens / max_tokens)`, clamped to [0, 1]
- Score 1.0 = very efficient, 0.0 = budget exceeded
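The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the actual agentevals implementation: the function name is hypothetical, and how the two weights combine with the budget is an assumption.

```python
def token_efficiency_score(input_tokens: int, output_tokens: int,
                           max_tokens: int = 200_000,
                           weight_input: float = 0.7,
                           weight_output: float = 0.3) -> float:
    """Score token usage against a budget; 1.0 = very efficient, 0.0 = budget exceeded."""
    # One plausible use of the weights: a weighted token total scored against
    # the budget (the real combination rule may differ).
    weighted = weight_input * input_tokens + weight_output * output_tokens
    return max(0.0, min(1.0, 1.0 - weighted / max_tokens))
```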
Why it matters: From our AI Agent Benchmark, token usage varied 8x across solutions for the same task (185K vs 1.6M tokens). This evaluator catches regressions.
2. Tool Efficiency (`tool_efficiency`)
Scores whether the agent used tools effectively — penalizes waste.
```yaml
evaluators:
  - name: tool_efficiency
    type: builtin
    config:
      max_tool_calls: 15         # budget
      penalize_duplicates: true  # repeated identical calls
      penalize_errors: true      # failed tool calls
```
Scoring:
- Count total tool call spans from trace
- Identify duplicates (same tool + same args called twice)
- Identify errors (tool spans with error status)
- Score = `(useful_calls / total_calls) * (1.0 - budget_overrun_penalty)`
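A minimal sketch of this computation, assuming tool calls have already been extracted from the trace as `(tool_name, args, ok)` tuples. The function name, input shape, and the exact overrun penalty are assumptions for illustration:

```python
from collections import Counter

def tool_efficiency_score(calls, max_tool_calls=15,
                          penalize_duplicates=True, penalize_errors=True):
    """calls: list of (tool_name, args_repr, ok) tuples taken from tool spans."""
    total = len(calls)
    if total == 0:
        return 1.0  # no tool calls: nothing wasted
    seen = Counter()
    useful = 0
    for name, args, ok in calls:
        is_dup = penalize_duplicates and seen[(name, args)] > 0
        is_err = penalize_errors and not ok
        if not is_dup and not is_err:
            useful += 1
        seen[(name, args)] += 1
    # Assumed penalty: the fraction by which the call count exceeds the budget.
    overrun = max(0, total - max_tool_calls) / max_tool_calls
    return (useful / total) * (1.0 - min(1.0, overrun))
```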
What it catches:
- Agent stuck in a loop calling the same tool repeatedly
- Agent calling tools whose results it never uses
- Agent exceeding reasonable tool call limits
3. Time Efficiency (`time_efficiency`)
Scores how quickly the agent resolved the task relative to a time budget.
```yaml
evaluators:
  - name: time_efficiency
    type: builtin
    config:
      max_duration_s: 120  # budget in seconds
```
Scoring:
- Extract root span duration from trace
- Score = `1.0 - (actual_duration / max_duration)`, clamped to [0, 1]
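The two steps above can be sketched as follows, assuming spans are simple dicts with a `parent_id` and start/end times in seconds (real OTel spans carry nanosecond timestamps and a richer structure; the function name is hypothetical):

```python
def time_efficiency_score(spans, max_duration_s=120.0):
    """spans: list of dicts with 'parent_id', 'start_time', 'end_time' (seconds)."""
    # The root span has no parent; its wall-clock duration is the agent's total time.
    root = next(s for s in spans if s.get("parent_id") is None)
    duration = root["end_time"] - root["start_time"]
    return max(0.0, min(1.0, 1.0 - duration / max_duration_s))
```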
Eval Set Integration
These evaluators can be combined with `performance_budget` in eval cases:
```json
{
  "eval_id": "crashloop_diagnosis",
  "conversation": [...],
  "performance_budget": {
    "max_tokens": 200000,
    "max_duration_s": 120,
    "max_tool_calls": 10
  }
}
```
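One plausible interaction between the two config sources, assuming per-case `performance_budget` values simply override the evaluator-level defaults (the merge semantics and function name are assumptions, not the actual agentevals behavior):

```python
def resolve_budgets(case: dict, defaults: dict) -> dict:
    """Per-case performance_budget entries override evaluator-level defaults."""
    return {**defaults, **case.get("performance_budget", {})}
```

So a case that only sets `max_tool_calls` would inherit the evaluator's token and duration budgets unchanged.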
CI/CD Gating
```shell
agentevals run trace.json \
  --eval-set k8s-sre.json \
  -m tool_trajectory_avg_score \
  -m token_efficiency \
  -m tool_efficiency \
  -m time_efficiency \
  --threshold 0.7
```
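Whether `--threshold` gates each metric individually or an aggregate is an implementation decision; a per-metric gate could look like this (the function name is hypothetical):

```python
def gate(scores, threshold=0.7):
    """Return a CI-friendly exit code: 0 if every selected metric meets the threshold."""
    failing = {m: s for m, s in scores.items() if s < threshold}
    for metric, score in sorted(failing.items()):
        print(f"FAIL {metric}: {score:.2f} < {threshold}")
    return 1 if failing else 0
```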
Alternatives considered
No response
Additional context
Three new evaluators following the custom evaluator protocol, but shipped as builtins:
src/agentevals/evaluator/token_efficiency.py
src/agentevals/evaluator/tool_efficiency.py
src/agentevals/evaluator/time_efficiency.py
They use `extract_performance_metrics()` from `trace_metrics.py`, which already extracts token counts, latencies, and tool calls from OTel spans.
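For context, the token-count part of that extraction amounts to summing the OTel GenAI semantic-convention attributes across spans. This is a simplified illustration over plain dicts, not the real `extract_performance_metrics()`:

```python
def extract_token_counts(spans):
    """Sum gen_ai.usage.* attributes (OTel GenAI semantic conventions) across spans."""
    inp = sum(s.get("attributes", {}).get("gen_ai.usage.input_tokens", 0) for s in spans)
    out = sum(s.get("attributes", {}).get("gen_ai.usage.output_tokens", 0) for s in spans)
    return inp, out
```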
Human confirmation