Problem or motivation
agentevals currently focuses on correctness evaluation (tool trajectory matching, response quality). But in production, performance matters just as much — an agent that gets the right answer but burns 1M tokens and takes 5 minutes is not production-ready.
Proposed solution
Add three built-in performance evaluators that score agents automatically from trace data — no human in the loop, no LLM judge needed.
1. Token Efficiency (`token_efficiency`)
Scores how efficiently the agent used tokens relative to a budget.
```yaml
evaluators:
  - name: token_efficiency
    type: builtin
    config:
      max_tokens: 200000   # budget
      weight_input: 0.7    # input tokens weighted more (they're the cost driver)
      weight_output: 0.3
```
Scoring:
- Extracts `gen_ai.usage.input_tokens` + `gen_ai.usage.output_tokens` from trace spans
- Score = `1.0 - (actual_tokens / max_tokens)`, clamped to [0, 1]
- Score 1.0 = very efficient, 0.0 = budget exceeded
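The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the actual agentevals implementation: the function name is hypothetical, and how the two weights combine with the budget is an assumption.

```python
def token_efficiency_score(input_tokens: int, output_tokens: int,
                           max_tokens: int = 200_000,
                           weight_input: float = 0.7,
                           weight_output: float = 0.3) -> float:
    """Score token usage against a budget; 1.0 = very efficient, 0.0 = budget exceeded."""
    # One plausible use of the weights: a weighted token total scored against
    # the budget (the real combination rule may differ).
    weighted = weight_input * input_tokens + weight_output * output_tokens
    return max(0.0, min(1.0, 1.0 - weighted / max_tokens))
```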
Why it matters: From our AI Agent Benchmark, token usage varied 8x across solutions for the same task (185K vs 1.6M tokens). This evaluator catches regressions.
2. Tool Efficiency (`tool_efficiency`)
Scores whether the agent used tools effectively — penalizes waste.
```yaml
evaluators:
  - name: tool_efficiency
    type: builtin
    config:
      max_tool_calls: 15         # budget
      penalize_duplicates: true  # repeated identical calls
      penalize_errors: true      # failed tool calls
```
Scoring:
- Count total tool call spans from trace
- Identify duplicates (same tool + same args called twice)
- Identify errors (tool spans with error status)
- Score = `(useful_calls / total_calls) * (1.0 - budget_overrun_penalty)`
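A minimal sketch of this computation, assuming tool calls have already been extracted from the trace as `(tool_name, args, ok)` tuples. The function name, input shape, and the exact overrun penalty are assumptions for illustration:

```python
from collections import Counter

def tool_efficiency_score(calls, max_tool_calls=15,
                          penalize_duplicates=True, penalize_errors=True):
    """calls: list of (tool_name, args_repr, ok) tuples taken from tool spans."""
    total = len(calls)
    if total == 0:
        return 1.0  # no tool calls: nothing wasted
    seen = Counter()
    useful = 0
    for name, args, ok in calls:
        is_dup = penalize_duplicates and seen[(name, args)] > 0
        is_err = penalize_errors and not ok
        if not is_dup and not is_err:
            useful += 1
        seen[(name, args)] += 1
    # Assumed penalty: the fraction by which the call count exceeds the budget.
    overrun = max(0, total - max_tool_calls) / max_tool_calls
    return (useful / total) * (1.0 - min(1.0, overrun))
```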
What it catches:
- Agent stuck in a loop calling the same tool repeatedly
- Agent calling tools whose results it never uses
- Agent exceeding reasonable tool call limits
3. Time Efficiency (`time_efficiency`)
Scores how quickly the agent resolved the task relative to a time budget.
```yaml
evaluators:
  - name: time_efficiency
    type: builtin
    config:
      max_duration_s: 120  # budget in seconds
```
Scoring:
- Extract root span duration from trace
- Score = `1.0 - (actual_duration / max_duration)`, clamped to [0, 1]
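The two steps above can be sketched as follows, assuming spans are simple dicts with a `parent_id` and start/end times in seconds (real OTel spans carry nanosecond timestamps and a richer structure; the function name is hypothetical):

```python
def time_efficiency_score(spans, max_duration_s=120.0):
    """spans: list of dicts with 'parent_id', 'start_time', 'end_time' (seconds)."""
    # The root span has no parent; its wall-clock duration is the agent's total time.
    root = next(s for s in spans if s.get("parent_id") is None)
    duration = root["end_time"] - root["start_time"]
    return max(0.0, min(1.0, 1.0 - duration / max_duration_s))
```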
Eval Set Integration
These evaluators can be combined with `performance_budget` in eval cases:
```json
{
  "eval_id": "crashloop_diagnosis",
  "conversation": [...],
  "performance_budget": {
    "max_tokens": 200000,
    "max_duration_s": 120,
    "max_tool_calls": 10
  }
}
```
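One plausible interaction between the two config sources, assuming per-case `performance_budget` values simply override the evaluator-level defaults (the merge semantics and function name are assumptions, not the actual agentevals behavior):

```python
def resolve_budgets(case: dict, defaults: dict) -> dict:
    """Per-case performance_budget entries override evaluator-level defaults."""
    return {**defaults, **case.get("performance_budget", {})}
```

So a case that only sets `max_tool_calls` would inherit the evaluator's token and duration budgets unchanged.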
CI/CD Gating
```shell
agentevals run trace.json \
  --eval-set k8s-sre.json \
  -m tool_trajectory_avg_score \
  -m token_efficiency \
  -m tool_efficiency \
  -m time_efficiency \
  --threshold 0.7
```
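Whether `--threshold` gates each metric individually or an aggregate is an implementation decision; a per-metric gate could look like this (the function name is hypothetical):

```python
def gate(scores, threshold=0.7):
    """Return a CI-friendly exit code: 0 if every selected metric meets the threshold."""
    failing = {m: s for m, s in scores.items() if s < threshold}
    for metric, score in sorted(failing.items()):
        print(f"FAIL {metric}: {score:.2f} < {threshold}")
    return 1 if failing else 0
```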
Alternatives considered
No response
Additional context
Three new evaluators following the custom evaluator protocol, but shipped as builtins:
src/agentevals/evaluator/token_efficiency.py
src/agentevals/evaluator/tool_efficiency.py
src/agentevals/evaluator/time_efficiency.py
They use `extract_performance_metrics()` from `trace_metrics.py`, which already extracts token counts, latencies, and tool calls from OTel spans.
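For context, the token-count part of that extraction amounts to summing the OTel GenAI semantic-convention attributes across spans. This is a simplified illustration over plain dicts, not the real `extract_performance_metrics()`:

```python
def extract_token_counts(spans):
    """Sum gen_ai.usage.* attributes (OTel GenAI semantic conventions) across spans."""
    inp = sum(s.get("attributes", {}).get("gen_ai.usage.input_tokens", 0) for s in spans)
    out = sum(s.get("attributes", {}).get("gen_ai.usage.output_tokens", 0) for s in spans)
    return inp, out
```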
Human confirmation