feat: add token_efficiency, tool_efficiency, time_efficiency evaluators #7

henrikrexed wants to merge 5 commits into agentevals-dev:main
Conversation
…rs (issue agentevals-dev#6)

Three built-in performance evaluators that score agents from trace data:
- token_efficiency: scores token usage vs budget (weighted input/output)
- tool_efficiency: penalizes duplicate calls, errors, and budget overruns
- time_efficiency: scores resolution time vs budget

All evaluators:
- Follow the @evaluator SDK pattern (stdin/stdout JSON protocol)
- Handle missing performance_metrics gracefully (neutral scores)
- Support per-invocation scoring
- Pass validate_evaluator.py smoke tests

Closes agentevals-dev#6
krisztianfekete
left a comment
Thank you for the PR, added some comments!
The most important question to solve is how to handle evals based on (performance) data that is not part of the original Invocation model.
@peterj, should we extend the SDK to enable passing these through optionally?
Also, can you please
- follow existing conventions and return `EvalStatus.NOT_EVALUATED` when invocations lack their required metrics?
- remove overly verbose comments
.gitignore (Outdated)

```diff
@@ -1 +1 @@
-.venv/
+.venv/evaluators/__pycache__
```
I guess you meant this to be two lines, to ignore `__pycache__`?
Fixed in 8371bc0: `.venv/` and `__pycache__/` are now on separate lines.
```python
# Check performance_budget on invocation (eval_set integration)
budget = getattr(inv, "performance_budget", None)
if budget is None and hasattr(inv, "__getitem__"):
    try:
        budget = inv["performance_budget"]
    except (KeyError, TypeError):
        pass

# No token data available
return None
```
Can you confirm that this is dead code?
Confirmed, it was dead code. The `performance_budget` block extracted the value but never used it; removed in 8371bc0.
```python
details_items: list[str] = []

for inv in input.invocations:
    tool_calls = inv.intermediate_steps.tool_calls if inv.intermediate_steps else []
```
This is the first time we're using fields that aren't part of the standard ADK Invocation format; we'll have to think a bit about how to handle these.
```python
output = response.get("output", "") if isinstance(response, dict) else getattr(response, "output", "")
output_str = str(output).lower()
# Check common error indicators
if any(marker in output_str for marker in ["error", "failed", "exception", "traceback"]):
```
I think we have to be more sophisticated here, as you can have these substrings in perfectly fine outputs as well.
This is still relevant, right?
Yes, still relevant. `_is_error_response` is used by tool_efficiency to detect failed tool calls when `penalize_errors=true`. It checks the tool response output for common error markers (`error`, `failed`, `exception`, `traceback`) and the `status` field. This is a heuristic since there's no standardized error field in the current tool response format. If the SDK adds a formal error/status field to tool responses in the future, we could use that instead.
Fixed in 4e9899d: removed the text-based heuristic entirely. `_is_error_response` now only checks the structured `status` field for `error`/`failed`/`failure`. No more false positives from output text.
```python
if total == 0:
    scores.append(1.0)  # No tools needed = perfectly efficient
    details_items.append(f"{inv.invocation_id}: no tool calls (score: 1.0)")
    continue
```
Not sure if we should return a perfect score here; many times zero tool calls means a failure. We also have tool_coverage to check for minimum tool usage.
Maybe we should make this configurable?
Fair point: zero tool calls often means the agent hallucinated an answer instead of using its tools. Returning 1.0 here is misleading.
I'd suggest adding a `min_tool_calls` config (default 0 for backward compatibility). When set, zero calls score 0.0 instead of 1.0. And when `min_tool_calls=0` (explicitly "tools are optional"), zero calls still score 1.0.
```yaml
config:
  max_tool_calls: 15
  min_tool_calls: 1  # 0 = tools optional, >0 = penalize no-tool runs
```
This keeps tool_efficiency focused on efficiency while letting users opt into "tools are required". For strict "did the agent use tools at all" checks, `tool_coverage` is the right evaluator; they complement each other.
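The proposed zero-call handling could be sketched as follows (`min_tool_calls` and the return convention are assumptions from this discussion, not merged code):

```python
from typing import Optional

def zero_call_score(total_calls: int, min_tool_calls: int = 0) -> Optional[float]:
    """Score the no-tool-call case; None means 'score normally'.

    min_tool_calls == 0 keeps today's behavior (tools optional, perfect
    score); any positive value treats zero calls as a likely failure.
    """
    if total_calls > 0:
        return None
    return 1.0 if min_tool_calls == 0 else 0.0
```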
```python
weighted_total = (input_t * weight_input) + (output_t * weight_output)
weighted_budget = max_tokens * 1.0  # Budget applies to weighted total

score = max(0.0, min(1.0, 1.0 - (weighted_total / weighted_budget)))
```
Maybe we should go with separate max_input and max_output, so the max behaves like an actual maximum instead of a weighted sum?
Good point: separate `max_input_tokens` and `max_output_tokens` would be clearer and more intuitive. The weighted approach tried to capture that input tokens cost less than output tokens (prefill vs generation), but expressing it as two separate budgets is simpler to reason about:
```yaml
config:
  max_input_tokens: 150000
  max_output_tokens: 50000
```
Score becomes `min(input_score, output_score)`, where each is `1.0 - (actual / max)` clamped to [0, 1]. An agent that blows either budget gets penalized.
This also aligns better with how LLM providers price their APIs (separate input/output rates). I'll rework the evaluator. Should I keep `max_tokens` as a single-budget fallback for backwards compatibility, or go clean with only `max_input_tokens` / `max_output_tokens`?
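The min-of-budgets scoring described above could be sketched like this (function and parameter names are illustrative):

```python
def token_efficiency_score(input_tokens: int, output_tokens: int,
                           max_input_tokens: int, max_output_tokens: int) -> float:
    """min(input_score, output_score): blowing either budget drags the
    overall score down, mirroring separate input/output pricing."""
    def budget_score(actual: int, budget: int) -> float:
        # 1.0 - (actual / max), clamped to [0, 1]
        return max(0.0, min(1.0, 1.0 - actual / budget))

    return min(budget_score(input_tokens, max_input_tokens),
               budget_score(output_tokens, max_output_tokens))
```

Budgets are assumed positive here; a zero-budget guard would live upstream.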
```python
for call in tool_calls:
    sig = _call_signature(call)
    seen_signatures[sig] = seen_signatures.get(sig, 0) + 1
```
Can we move this into the conditional below to avoid unnecessary work?
Good call — no need to compute signatures at all when penalize_duplicates is disabled. Will move the signature counting inside the conditional. Fixed in next push.
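The restructuring being agreed on here could look roughly like this (`_call_signature` is simplified for illustration):

```python
def _call_signature(call: dict) -> str:
    # Tool name plus sorted args identifies a duplicate call (simplified)
    return f"{call.get('name', '')}{sorted(call.get('args', {}).items())}"

def count_duplicate_calls(tool_calls: list, penalize_duplicates: bool) -> int:
    """Skip signature hashing entirely when duplicate detection is off."""
    if not penalize_duplicates:
        return 0
    seen: dict = {}
    for call in tool_calls:
        sig = _call_signature(call)
        seen[sig] = seen.get(sig, 0) + 1
    # Every occurrence beyond the first of a signature is a duplicate
    return sum(n - 1 for n in seen.values() if n > 1)
```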
… trim comments

- Fix .gitignore: separate .venv/ and __pycache__/ on own lines
- token_efficiency: remove unused performance_budget block, return NOT_EVALUATED when no token data available
- time_efficiency: return NOT_EVALUATED when no duration data
- All evaluators: trim verbose comments, cleaner code
- All pass validate_evaluator.py
Thanks @krisztianfekete, all addressed in the latest commit.

On the SDK/Invocation model question: I'd suggest extending the SDK's Invocation model with an optional `performance_metrics: Optional[dict]` field, a simple dict that the trace loader populates from OTel span attributes. This keeps it flexible (dict, not rigid schema) so different trace sources can pass whatever performance data they extract, and evaluators pick what they need.

The tool_efficiency evaluator works fine with the existing model since it only uses `intermediate_steps.tool_calls`, which is already standard. Happy to open a separate issue/PR on the SDK side if that helps.
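A sketch of what that optional field might look like on the SDK side (the surrounding model fields are elided and the shape is an assumption, not the shipped SDK API):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Invocation:
    """Existing model fields elided; only the proposed addition is shown."""
    invocation_id: str
    # Free-form dict the trace loader fills from OTel span attributes,
    # e.g. {"input_tokens": ..., "output_tokens": ..., "duration_s": ...}
    performance_metrics: Optional[dict[str, Any]] = None
```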
…ld only: `_is_error_response` now only checks the structured `status` field (`error`/`failed`/`failure`) instead of scanning output text for substrings like "error", which would false-positive on legitimate outputs.
…ing into conditional

- Add min_tool_calls config (default 0): when >0, zero tool calls scores 0.0 instead of 1.0 (zero calls often means a hallucinated answer)
- Move signature computation inside the penalize_duplicates conditional to avoid unnecessary work when duplicate detection is disabled
- Complements tool_coverage for strict tool-usage requirements
```yaml
@@ -0,0 +1,6 @@
name: time_efficiency
description: Scores how quickly the agent resolved relative to a time budget
language: python
```
Can you remove this file `evaluators/bertscore/__pycache__/bertscore.cpython-314.pyc`?
Good catch! This file was already removed in commit 08b0905. The .gitignore also includes `__pycache__/` so it won't be accidentally committed again.
@henrikrexed the SDK changes required to make this work have been merged and released. Can you please take another look, incorporate the changes, and test manually to ensure everything works e2e?
- Remove getattr/dict-access fallbacks in token_efficiency and time_efficiency now that SDK 0.1.1 has performance_metrics as a proper field on InvocationData
- Replace weighted max_tokens with separate max_input_tokens and max_output_tokens config (score = min of both), per review feedback
- All three evaluators tested e2e with SDK 0.1.1

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@krisztianfekete Updated to use the new SDK (0.1.1). Changes in commit c5d1684 cover token_efficiency, time_efficiency, and tool_efficiency.
krisztianfekete
left a comment
Can you please share some example results from your manual e2e testing?
```python
input_t = perf.get("input_tokens") or perf.get("prompt_tokens")
output_t = perf.get("output_tokens") or perf.get("completion_tokens")
```
Here, `or` will drop zero token values: a reported `0` is falsy, so it silently falls through to the fallback key.
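One way to fix it is a None-aware lookup (the helper name is hypothetical):

```python
def first_present(perf: dict, *keys: str):
    """Return the first key whose value is not None, preserving zeros.

    `perf.get(a) or perf.get(b)` treats a legitimate 0 as missing and
    falls through to the fallback key; an explicit None check does not.
    """
    for key in keys:
        value = perf.get(key)
        if value is not None:
            return value
    return None
```

With this, `input_t = first_present(perf, "input_tokens", "prompt_tokens")` keeps a reported zero instead of substituting the fallback.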
```python
    continue

has_data = True
score = max(0.0, min(1.0, 1.0 - (duration / max_duration)))
```
Please add a guard against 0 values here and in tool_efficiency.
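A guarded version of the formula above might look like this sketch (returning None so the caller can map it to NOT_EVALUATED; not the merged code):

```python
from typing import Optional

def time_efficiency_score(duration: float, max_duration: float) -> Optional[float]:
    """A non-positive budget can't be scored; signal that to the caller
    instead of dividing by zero."""
    if max_duration <= 0:
        return None
    return max(0.0, min(1.0, 1.0 - duration / max_duration))
```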
```python
def _call_signature(call) -> str:
    name = call.get("name", "") if isinstance(call, dict) else getattr(call, "name", "")
```
Can you please just use attribute access to match the codebase conventions?
Please return `NOT_EVALUATED` when it makes sense, to keep it consistent with other evaluators.
```python
)

overall = sum(scores) / len(scores) if scores else 0.0
return EvalResult(score=overall, per_invocation_scores=scores, details={"time_details": details_items})
```
Please use `issues` for consistency.
Summary
Three built-in performance evaluators that score agents automatically from trace data — no human in the loop, no LLM judge needed. Closes #6
Evaluators
token_efficiency
Scores how efficiently the agent used tokens relative to a budget.
• Extracts input_tokens + output_tokens from performance_metrics when available
• Weighted scoring (default: 70% input, 30% output — input tokens are the cost driver)
• Falls back to tool-call-based estimation when no trace data is present
• Score: `1.0 - (weighted_total / budget)`, clamped to [0, 1]
tool_efficiency
Scores whether the agent used tools effectively — penalizes waste.
• Detects duplicate calls (same tool + same args called twice)
• Detects error calls (tool responses with error indicators)
• Applies budget overrun penalty when total calls exceed max_tool_calls
• Score: (useful_calls / total_calls) × budget_factor
time_efficiency
Scores how quickly the agent resolved relative to a time budget.
• Extracts duration_s from performance_metrics
• Score: `1.0 - (actual / budget)`, clamped to [0, 1]
• Returns neutral score (0.5) when no timing data is available
Config
```yaml
metrics:
  - name: token_efficiency
    type: remote
    source: github
    ref: evaluators/token_efficiency/token_efficiency.py
    threshold: 0.7
    config:
      max_tokens: 200000
      weight_input: 0.7
      weight_output: 0.3
  - name: tool_efficiency
    type: remote
    source: github
    ref: evaluators/tool_efficiency/tool_efficiency.py
    threshold: 0.7
    config:
      max_tool_calls: 15
      penalize_duplicates: true
      penalize_errors: true
  - name: time_efficiency
    type: remote
    source: github
    ref: evaluators/time_efficiency/time_efficiency.py
    threshold: 0.7
    config:
      max_duration_s: 120
```
Why it matters
From the k8s AI agent benchmark, token usage varied 8x across agents for the same task (185K vs 1.6M tokens). These evaluators catch performance regressions in CI/CD without requiring LLM judges.
Testing
All 3 evaluators pass `validate_evaluator.py`:
• ✅ Manifest validation
• ✅ Python syntax + SDK imports
• ✅ Smoke run with standard test input
• ✅ Tested with realistic kagent-style tool call patterns