
feat: add token_efficiency, tool_efficiency, time_efficiency evaluators #7

Open
henrikrexed wants to merge 5 commits into agentevals-dev:main from henrikrexed:feat/performance-evaluators

Conversation

@henrikrexed

Summary

Three built-in performance evaluators that score agents automatically from trace data — no human in the loop, no LLM judge needed. Closes #6

Evaluators

`token_efficiency`

Scores how efficiently the agent used tokens relative to a budget.
- Extracts `input_tokens` + `output_tokens` from `performance_metrics` when available
- Weighted scoring (default: 70% input, 30% output — input tokens are the cost driver)
- Falls back to tool-call-based estimation when no trace data is present
- Score: `1.0 - (weighted_total / budget)`, clamped to [0, 1]
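The scoring described above can be sketched as follows. This is a minimal illustration of the formula in the PR description; the function name is hypothetical and the defaults mirror the sample config:

```python
def token_efficiency_score(input_tokens: int, output_tokens: int,
                           max_tokens: int = 200_000,
                           weight_input: float = 0.7,
                           weight_output: float = 0.3) -> float:
    """1.0 = no tokens used; 0.0 = weighted usage at or over budget."""
    weighted_total = input_tokens * weight_input + output_tokens * weight_output
    # Clamp so budget overruns never produce negative scores
    return max(0.0, min(1.0, 1.0 - weighted_total / max_tokens))
```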

`tool_efficiency`

Scores whether the agent used tools effectively — penalizes waste.
- Detects duplicate calls (same tool + same args called twice)
- Detects error calls (tool responses with error indicators)
- Applies a budget overrun penalty when total calls exceed `max_tool_calls`
- Score: `(useful_calls / total_calls) × budget_factor`
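A rough sketch of this scoring. The exact shape of `budget_factor` is not spelled out in the PR description, so the proportional-shrink penalty below is an assumption for illustration:

```python
def tool_efficiency_score(total_calls: int, duplicate_calls: int = 0,
                          error_calls: int = 0, max_tool_calls: int = 15) -> float:
    """useful_calls / total_calls, scaled down when the call budget is exceeded."""
    if total_calls == 0:
        return 1.0  # no tools needed (made configurable later in this review)
    useful = max(0, total_calls - duplicate_calls - error_calls)
    # Assumed penalty shape: shrink proportionally once total exceeds the budget
    budget_factor = min(1.0, max_tool_calls / total_calls)
    return (useful / total_calls) * budget_factor
```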

`time_efficiency`

Scores how quickly the agent resolved relative to a time budget.
- Extracts `duration_s` from `performance_metrics`
- Score: `1.0 - (actual / budget)`, clamped to [0, 1]
- Returns a neutral score (0.5) when no timing data is available
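Sketched minimally (function name hypothetical; default budget from the sample config; note the neutral-score fallback was later changed to NOT_EVALUATED during review):

```python
def time_efficiency_score(perf, max_duration_s: float = 120.0) -> float:
    """Score resolution time against a budget; neutral 0.5 without timing data."""
    duration = (perf or {}).get("duration_s")
    if duration is None:
        return 0.5  # neutral score when no timing data is available
    return max(0.0, min(1.0, 1.0 - duration / max_duration_s))
```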

Config

```yaml
metrics:
  - name: token_efficiency
    type: remote
    source: github
    ref: evaluators/token_efficiency/token_efficiency.py
    threshold: 0.7
    config:
      max_tokens: 200000
      weight_input: 0.7
      weight_output: 0.3

  - name: tool_efficiency
    type: remote
    source: github
    ref: evaluators/tool_efficiency/tool_efficiency.py
    threshold: 0.7
    config:
      max_tool_calls: 15
      penalize_duplicates: true
      penalize_errors: true

  - name: time_efficiency
    type: remote
    source: github
    ref: evaluators/time_efficiency/time_efficiency.py
    threshold: 0.7
    config:
      max_duration_s: 120
```

Why it matters

From the k8s AI agent benchmark, token usage varied 8x across agents for the same task (185K vs 1.6M tokens). These evaluators catch performance regressions in CI/CD without requiring LLM judges.

Testing

All 3 evaluators pass `validate_evaluator.py`:
- ✅ Manifest validation
- ✅ Python syntax + SDK imports
- ✅ Smoke run with standard test input
- ✅ Tested with realistic kagent-style tool call patterns

…rs (issue agentevals-dev#6)

Three built-in performance evaluators that score agents from trace data:

- token_efficiency: scores token usage vs budget (weighted input/output)
- tool_efficiency: penalizes duplicate calls, errors, and budget overruns
- time_efficiency: scores resolution time vs budget

All evaluators:
- Follow the @evaluator SDK pattern (stdin/stdout JSON protocol)
- Handle missing performance_metrics gracefully (neutral scores)
- Support per-invocation scoring
- Pass validate_evaluator.py smoke tests

Closes agentevals-dev#6
Collaborator

@krisztianfekete krisztianfekete left a comment


Thank you for the PR, added some comments!

The most important question to solve is how to handle evals based on (performance) data that is not part of the original Invocation model.
@peterj, should we extend the SDK to enable passing these through optionally?

Also, can you please
- follow existing conventions and return `EvalStatus.NOT_EVALUATED` when invocations lack their required metrics?
- remove overly verbose comments

.gitignore Outdated
```diff
@@ -1 +1 @@
-.venv/
\ No newline at end of file
+.venv/evaluators/__pycache__
```
Collaborator

I guess you meant this to be two lines, to ignore `__pycache__`?

Author


Fixed in 8371bc0 — `.venv/` and `__pycache__/` are now on separate lines.

Comment on lines +35 to +44
```python
# Check performance_budget on invocation (eval_set integration)
budget = getattr(inv, "performance_budget", None)
if budget is None and hasattr(inv, "__getitem__"):
    try:
        budget = inv["performance_budget"]
    except (KeyError, TypeError):
        pass

# No token data available
return None
```
Collaborator


Can you confirm that this is dead code?

Author


Confirmed, it was dead code. The `performance_budget` block extracted the value but never used it — removed in 8371bc0.

```python
details_items: list[str] = []

for inv in input.invocations:
    tool_calls = inv.intermediate_steps.tool_calls if inv.intermediate_steps else []
```
Collaborator


This is the first time we've used fields that aren't part of the standard ADK Invocation format; we'll have to think a bit about how to go about these.

```python
output = response.get("output", "") if isinstance(response, dict) else getattr(response, "output", "")
output_str = str(output).lower()
# Check common error indicators
if any(marker in output_str for marker in ["error", "failed", "exception", "traceback"]):
```
Collaborator


I think we have to be more sophisticated here, as you can have these substrings in perfectly fine outputs as well.

Collaborator


This is still relevant, right?

Author


Yes, still relevant. `_is_error_response` is used by `tool_efficiency` to detect failed tool calls when `penalize_errors=true`. It checks the tool response output for common error markers (`error`, `failed`, `exception`, `traceback`) and the `status` field. This is a heuristic, since there's no standardized error field in the current tool response format. If the SDK adds a formal error/status field to tool responses in the future, we could use that instead.

Author


Fixed in 4e9899d — removed the text-based heuristic entirely. `_is_error_response` now only checks the structured `status` field for `error`/`failed`/`failure`. No more false positives from output text.

Comment on lines +56 to +59
```python
if total == 0:
    scores.append(1.0)  # No tools needed = perfectly efficient
    details_items.append(f"{inv.invocation_id}: no tool calls (score: 1.0)")
    continue
```
Collaborator


Not sure if we should return a perfect score here. Many times zero tool calls means a failure. We also have `tool_coverage` to check for minimum tool usage.

Maybe we should make this configurable?

Author


Fair point — zero tool calls often means the agent hallucinated an answer instead of using its tools. Returning 1.0 here is misleading.

I'd suggest adding a `min_tool_calls` config (default 0 for backward compat). When set, zero calls scores 0.0 instead of 1.0. And when `min_tool_calls=0` (explicitly "tools are optional"), zero calls still scores 1.0.

```yaml
config:
  max_tool_calls: 15
  min_tool_calls: 1  # 0 = tools optional, >0 = penalize no-tool runs
```

This keeps `tool_efficiency` focused on efficiency while letting users opt into "tools are required". For strict "did the agent use tools at all" checks, `tool_coverage` is the right evaluator — they complement each other.
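The proposed zero-call handling reduces to a tiny rule (function name hypothetical, for illustration only):

```python
def zero_call_score(min_tool_calls: int = 0) -> float:
    """Proposed score for an invocation that made zero tool calls."""
    # min_tool_calls == 0 keeps today's behavior: tools are optional
    return 1.0 if min_tool_calls == 0 else 0.0
```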

Comment on lines +80 to +83
```python
weighted_total = (input_t * weight_input) + (output_t * weight_output)
weighted_budget = max_tokens * 1.0  # Budget applies to weighted total

score = max(0.0, min(1.0, 1.0 - (weighted_total / weighted_budget)))
```
Collaborator


Maybe we should go with a separate max_input and max_output so max will behave like maximum instead of weighted sum?

Author


Good point — separate `max_input_tokens` and `max_output_tokens` would be clearer and more intuitive. The weighted approach tried to capture that input tokens cost less than output tokens (prefill vs generation), but expressing it as two separate budgets is simpler to reason about:

```yaml
config:
  max_input_tokens: 150000
  max_output_tokens: 50000
```

Score becomes: `min(input_score, output_score)` where each is `1.0 - (actual / max)`, clamped to [0, 1]. An agent that blows either budget gets penalized.

This also aligns better with how LLM providers price their APIs (separate input/output rates). I'll rework the evaluator. Should I keep `max_tokens` as a single-budget fallback for backwards compatibility, or go clean with only `max_input_tokens` / `max_output_tokens`?
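The reworked scoring described above, as a minimal sketch (function name hypothetical; defaults from the snippet above):

```python
def token_score(input_tokens: int, output_tokens: int,
                max_input_tokens: int = 150_000,
                max_output_tokens: int = 50_000) -> float:
    """min of the per-budget scores: blowing either budget penalizes the result."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))
    input_score = clamp(1.0 - input_tokens / max_input_tokens)
    output_score = clamp(1.0 - output_tokens / max_output_tokens)
    return min(input_score, output_score)
```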

Comment on lines +64 to +66
```python
for call in tool_calls:
    sig = _call_signature(call)
    seen_signatures[sig] = seen_signatures.get(sig, 0) + 1
```
Collaborator


Can we move this into the conditional below to avoid unnecessary work?

Author


Good call — no need to compute signatures at all when `penalize_duplicates` is disabled. Will move the signature counting inside the conditional. Fixed in next push.
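The restructuring could look like this. The signature format is hypothetical (the real `_call_signature` helper is not shown in full in this thread); the point is only that hashing is skipped when detection is off:

```python
def count_duplicates(tool_calls, penalize_duplicates: bool) -> int:
    """Skip signature hashing entirely when duplicate detection is disabled."""
    if not penalize_duplicates:
        return 0
    seen = {}
    for call in tool_calls:
        # Hypothetical signature: tool name plus sorted args
        sig = f"{call.get('name', '')}:{sorted(call.get('args', {}).items())}"
        seen[sig] = seen.get(sig, 0) + 1
    # Every repeat beyond the first occurrence counts as a duplicate
    return sum(n - 1 for n in seen.values())
```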

@krisztianfekete krisztianfekete requested a review from peterj March 30, 2026 13:05
… trim comments

- Fix .gitignore: separate .venv/ and __pycache__/ on own lines
- token_efficiency: remove unused performance_budget block, return
  NOT_EVALUATED when no token data available
- time_efficiency: return NOT_EVALUATED when no duration data
- All evaluators: trim verbose comments, cleaner code
- All pass validate_evaluator.py
@henrikrexed
Author

Thanks @krisztianfekete — all addressed in the latest commit:
- ✅ `.gitignore` fixed (separate lines)
- ✅ Dead `performance_budget` code removed from `token_efficiency`
- ✅ `EvalStatus.NOT_EVALUATED` returned when invocations lack required metrics (token data, duration)
- ✅ Verbose comments trimmed across all 3 evaluators

On the SDK/Invocation model question:
Agreed, this is the key design decision. Currently `token_efficiency` and `time_efficiency` look for `performance_metrics` on the invocation (via `getattr` fallback), which works but isn't part of the formal `Invocation` schema.

I'd suggest extending the SDK's `Invocation` model with an optional `performance_metrics: Optional[dict]` field — a simple dict that the trace loader populates from OTel span attributes:

```python
class Invocation(BaseModel):
    # ... existing fields ...
    performance_metrics: Optional[dict] = None  # e.g. {"input_tokens": 1234, "output_tokens": 567, "duration_s": 12.3}
```

This keeps it flexible (dict, not rigid schema) so different trace sources can pass whatever performance data they extract, and evaluators pick what they need. The `tool_efficiency` evaluator works fine with the existing model since it only uses `intermediate_steps.tool_calls`, which is already standard.

Happy to open a separate issue/PR on the SDK side if that helps.
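A sketch of how a trace loader might populate that dict. The `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` attribute names come from the OTel GenAI semantic conventions; the function name and the duration handling are assumptions, since this PR doesn't specify the loader:

```python
def performance_metrics_from_span(attrs: dict, duration_s=None):
    """Map OTel GenAI span attributes onto the proposed performance_metrics dict.

    Other trace sources would map their own keys here; evaluators pick what
    they need and ignore the rest.
    """
    metrics = {}
    if "gen_ai.usage.input_tokens" in attrs:
        metrics["input_tokens"] = attrs["gen_ai.usage.input_tokens"]
    if "gen_ai.usage.output_tokens" in attrs:
        metrics["output_tokens"] = attrs["gen_ai.usage.output_tokens"]
    if duration_s is not None:
        metrics["duration_s"] = duration_s  # typically span end minus start
    return metrics or None  # None signals "no performance data"
```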

…ld only

_is_error_response now only checks the structured status field (error/failed/failure)
instead of scanning output text for substrings like "error" which would
false-positive on legitimate outputs.
…ing into conditional

- Add min_tool_calls config (default 0): when >0, zero tool calls scores 0.0
  instead of 1.0 (zero calls often means hallucinated answer)
- Move signature computation inside penalize_duplicates conditional to
  avoid unnecessary work when duplicate detection is disabled
- Complements tool_coverage for strict tool-usage requirements
@@ -0,0 +1,6 @@
```yaml
name: time_efficiency
description: Scores how quickly the agent resolved relative to a time budget
language: python
```
Collaborator


can you remove this file `evaluators/bertscore/__pycache__/bertscore.cpython-314.pyc`?

Author


Good catch! This file was already removed in commit 08b0905. The .gitignore also includes `__pycache__/` so it won't be accidentally committed again.

@krisztianfekete
Collaborator

@henrikrexed the SDK changes required to make this work have been merged and released. Can you please take another look, incorporate the changes, and test manually to ensure everything works e2e?

- Remove getattr/dict-access fallbacks in token_efficiency and
  time_efficiency now that SDK 0.1.1 has performance_metrics as a
  proper field on InvocationData
- Replace weighted max_tokens with separate max_input_tokens and
  max_output_tokens config (score = min of both), per review feedback
- All three evaluators tested e2e with SDK 0.1.1

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@henrikrexed
Author

@krisztianfekete Updated to use the new SDK (0.1.1). Changes in commit c5d1684:

token_efficiency:
- Now uses `inv.performance_metrics` directly — removed the getattr/dict-access fallback since the SDK has `performance_metrics` as a proper Pydantic field
- Replaced weighted `max_tokens` with separate `max_input_tokens` (default 150k) and `max_output_tokens` (default 50k) per your review feedback
- Score = `min(input_score, output_score)` — blowing either budget penalizes the score

time_efficiency:
- Same cleanup — direct `inv.performance_metrics` access, removed fallback hack

tool_efficiency:
- No changes needed (already uses standard SDK fields `intermediate_steps.tool_calls` / `tool_responses`)

All three evaluators tested e2e against SDK 0.1.1. Ready for review.

Collaborator

@krisztianfekete krisztianfekete left a comment


Can you please share some example results from your manual e2e testing?

Comment on lines +17 to +18
```python
input_t = perf.get("input_tokens") or perf.get("prompt_tokens")
output_t = perf.get("output_tokens") or perf.get("completion_tokens")
```
Collaborator


Here, `or` will drop zero token values.
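One way to avoid the falsy-zero pitfall (helper name hypothetical, shown only to illustrate the review point):

```python
def first_present(perf: dict, *keys):
    """Return the value of the first key that is present, even when it is 0."""
    for key in keys:
        if perf.get(key) is not None:
            return perf[key]
    return None

perf = {"input_tokens": 0, "prompt_tokens": 99}
# `or` falls through on the falsy 0 and wrongly reports 99:
assert (perf.get("input_tokens") or perf.get("prompt_tokens")) == 99
# An explicit None check keeps the real zero:
assert first_present(perf, "input_tokens", "prompt_tokens") == 0
```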

```python
continue

has_data = True
score = max(0.0, min(1.0, 1.0 - (duration / max_duration)))
```
Collaborator


Please add a guard against 0 values here and in tool_efficiency.
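A sketch of the requested guard (function name hypothetical): with `max_duration <= 0` the division would either raise `ZeroDivisionError` or produce a nonsensical score, so bail out first.

```python
def time_score(duration: float, max_duration: float):
    """Guard the budget before dividing; None maps to NOT_EVALUATED upstream."""
    if max_duration <= 0:
        return None  # invalid or missing budget
    return max(0.0, min(1.0, 1.0 - duration / max_duration))
```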



```python
def _call_signature(call) -> str:
    name = call.get("name", "") if isinstance(call, dict) else getattr(call, "name", "")
```
Collaborator


Can you please just use attribute access to match the codebase conventions?

Collaborator


Please return `NOT_EVALUATED` when it makes sense, to keep it consistent with the other evaluators.

```python
)

overall = sum(scores) / len(scores) if scores else 0.0
return EvalResult(score=overall, per_invocation_scores=scores, details={"time_details": details_items})
```
Collaborator


Please use `issues` for consistency.


Development

Successfully merging this pull request may close these issues.

Feature: performance evaluators for token efficiency, tool efficiency, and time-to-resolution
