
feat: add token_efficiency, tool_efficiency, time_efficiency evaluators #7

Open
henrikrexed wants to merge 5 commits into agentevals-dev:main from henrikrexed:feat/performance-evaluators

Conversation

@henrikrexed

Summary

Three built-in performance evaluators that score agents automatically from trace data — no human in the loop, no LLM judge needed. Closes #6

Evaluators

`token_efficiency`

Scores how efficiently the agent used tokens relative to a budget.
- Extracts `input_tokens` + `output_tokens` from `performance_metrics` when available
- Weighted scoring (default: 70% input, 30% output — input tokens are the cost driver)
- Falls back to tool-call-based estimation when no trace data is present
- Score: `1.0 - (weighted_total / budget)`, clamped to [0, 1]
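The scoring described above can be sketched as follows. This is a minimal illustration of the formula in the PR description; the function name is hypothetical and the defaults mirror the sample config:

```python
def token_efficiency_score(input_tokens: int, output_tokens: int,
                           max_tokens: int = 200_000,
                           weight_input: float = 0.7,
                           weight_output: float = 0.3) -> float:
    """1.0 = no tokens used; 0.0 = weighted usage at or over budget."""
    weighted_total = input_tokens * weight_input + output_tokens * weight_output
    # Clamp so budget overruns never produce negative scores
    return max(0.0, min(1.0, 1.0 - weighted_total / max_tokens))
```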

`tool_efficiency`

Scores whether the agent used tools effectively — penalizes waste.
- Detects duplicate calls (same tool + same args called twice)
- Detects error calls (tool responses with error indicators)
- Applies a budget overrun penalty when total calls exceed `max_tool_calls`
- Score: `(useful_calls / total_calls) × budget_factor`
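A rough sketch of this scoring. The exact shape of `budget_factor` is not spelled out in the PR description, so the proportional-shrink penalty below is an assumption for illustration:

```python
def tool_efficiency_score(total_calls: int, duplicate_calls: int = 0,
                          error_calls: int = 0, max_tool_calls: int = 15) -> float:
    """useful_calls / total_calls, scaled down when the call budget is exceeded."""
    if total_calls == 0:
        return 1.0  # no tools needed (made configurable later in this review)
    useful = max(0, total_calls - duplicate_calls - error_calls)
    # Assumed penalty shape: shrink proportionally once total exceeds the budget
    budget_factor = min(1.0, max_tool_calls / total_calls)
    return (useful / total_calls) * budget_factor
```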

`time_efficiency`

Scores how quickly the agent resolved relative to a time budget.
- Extracts `duration_s` from `performance_metrics`
- Score: `1.0 - (actual / budget)`, clamped to [0, 1]
- Returns a neutral score (0.5) when no timing data is available
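Sketched minimally (function name hypothetical; default budget from the sample config; note the neutral-score fallback was later changed to NOT_EVALUATED during review):

```python
def time_efficiency_score(perf, max_duration_s: float = 120.0) -> float:
    """Score resolution time against a budget; neutral 0.5 without timing data."""
    duration = (perf or {}).get("duration_s")
    if duration is None:
        return 0.5  # neutral score when no timing data is available
    return max(0.0, min(1.0, 1.0 - duration / max_duration_s))
```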

Config

```yaml
metrics:
  - name: token_efficiency
    type: remote
    source: github
    ref: evaluators/token_efficiency/token_efficiency.py
    threshold: 0.7
    config:
      max_tokens: 200000
      weight_input: 0.7
      weight_output: 0.3

  - name: tool_efficiency
    type: remote
    source: github
    ref: evaluators/tool_efficiency/tool_efficiency.py
    threshold: 0.7
    config:
      max_tool_calls: 15
      penalize_duplicates: true
      penalize_errors: true

  - name: time_efficiency
    type: remote
    source: github
    ref: evaluators/time_efficiency/time_efficiency.py
    threshold: 0.7
    config:
      max_duration_s: 120
```

Why it matters

From the k8s AI agent benchmark, token usage varied 8x across agents for the same task (185K vs 1.6M tokens). These evaluators catch performance regressions in CI/CD without requiring LLM judges.

Testing

All 3 evaluators pass `validate_evaluator.py`:
- ✅ Manifest validation
- ✅ Python syntax + SDK imports
- ✅ Smoke run with standard test input
- ✅ Tested with realistic kagent-style tool call patterns

…rs (issue agentevals-dev#6)

Three built-in performance evaluators that score agents from trace data:

- token_efficiency: scores token usage vs budget (weighted input/output)
- tool_efficiency: penalizes duplicate calls, errors, and budget overruns
- time_efficiency: scores resolution time vs budget

All evaluators:
- Follow the @evaluator SDK pattern (stdin/stdout JSON protocol)
- Handle missing performance_metrics gracefully (neutral scores)
- Support per-invocation scoring
- Pass validate_evaluator.py smoke tests

Closes agentevals-dev#6
Collaborator

@krisztianfekete krisztianfekete left a comment


Thank you for the PR, added some comments!

The most important question to solve is how to handle evals based on (performance) data that is not part of the original Invocation model.
@peterj, should we extend the SDK to enable passing these through optionally?

Also, can you please
- follow existing conventions and return `EvalStatus.NOT_EVALUATED` when invocations lack their required metrics?
- remove overly verbose comments

.gitignore Outdated
```diff
@@ -1 +1 @@
-.venv/
\ No newline at end of file
+.venv/evaluators/__pycache__
```
Collaborator

I guess you meant this to be two lines, to ignore `__pycache__`?

Author


Fixed in 8371bc0 — `.venv/` and `__pycache__/` are now on separate lines.

Comment on lines +35 to +44
```python
# Check performance_budget on invocation (eval_set integration)
budget = getattr(inv, "performance_budget", None)
if budget is None and hasattr(inv, "__getitem__"):
    try:
        budget = inv["performance_budget"]
    except (KeyError, TypeError):
        pass

# No token data available
return None
```
Collaborator


Can you confirm that this is dead code?

Author


Confirmed, it was dead code. The `performance_budget` block extracted the value but never used it — removed in 8371bc0.

```python
details_items: list[str] = []

for inv in input.invocations:
    tool_calls = inv.intermediate_steps.tool_calls if inv.intermediate_steps else []
```
Collaborator


This is the first time we've used fields that aren't part of the standard ADK Invocation format; we'll have to think a bit about how to go about these.

```python
output = response.get("output", "") if isinstance(response, dict) else getattr(response, "output", "")
output_str = str(output).lower()
# Check common error indicators
if any(marker in output_str for marker in ["error", "failed", "exception", "traceback"]):
```
Collaborator


I think we have to be more sophisticated here, as you can have these substrings in perfectly fine outputs as well.

Collaborator


This is still relevant, right?

Author


Yes, still relevant. `_is_error_response` is used by `tool_efficiency` to detect failed tool calls when `penalize_errors=true`. It checks the tool response output for common error markers (`error`, `failed`, `exception`, `traceback`) and the `status` field. This is a heuristic, since there's no standardized error field in the current tool response format. If the SDK adds a formal error/status field to tool responses in the future, we could use that instead.

Author


Fixed in 4e9899d — removed the text-based heuristic entirely. `_is_error_response` now only checks the structured `status` field for `error`/`failed`/`failure`. No more false positives from output text.

Comment on lines +56 to +59
```python
if total == 0:
    scores.append(1.0)  # No tools needed = perfectly efficient
    details_items.append(f"{inv.invocation_id}: no tool calls (score: 1.0)")
    continue
```
Collaborator


Not sure if we should return a perfect score here. Many times zero tool calls means a failure. We also have `tool_coverage` to check for minimum tool usage.

Maybe we should make this configurable?

Author


Fair point — zero tool calls often means the agent hallucinated an answer instead of using its tools. Returning 1.0 here is misleading.

I'd suggest adding a `min_tool_calls` config (default 0 for backward compat). When set, zero calls scores 0.0 instead of 1.0. And when `min_tool_calls=0` (explicitly "tools are optional"), zero calls still scores 1.0.

```yaml
config:
  max_tool_calls: 15
  min_tool_calls: 1  # 0 = tools optional, >0 = penalize no-tool runs
```

This keeps `tool_efficiency` focused on efficiency while letting users opt into "tools are required". For strict "did the agent use tools at all" checks, `tool_coverage` is the right evaluator — they complement each other.
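The proposed zero-call handling reduces to a tiny rule (function name hypothetical, for illustration only):

```python
def zero_call_score(min_tool_calls: int = 0) -> float:
    """Proposed score for an invocation that made zero tool calls."""
    # min_tool_calls == 0 keeps today's behavior: tools are optional
    return 1.0 if min_tool_calls == 0 else 0.0
```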

Comment on lines +80 to +83
```python
weighted_total = (input_t * weight_input) + (output_t * weight_output)
weighted_budget = max_tokens * 1.0  # Budget applies to weighted total

score = max(0.0, min(1.0, 1.0 - (weighted_total / weighted_budget)))
```
Collaborator


Maybe we should go with a separate max_input and max_output so max will behave like maximum instead of weighted sum?

Author


Good point — separate `max_input_tokens` and `max_output_tokens` would be clearer and more intuitive. The weighted approach tried to capture that input tokens cost less than output tokens (prefill vs generation), but expressing it as two separate budgets is simpler to reason about:

```yaml
config:
  max_input_tokens: 150000
  max_output_tokens: 50000
```

Score becomes: `min(input_score, output_score)` where each is `1.0 - (actual / max)`, clamped to [0, 1]. An agent that blows either budget gets penalized.

This also aligns better with how LLM providers price their APIs (separate input/output rates). I'll rework the evaluator. Should I keep `max_tokens` as a single-budget fallback for backwards compatibility, or go clean with only `max_input_tokens` / `max_output_tokens`?
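The reworked scoring described above, as a minimal sketch (function name hypothetical; defaults from the snippet above):

```python
def token_score(input_tokens: int, output_tokens: int,
                max_input_tokens: int = 150_000,
                max_output_tokens: int = 50_000) -> float:
    """min of the per-budget scores: blowing either budget penalizes the result."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))
    input_score = clamp(1.0 - input_tokens / max_input_tokens)
    output_score = clamp(1.0 - output_tokens / max_output_tokens)
    return min(input_score, output_score)
```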

Comment on lines +64 to +66
```python
for call in tool_calls:
    sig = _call_signature(call)
    seen_signatures[sig] = seen_signatures.get(sig, 0) + 1
```
Collaborator


Can we move this into the conditional below to avoid unnecessary work?

Author


Good call — no need to compute signatures at all when `penalize_duplicates` is disabled. Will move the signature counting inside the conditional. Fixed in next push.
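The restructuring could look like this. The signature format is hypothetical (the real `_call_signature` helper is not shown in full in this thread); the point is only that hashing is skipped when detection is off:

```python
def count_duplicates(tool_calls, penalize_duplicates: bool) -> int:
    """Skip signature hashing entirely when duplicate detection is disabled."""
    if not penalize_duplicates:
        return 0
    seen = {}
    for call in tool_calls:
        # Hypothetical signature: tool name plus sorted args
        sig = f"{call.get('name', '')}:{sorted(call.get('args', {}).items())}"
        seen[sig] = seen.get(sig, 0) + 1
    # Every repeat beyond the first occurrence counts as a duplicate
    return sum(n - 1 for n in seen.values())
```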

@krisztianfekete krisztianfekete requested a review from peterj March 30, 2026 13:05
… trim comments

- Fix .gitignore: separate .venv/ and __pycache__/ on own lines
- token_efficiency: remove unused performance_budget block, return
  NOT_EVALUATED when no token data available
- time_efficiency: return NOT_EVALUATED when no duration data
- All evaluators: trim verbose comments, cleaner code
- All pass validate_evaluator.py
@henrikrexed
Author

Thanks @krisztianfekete — all addressed in the latest commit:
- ✅ `.gitignore` fixed (separate lines)
- ✅ Dead `performance_budget` code removed from `token_efficiency`
- ✅ `EvalStatus.NOT_EVALUATED` returned when invocations lack required metrics (token data, duration)
- ✅ Verbose comments trimmed across all 3 evaluators

On the SDK/Invocation model question:
Agreed, this is the key design decision. Currently `token_efficiency` and `time_efficiency` look for `performance_metrics` on the invocation (via `getattr` fallback), which works but isn't part of the formal `Invocation` schema.

I'd suggest extending the SDK's `Invocation` model with an optional `performance_metrics: Optional[dict]` field — a simple dict that the trace loader populates from OTel span attributes:

```python
class Invocation(BaseModel):
    # ... existing fields ...
    performance_metrics: Optional[dict] = None  # e.g. {"input_tokens": 1234, "output_tokens": 567, "duration_s": 12.3}
```

This keeps it flexible (dict, not rigid schema) so different trace sources can pass whatever performance data they extract, and evaluators pick what they need. The `tool_efficiency` evaluator works fine with the existing model since it only uses `intermediate_steps.tool_calls`, which is already standard.

Happy to open a separate issue/PR on the SDK side if that helps.
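A sketch of how a trace loader might populate that dict. The `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` attribute names come from the OTel GenAI semantic conventions; the function name and the duration handling are assumptions, since this PR doesn't specify the loader:

```python
def performance_metrics_from_span(attrs: dict, duration_s=None):
    """Map OTel GenAI span attributes onto the proposed performance_metrics dict.

    Other trace sources would map their own keys here; evaluators pick what
    they need and ignore the rest.
    """
    metrics = {}
    if "gen_ai.usage.input_tokens" in attrs:
        metrics["input_tokens"] = attrs["gen_ai.usage.input_tokens"]
    if "gen_ai.usage.output_tokens" in attrs:
        metrics["output_tokens"] = attrs["gen_ai.usage.output_tokens"]
    if duration_s is not None:
        metrics["duration_s"] = duration_s  # typically span end minus start
    return metrics or None  # None signals "no performance data"
```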

…ld only

_is_error_response now only checks the structured status field (error/failed/failure)
instead of scanning output text for substrings like "error" which would
false-positive on legitimate outputs.
…ing into conditional

- Add min_tool_calls config (default 0): when >0, zero tool calls scores 0.0
  instead of 1.0 (zero calls often means hallucinated answer)
- Move signature computation inside penalize_duplicates conditional to
  avoid unnecessary work when duplicate detection is disabled
- Complements tool_coverage for strict tool-usage requirements
@@ -0,0 +1,6 @@
```yaml
name: time_efficiency
description: Scores how quickly the agent resolved relative to a time budget
language: python
```
Collaborator


can you remove this file `evaluators/bertscore/__pycache__/bertscore.cpython-314.pyc`?

Author


Good catch! This file was already removed in commit 08b0905. The .gitignore also includes `__pycache__/` so it won't be accidentally committed again.

@krisztianfekete
Collaborator

@henrikrexed the SDK changes required to make this work have been merged and released. Can you please take another look, incorporate the changes, and test manually to ensure everything works e2e?

- Remove getattr/dict-access fallbacks in token_efficiency and
  time_efficiency now that SDK 0.1.1 has performance_metrics as a
  proper field on InvocationData
- Replace weighted max_tokens with separate max_input_tokens and
  max_output_tokens config (score = min of both), per review feedback
- All three evaluators tested e2e with SDK 0.1.1

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@henrikrexed
Author

@krisztianfekete Updated to use the new SDK (0.1.1). Changes in commit c5d1684:

token_efficiency:
- Now uses `inv.performance_metrics` directly — removed the getattr/dict-access fallback since the SDK has `performance_metrics` as a proper Pydantic field
- Replaced weighted `max_tokens` with separate `max_input_tokens` (default 150k) and `max_output_tokens` (default 50k) per your review feedback
- Score = `min(input_score, output_score)` — blowing either budget penalizes the score

time_efficiency:
- Same cleanup — direct `inv.performance_metrics` access, removed fallback hack

tool_efficiency:
- No changes needed (already uses standard SDK fields `intermediate_steps.tool_calls` / `tool_responses`)

All three evaluators tested e2e against SDK 0.1.1. Ready for review.

Collaborator

@krisztianfekete krisztianfekete left a comment


Can you please share some example results from your manual e2e testing?

Comment on lines +17 to +18
```python
input_t = perf.get("input_tokens") or perf.get("prompt_tokens")
output_t = perf.get("output_tokens") or perf.get("completion_tokens")
```
Collaborator


Here, `or` will drop zero token values.
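One way to avoid the falsy-zero pitfall (helper name hypothetical, shown only to illustrate the review point):

```python
def first_present(perf: dict, *keys):
    """Return the value of the first key that is present, even when it is 0."""
    for key in keys:
        if perf.get(key) is not None:
            return perf[key]
    return None

perf = {"input_tokens": 0, "prompt_tokens": 99}
# `or` falls through on the falsy 0 and wrongly reports 99:
assert (perf.get("input_tokens") or perf.get("prompt_tokens")) == 99
# An explicit None check keeps the real zero:
assert first_present(perf, "input_tokens", "prompt_tokens") == 0
```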

```python
continue

has_data = True
score = max(0.0, min(1.0, 1.0 - (duration / max_duration)))
```
Collaborator


Please add a guard against 0 values here and in tool_efficiency.
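A sketch of the requested guard (function name hypothetical): with `max_duration <= 0` the division would either raise `ZeroDivisionError` or produce a nonsensical score, so bail out first.

```python
def time_score(duration: float, max_duration: float):
    """Guard the budget before dividing; None maps to NOT_EVALUATED upstream."""
    if max_duration <= 0:
        return None  # invalid or missing budget
    return max(0.0, min(1.0, 1.0 - duration / max_duration))
```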



```python
def _call_signature(call) -> str:
    name = call.get("name", "") if isinstance(call, dict) else getattr(call, "name", "")
```
Collaborator


Can you please just use attribute access to match the codebase conventions?

Collaborator


Please return `NOT_EVALUATED` when it makes sense, to keep it consistent with the other evaluators.

```python
)

overall = sum(scores) / len(scores) if scores else 0.0
return EvalResult(score=overall, per_invocation_scores=scores, details={"time_details": details_items})
```
Collaborator


Please use `issues` for consistency.


Development

Successfully merging this pull request may close these issues.

Feature: performance evaluators for token efficiency, tool efficiency, and time-to-resolution
