
Feature: efficiency evaluator for carbon-aware agent evaluation #8

@henrikrexed

Description


agentevals now has performance evaluators for tokens, tools, and time (PR #7). But none of them measure the environmental cost of an agent's work.

AI agents are heavy compute consumers — each LLM call burns GPU cycles, each tool call hits APIs, and inefficient agents multiply this waste. As AI agent adoption grows, we need a way to evaluate and gate deployments based on energy efficiency, not just correctness.

Teams building responsible AI need to answer: "Is this agent getting greener over time, or are we regressing?"

Proposed solution

Add an `energy_efficiency` evaluator that scores how energy-efficient an agent run was, using a compute cost model derived from trace data.

Scoring approach

Energy can't be measured directly from traces, but we can estimate relative compute cost from observable signals:

```
energy_score = 1.0 - (estimated_energy / energy_budget)
```

Where `estimated_energy` is calculated from:

- **Total tokens (input + output)**
  - Weight: High
  - Rationale: direct proxy for GPU compute time
- **Number of LLM calls**
  - Weight: Medium
  - Rationale: each call has fixed overhead (network, scheduling, KV cache init)
- **Model tier (small/medium/large)**
  - Weight: High
  - Rationale: a 70B model uses roughly 10x the energy of a 7B model per token
- **Tool calls (external APIs)**
  - Weight: Low
  - Rationale: each HTTP call adds network plus server-side compute
- **Total duration**
  - Weight: Low
  - Rationale: longer runs mean more idle compute reservation
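As a sketch, the signals above might combine into a single estimate like this. Function names, default weights, and the linear cost model are all illustrative assumptions here, not a settled API:

```python
# Illustrative sketch only: combine the five trace signals into a
# relative energy estimate. All constants and names are assumptions.

TIER_MULTIPLIER = {"small": 1.0, "medium": 5.0, "large": 15.0}

def estimate_energy(input_tokens, output_tokens, llm_calls, tool_calls,
                    duration_s, model_tier="medium",
                    cost_per_1k_input=0.5, cost_per_1k_output=1.5,
                    cost_per_llm_call=0.1, cost_per_tool_call=0.05,
                    cost_per_second=0.001):
    # Token compute scales with model tier (bigger model, more GPU work).
    token_compute = (input_tokens / 1000 * cost_per_1k_input
                     + output_tokens / 1000 * cost_per_1k_output)
    token_compute *= TIER_MULTIPLIER[model_tier]
    # Fixed per-call overheads: network, scheduling, KV cache init.
    call_overhead = (llm_calls * cost_per_llm_call
                     + tool_calls * cost_per_tool_call)
    # Idle compute reservation grows with wall-clock duration.
    idle = duration_s * cost_per_second
    return token_compute + call_overhead + idle

def energy_score(estimated_energy, energy_budget=1.0):
    # Clamp so a badly over-budget run scores 0 rather than negative.
    return max(0.0, 1.0 - estimated_energy / energy_budget)
```

For example, `energy_score(estimate_energy(1000, 0, 1, 0, 0, model_tier="small"))` scores ≈ 0.4: 0.5 units of token compute plus 0.1 of call overhead against the default budget of 1.0.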

Config

```yaml
evaluators:
  - name: energy_efficiency
    type: builtin
    config:
      energy_budget: 1.0              # relative energy units (1.0 = baseline)
      model_tier: "large"             # small|medium|large|custom
      cost_per_1k_input_tokens: 0.5   # relative energy units
      cost_per_1k_output_tokens: 1.5  # output is ~3x more expensive (generation vs prefill)
      cost_per_llm_call: 0.1          # fixed overhead per call
      cost_per_tool_call: 0.05        # external API cost
      carbon_intensity_gco2_kwh: null # optional: grid carbon intensity for CO2 estimation
```
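A hypothetical sketch of how the evaluator could apply these defaults and validate user overrides. The key names come from the proposal above; `load_config` itself is illustrative:

```python
# Sketch: apply the proposed defaults and validate overrides.
# Key names mirror the proposal; this is not shipped code.
DEFAULTS = {
    "energy_budget": 1.0,
    "model_tier": "large",
    "cost_per_1k_input_tokens": 0.5,
    "cost_per_1k_output_tokens": 1.5,
    "cost_per_llm_call": 0.1,
    "cost_per_tool_call": 0.05,
    "carbon_intensity_gco2_kwh": None,
}
VALID_TIERS = {"small", "medium", "large", "custom"}

def load_config(user_cfg):
    cfg = {**DEFAULTS, **user_cfg}
    unknown = set(cfg) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    if cfg["model_tier"] not in VALID_TIERS:
        raise ValueError(f"model_tier must be one of {sorted(VALID_TIERS)}")
    if cfg["energy_budget"] <= 0:
        raise ValueError("energy_budget must be positive")
    return cfg
```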

Model tier energy multipliers

Based on published GPU benchmarks and inference costs:

- `small`
  - Examples: GPT-4o-mini, Claude Haiku, Llama 8B
  - Multiplier: 1x
- `medium`
  - Examples: GPT-4o, Claude Sonnet, Llama 70B
  - Multiplier: 5x
- `large`
  - Examples: GPT-4, Claude Opus, Llama 405B
  - Multiplier: 15x

These are rough but useful for relative comparison between agent versions.

Output

```json
{
  "score": 0.72,
  "details": {
    "estimated_energy_units": 0.28,
    "energy_budget": 1.0,
    "breakdown": {
      "token_compute": 0.18,
      "llm_call_overhead": 0.05,
      "tool_call_overhead": 0.03,
      "model_tier_multiplier": 5
    },
    "estimated_co2_grams": 2.4,
    "equivalent": "~1 Google search"
  }
}
```
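The score here follows directly from the formula above: `1.0 - 0.28 / 1.0 = 0.72`. The CO2 figure additionally requires converting relative energy units to kWh; a quick check with an assumed conversion factor (both constants below are illustrative, not part of the proposal):

```python
# Worked check of the example output above. WH_PER_ENERGY_UNIT and the
# grid intensity are assumed values chosen for illustration only.
estimated_energy_units = 0.28
energy_budget = 1.0
score = round(1.0 - estimated_energy_units / energy_budget, 2)
print(score)  # 0.72

WH_PER_ENERGY_UNIT = 20.0          # assumed: Wh per relative energy unit
carbon_intensity_gco2_kwh = 430.0  # assumed: mid-range grid intensity
co2_grams = (estimated_energy_units * WH_PER_ENERGY_UNIT / 1000
             * carbon_intensity_gco2_kwh)
print(round(co2_grams, 2))
```

With these assumed factors the estimate lands near the 2.4 g shown in the example; a real conversion would come from the `carbon_intensity_gco2_kwh` config value plus a measured unit-to-kWh mapping.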

CI/CD gating

```shell
agentevals run trace.json \
  --eval-set k8s-sre.json \
  -m energy_efficiency \
  --threshold 0.6
```

Teams can set energy budgets per task and fail builds when agents regress.
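The gating check the CLI would apply amounts to a simple comparison against the threshold. A minimal sketch, assuming the evaluator output shape shown above (the `gate` helper is hypothetical, not shipped code):

```python
# Illustrative sketch of threshold gating on the evaluator's output.
import sys

def gate(result, threshold=0.6):
    """Return 0 (pass) or 1 (fail) based on the evaluator's score."""
    score = result["score"]
    if score < threshold:
        print(f"energy efficiency {score:.2f} below threshold {threshold}",
              file=sys.stderr)
        return 1
    return 0

print(gate({"score": 0.72}))  # 0: passes the 0.6 threshold
print(gate({"score": 0.41}))  # 1: would fail the build
```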

Future extensions

- **Carbon intensity integration:** use real-time grid data (e.g., the Electricity Maps API) to convert energy estimates to actual CO2
- **Model-specific energy profiles:** allow custom Wh/token values from published model cards
- **Comparative scoring:** "This agent uses 3x less energy than the baseline for the same task"
- **Multi-run trending:** track energy efficiency over time across eval runs
- **Hardware-aware scoring:** factor in GPU type (A100 vs H100 vs inference chips)

Why it matters

- Duplicate tool calls and retry loops multiply energy waste
- As agents scale to millions of daily invocations, this waste becomes significant

Additional context

This evaluator complements `token_efficiency`, `tool_efficiency`, and `time_efficiency` by combining multiple signals into a single energy-aware score. It goes beyond counting tokens to model the actual shape of an agent's compute cost.
