Offline scoring of saved agent trajectories against ground-truth scenarios.
The evaluation module follows the three-stage pattern used by SWE-bench, HELM, and τ-bench:
agent run → trajectory (run_id) → evaluate → reports/<run_id>.json
Re-scoring from saved trajectories is first-class: re-run with a different scorer or judge model without re-invoking the agent.
The vocabulary follows MLflow's evaluation split:
- Scenario: a ground-truth record on disk. Carries `id`, `text` (the utterance), `type`, `characteristic_form` (expected behaviour), and optional `scoring_method`.
- Trajectory: a per-run JSON file persisted by the agent runners when `AGENT_TRAJECTORY_DIR` is set. Carries `run_id`, `scenario_id`, `question`, `answer`, and per-turn detail.
- Scorer: a callable that takes `(scenario, answer, trajectory_text)` and returns a `ScorerResult`. Scorers fall into three families:
  - Code-Based: deterministic, no LLM. `static_json` is implemented for structured outputs such as JSON objects, arrays, Python-style dictionaries, tuple lists, and count-only answers. `exact_string_match` and `numeric_match` are skeleton only in this branch.
  - LLM-As-Judge: `llm_judge`. Six-criterion rubric, requires a LiteLLM-routable model passed via `--judge-model`.
  - Semantic-Score: similarity-based, no LLM call. Skeleton only in this branch.
- Evaluator: orchestrates a batch: loads scenarios + trajectories, joins on `scenario_id`, dispatches to scorers, aggregates results.
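
The dispatch step can be pictured as a name-to-callable lookup in which a scenario's `scoring_method` takes precedence over the CLI default. The sketch below is illustrative only: `REGISTRY`, `register`, and `pick_scorer` are hypothetical stand-ins, not the actual contents of `evaluation.scorers`.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for the scorer registry (assumption, not the real module).
REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str, fn: Callable[..., object]) -> None:
    """Make a scorer available under a registered name."""
    REGISTRY[name] = fn


def pick_scorer(scoring_method: Optional[str], default: str) -> Callable[..., object]:
    """A per-scenario scoring_method wins; otherwise use the default."""
    name = scoring_method or default
    if name not in REGISTRY:
        raise KeyError(f"no scorer registered under {name!r}")
    return REGISTRY[name]
```

Raising on an unknown name (rather than silently skipping the scenario) is a design choice of this sketch; the real evaluator may behave differently.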
A scenarios file can be a JSON list, a JSON object, or JSONL. Fields the scorers care about:
| Field | Used by | Notes |
|---|---|---|
| `id` | join | Coerced to string at load time |
| `text` | all | The utterance the agent answered |
| `type` | reporting | Scenario family (iot, tsfm, FMSR, …) |
| `characteristic_form` | `llm_judge`, `semantic_similarity`\* | Expected behaviour, free-form |
| `expected_answer` | `static_json`, `exact_string_match`, `numeric_match` | Structured answer, exact target string, or number |
| `scoring_method` | dispatch | Registered scorer name; overrides CLI default |
| `tolerance` | `numeric_match`\* | Optional relative + absolute tolerance |

\* Skeleton in this branch; see Available scorers.
Ground-truth files under groundtruth/ already match this schema —
they're a drop-in scenarios input.
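
As a rough illustration of accepting all three input shapes and coercing `id` to a string, a loader might look like the following. `load_scenarios` is a hypothetical helper written for this example; the real loader's name, return type, and error handling are not shown in this document.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_scenarios(path: Path) -> List[Dict[str, Any]]:
    """Read a JSON list, a single JSON object, or JSONL; coerce id to str."""
    text = path.read_text(encoding="utf-8").strip()
    try:
        data = json.loads(text)
        records = data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Not a single JSON document: treat it as JSONL, one object per line.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    for rec in records:
        rec["id"] = str(rec["id"])
    return records
```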
Trajectories are written by the observability layer to `AGENT_TRAJECTORY_DIR`, one JSON file per run. Fields the evaluator reads:
{
"run_id": "<uuid>",
"scenario_id": "<scenario id>",
"runner": "claude-agent" | "plan-execute" | …,
"model": "<provider/model>",
"question": "<utterance>",
"answer": "<final response>",
"trajectory": {…} // SDK Trajectory dict, or list[StepResult] for plan-execute
}
scenario_id is the primary join key. If scenario_id is missing or null, the loader may fall back to the trajectory filename stem. For generated trajectories where scenario_id contains a descriptive label, the evaluator can also fall back to run_id when it matches the scenario id.
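
The fallback chain described above could be sketched as follows. The function name, the order in which the fallbacks are tried, and the use of a set of known scenario ids are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Set


def resolve_scenario_id(
    traj: Dict[str, Any], path: Path, known_ids: Set[str]
) -> Optional[str]:
    """Return the scenario id a trajectory should join on, or None."""
    sid = traj.get("scenario_id")
    if sid is None:
        # Missing or null scenario_id: fall back to the filename stem.
        sid = path.stem
    sid = str(sid)
    if sid in known_ids:
        return sid
    # scenario_id may be a descriptive label; try run_id as a last resort.
    run_id = traj.get("run_id")
    if run_id is not None and str(run_id) in known_ids:
        return str(run_id)
    return None
```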
# 1. Persist trajectories under AGENT_TRAJECTORY_DIR
export AGENT_TRAJECTORY_DIR=$(pwd)/traces/trajectories
uv run claude-agent "List all failure modes of asset Chiller." --scenario-id 101
# 2. Score with LLM-As-Judge against the ground-truth file
uv run evaluate \
--trajectories traces/trajectories \
--scenarios groundtruth/101.json \
--scorer-default llm_judge \
--judge-model litellm_proxy/azure/gpt-5.4

Output:
Scenarios: 1 Passed: 1 Pass rate: 100.0%
By scenario type:
FMSR 1/1 (100.0%)
Operational metrics:
tokens_in_total: 7
tokens_out_total: 25
tool_calls_total: 1
duration_ms_p50: 14690.6
Reports written: reports/<run_id>.json (1 files)
Aggregate: reports/_aggregate.json
reports/
├── <run_id>.json # one ScenarioResult per trajectory
├── <run_id>.json
└── _aggregate.json # EvalReport: totals, by_scenario_type, ops rollup
Per-run file (reports/<run_id>.json):
{
"scenario_id": "101",
"scenario_type": "FMSR",
"run_id": "112c1b56-…",
"runner": "claude-agent",
"model": "litellm_proxy/aws/claude-opus-4-6",
"question": "List all failure modes of asset Chiller.",
"answer": "Here are the 7 failure modes for the Chiller asset: …",
"score": {
"scorer": "llm_judge",
"passed": true,
"score": 1.0,
"rationale": "",
"details": {
"task_completion": true,
"data_retrieval_accuracy": true,
"generalized_result_verification": true,
"agent_sequence_correct": true,
"clarity_and_justification": true,
"hallucinations": false,
"suggestions": ""
}
},
"ops": {
"turn_count": 2,
"tool_call_count": 1,
"unique_tools": ["get_failure_modes"],
"tokens_in": 7,
"tokens_out": 25,
"duration_ms": 14690.6,
"est_cost_usd": 0.001959
}
}

Aggregate (reports/_aggregate.json) is the full EvalReport:
{
"generated_at": "<iso8601>",
"runners": ["claude-agent"],
"models": ["litellm_proxy/aws/claude-opus-4-6"],
"totals": {
"scenarios": 1,
"scored": 1,
"passed": 1,
"pass_rate": 1.0
},
"by_scenario_type": {
"FMSR": {"total": 1, "passed": 1, "pass_rate": 1.0}
},
"ops": {
"tokens_in_total": 7,
"tokens_out_total": 25,
"tool_calls_total": 1,
"duration_ms_p50": 14690.6,
"duration_ms_p95": 14690.6,
"est_cost_usd_total": 0.001959
},
"results": [ /* one ScenarioResult per run, same shape as the per-run files */ ]
}

uv run evaluate \
--trajectories DIR_OR_FILE # required
--scenarios FILE [FILE ...] # required, one or more
[--reports-dir DIR] # default: reports/
[--scorer-default NAME] # default: llm_judge
[--judge-model MODEL_ID] # required when llm_judge runs
[-v]
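
For readers post-processing reports, the `totals` and `by_scenario_type` sections of `_aggregate.json` shown earlier can be recomputed from a list of per-run results. This is a minimal sketch that assumes each result is a dict shaped like the per-run files; `summarize` is a hypothetical helper, and it omits the `scored` count and the ops rollup.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Roll per-run results up into totals and per-type pass rates."""
    by_type: Dict[str, Dict[str, Any]] = defaultdict(lambda: {"total": 0, "passed": 0})
    for r in results:
        bucket = by_type[r["scenario_type"]]
        bucket["total"] += 1
        bucket["passed"] += int(r["score"]["passed"])
    for bucket in by_type.values():
        bucket["pass_rate"] = bucket["passed"] / bucket["total"]
    total = len(results)
    passed = sum(b["passed"] for b in by_type.values())
    return {
        "totals": {
            "scenarios": total,
            "passed": passed,
            "pass_rate": passed / total if total else 0.0,
        },
        "by_scenario_type": dict(by_type),
    }
```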
| Family | Registered name | Status |
|---|---|---|
| LLM-As-Judge | `llm_judge` | Works. Installed by passing `--judge-model` |
| Code-Based | `static_json` | Works. Deterministic scorer for structured answers |
| Code-Based | `exact_string_match` | Skeleton: raises `NotImplementedError` |
| Code-Based | `numeric_match` | Skeleton: raises `NotImplementedError` |
| Semantic-Score | `semantic_similarity` | Skeleton: raises `NotImplementedError` |

static_json is registered automatically and can be selected with --scorer-default static_json. It compares the final answer from a saved trajectory against scenario.expected_answer, parsing JSON objects, arrays, Python-style literals, tuple lists, nested structures, markdown-fenced answers, answer-prefixed outputs, and count-only answers. See Static JSON Evaluation for details.
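
To give a feel for the kind of tolerant parsing described above, here is a simplified sketch, not the scorer's actual code. It strips a markdown fence and an answer prefix, then tries JSON before falling back to a Python literal. The helper name `parse_structured` and the exact prefix wording it accepts are assumptions.

```python
import ast
import json
import re
from typing import Any

FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block
_FENCED = re.compile(FENCE + r"[a-zA-Z]*\n(.*?)" + FENCE, re.DOTALL)
_PREFIX = re.compile(r"^(?:final\s+)?answer\s*:\s*", re.IGNORECASE)  # assumed wording


def parse_structured(answer: str) -> Any:
    """Best-effort parse of a noisy structured answer."""
    text = answer.strip()
    fenced = _FENCED.search(text)
    if fenced:
        # Keep only the body of the first fenced block.
        text = fenced.group(1).strip()
    text = _PREFIX.sub("", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Python-style dicts and tuple lists are not valid JSON.
        return ast.literal_eval(text)
```

`ast.literal_eval` only evaluates literals, so it is safe to run on untrusted model output; it raises `ValueError` or `SyntaxError` when the text is not a literal.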
Skeleton scorers don't auto-register; calling them raises
NotImplementedError. Fill in the body and call
evaluation.scorers.register("<name>", <fn>) to enable.
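
As an illustration of what filling in a skeleton might involve, the sketch below implements a numeric comparison with relative and absolute tolerance using `math.isclose`. It is not the project's `numeric_match`: the simplified signature, the local `ScorerResult` stand-in, and the choice to compare against the first number found in the answer are all assumptions.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class ScorerResult:  # local stand-in for evaluation.models.ScorerResult
    scorer: str
    passed: bool
    score: float
    rationale: str = ""


_NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def numeric_match_sketch(
    expected: float, answer: str, rel_tol: float = 1e-6, abs_tol: float = 0.0
) -> ScorerResult:
    """Compare the first number in the answer against the expected value."""
    found = _NUMBER.search(answer)
    if found is None:
        return ScorerResult("numeric_match", False, 0.0, "no number in answer")
    actual = float(found.group())
    ok = math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)
    rationale = "" if ok else f"expected {expected}, got {actual}"
    return ScorerResult("numeric_match", ok, 1.0 if ok else 0.0, rationale)
```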
Six-criterion rubric, prompt mirrored from
src/tmp/evaluation_agent/result_evaluation_prompt.py:
1. `task_completion`
2. `data_retrieval_accuracy`
3. `generalized_result_verification`
4. `agent_sequence_correct`
5. `clarity_and_justification`
6. `hallucinations`

A run passes overall iff the first five are true and
hallucinations is false. The score is the fraction of the first
five satisfied, minus 0.2 if hallucinations is true. The judge's
free-form suggestions (or legacy reason) lands in
score.rationale; the full review dict lands in score.details.
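
The pass rule and score arithmetic above can be written out as a short sketch. `judge_score` is a hypothetical helper, and treating a missing criterion as false is an assumption. The text does not say whether the score is clamped at zero, so this version does not clamp.

```python
from typing import Dict, Tuple

POSITIVE = (
    "task_completion",
    "data_retrieval_accuracy",
    "generalized_result_verification",
    "agent_sequence_correct",
    "clarity_and_justification",
)


def judge_score(review: Dict[str, bool]) -> Tuple[bool, float]:
    """Return (passed, score) for a six-criterion review dict."""
    satisfied = sum(bool(review.get(name)) for name in POSITIVE)
    hallucinated = bool(review.get("hallucinations"))
    passed = satisfied == len(POSITIVE) and not hallucinated
    # Fraction of the first five criteria, minus 0.2 for hallucinations.
    score = satisfied / len(POSITIVE) - (0.2 if hallucinated else 0.0)
    return passed, score
```

For example, a review with four of the five positive criteria satisfied and `hallucinations` true fails, with a score of 0.8 - 0.2 = 0.6.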
To customise: edit _PROMPT_TEMPLATE in
src/evaluation/scorers/llm_judge.py.
When llm_judge is active and --judge-model is provided, evaluation
aborts any row where the trajectory model matches the judge model
(after normalizing the `litellm_proxy/` prefix). This guards against
self-evaluation bias by default.
Example error:
self-judging is not allowed for llm_judge: trajectory model 'litellm_proxy/aws/claude-opus-4-6' matches judge model 'litellm_proxy/aws/claude-opus-4-6'
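
A guard of this kind could be sketched as below. The helper names and the use of `ValueError` are assumptions; only the prefix normalization and the message wording are taken from the description above.

```python
PROXY_PREFIX = "litellm_proxy/"


def normalize_model(model_id: str) -> str:
    """Drop the proxy prefix so routed and direct ids compare equal."""
    return model_id.removeprefix(PROXY_PREFIX)


def check_not_self_judging(trajectory_model: str, judge_model: str) -> None:
    """Raise if the judge would be scoring its own output."""
    if normalize_model(trajectory_model) == normalize_model(judge_model):
        raise ValueError(
            "self-judging is not allowed for llm_judge: "
            f"trajectory model {trajectory_model!r} matches judge model {judge_model!r}"
        )
```

`str.removeprefix` requires Python 3.9 or later.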
from pathlib import Path
from evaluation import Evaluator
from evaluation.scorers.llm_judge import install
from llm import LiteLLMBackend
install(LiteLLMBackend(model_id="litellm_proxy/aws/claude-opus-4-6"))
report = Evaluator(default_scorer="llm_judge").evaluate(
    trajectories_path=Path("traces/trajectories"),
    scenarios_paths=[Path("groundtruth/101.json")],
)
for r in report.results:
    print(r.run_id, r.score.passed, r.score.score)

For static JSON scoring, no judge model installation is required:
from pathlib import Path
from evaluation import Evaluator
report = Evaluator(default_scorer="static_json").evaluate(
    trajectories_path=Path("traces/trajectories"),
    scenarios_paths=[Path("groundtruth/101.json")],
)

To enforce the self-judging guard in programmatic usage, pass `judge_model`:
```python
report = Evaluator(
    default_scorer="llm_judge",
    judge_model="litellm_proxy/azure/gpt-5.4",
).evaluate(
    trajectories_path=Path("traces/trajectories"),
    scenarios_paths=[Path("groundtruth/101.json")],
)
```

`static_json` is a deterministic scorer for structured outputs. It is useful when the expected answer is a JSON object, JSON array, Python-style dictionary, tuple list, nested structure, or count-only answer.
The scorer compares scenario.expected_answer against the trajectory final answer. It parses noisy structured outputs, normalizes scalar values, flattens nested structures into key paths, and reports strict exact match, partial exact match, partial similarity, precision, recall, F1, missing keys, extra keys, and key-level details.
A run passes only when the full structured answer is an exact match. Partial correctness is still available through the score and score.details.
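
The key-path flattening and the precision/recall/F1 figures can be illustrated with a small sketch. This is not the scorer's code: the key-path syntax (`a.b[0]`), the helper names, and the definition of a match as "same path, equal value" are assumptions chosen for the example.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {key_path: scalar}."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, (list, tuple)):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat: Dict[str, Any] = {}
    for path, v in items:
        flat.update(flatten(v, path))
    return flat


def key_level_scores(expected: Any, actual: Any) -> Dict[str, Any]:
    """Compare two structures key path by key path."""
    exp, act = flatten(expected), flatten(actual)
    matched = [k for k in exp if k in act and act[k] == exp[k]]
    precision = len(matched) / len(act) if act else 0.0
    recall = len(matched) / len(exp) if exp else 0.0
    total = precision + recall
    return {
        "exact_match": exp == act,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / total if total else 0.0,
        "missing_keys": sorted(set(exp) - set(act)),
        "extra_keys": sorted(set(act) - set(exp)),
    }
```

In this sketch only `exact_match` would decide pass or fail, which mirrors the rule that partial correctness is reported but does not pass a run.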
To score structured answers with the deterministic static JSON scorer:
uv run evaluate \
--trajectories traces/trajectories \
--scenarios groundtruth/101.json \
--scorer-default static_json

For folder-based generated scenarios with scenario_<id>/groundtruth.txt:
uv run evaluate \
--trajectories traces/trajectories/direct_llm \
--scenarios /path/to/scenarios_data \
--scorer-default static_json \
--reports-dir reports/static_json_direct_llm

Detailed documentation: Static JSON Evaluation.
from evaluation import scorers
from evaluation.models import ScorerResult

def keyword_hit(scenario, answer, trajectory_text) -> ScorerResult:
    # Keywords are read from an extra (non-schema) scenario field.
    required = (scenario.model_extra or {}).get("required_keywords", [])
    hits = [k for k in required if k.lower() in answer.lower()]
    passed = len(hits) == len(required)
    return ScorerResult(
        scorer="keyword_hit",
        passed=passed,
        score=len(hits) / max(1, len(required)),
        rationale="" if passed else f"missing: {set(required) - set(hits)}",
    )

scorers.register("keyword_hit", keyword_hit)
# Any scenario with "scoring_method": "keyword_hit" now routes here.

export AGENT_TRAJECTORY_DIR=$(pwd)/traces/trajectories
for f in groundtruth/*.json; do
  utt=$(python3 -c "import json,sys;d=json.load(open(sys.argv[1]));print(d['text'])" "$f")
  sid=$(python3 -c "import json,sys;d=json.load(open(sys.argv[1]));print(d['id'])" "$f")
  uv run claude-agent "$utt" --scenario-id "$sid"
done
uv run evaluate \
--trajectories traces/trajectories \
--scenarios groundtruth/*.json \
--scorer-default llm_judge \
--judge-model litellm_proxy/azure/gpt-5.4