Static JSON Evaluation

static_json is a deterministic scorer for structured AssetOpsBench answers. It is intended for scenarios whose expected answers are structured outputs, including JSON objects, JSON arrays, Python-style dictionaries, tuple lists, nested JSON, and count-only answers.

This scorer is adapted from the DeepSynth-style static evaluation design: it parses structured outputs, normalizes values, flattens nested structures, compares key-value pairs, computes exact and partial metrics, and reports missing/extra keys. In AssetOpsBench, it is integrated into the existing trajectory-based evaluation pipeline.

Evaluation Workflow

static_json follows the same high-level workflow as the LLM-as-Judge evaluator:

agent run
→ saved trajectory
→ extract final answer and scenario_id
→ load ground truth
→ run scorer
→ aggregate benchmark-level report

The scorer is invoked through the existing evaluate command:

uv run evaluate \
  --trajectories traces/trajectories \
  --scenarios /path/to/scenarios_data \
  --scorer-default static_json \
  --reports-dir reports/static_json_eval

Expected Inputs

Trajectories

The evaluator reads saved trajectory JSON files from the path passed to --trajectories.

Example trajectory:

{
  "run_id": "11",
  "scenario_id": "11",
  "runner": "direct-llm-agent",
  "model": "litellm_proxy/Azure/gpt-5-mini-2025-08-07",
  "question": "Check the storage-related maintenance job records...",
  "answer": "{\"energy\":14,\"material\":27}",
  "trajectory": {
    "turns": []
  }
}

The model's final answer is read from the trajectory's answer field.

If scenario_id is missing or null, the evaluator can fall back to the trajectory filename stem. For example:

traces/trajectories/34.json → scenario_id = "34"

For some generated trajectories, scenario_id may contain a descriptive label while run_id contains the numeric scenario id. In that case, the loader can use run_id as a fallback join key.

Ground Truth Layout

The evaluator can load ground truth from scenario folders.

Expected folder layout:

scenarios_data/
  scenario_11/
    groundtruth.txt
  scenario_12/
    groundtruth.txt
  scenario_13/
    groundtruth.txt

Example groundtruth.txt:

{'energy': 14, 'material': 48}

The folder name is converted into the scenario id:

scenario_11 → 11

The evaluator then joins trajectories and ground truth by scenario id.

How Matching Works

The evaluator loads all trajectories and all ground-truth scenario folders, then joins them.

Example:

trajectory: traces/trajectories/direct_llm/11.json
run_id: 11
scenario_id: 11
answer: {"energy":14,"material":27}

ground truth: scenarios_data/scenario_11/groundtruth.txt
expected answer: {'energy': 14, 'material': 48}

The evaluator matches the trajectory to the ground truth using the available join keys, then compares:

model answer vs. ground-truth answer

Only matched trajectory/scenario pairs are evaluated. If a trajectory has no matching ground-truth scenario, it is skipped. If a ground-truth scenario has no matching trajectory, it is skipped.

What `static_json` Handles

The scorer can parse common structured model outputs, including:

JSON object

{"energy": 14, "material": 48}

JSON array

[
  {"equipment_group": "Pumps/fans/generators", "count": 21},
  {"equipment_group": "Lines & drives", "count": 15}
]

Python-style dictionary

{'energy': 14, 'material': 48}

Python-style list of tuples

[("Engines & motors", 5), ("Lines & drives", 2)]

Markdown fenced answer

```json
{"repair": 13, "replace": 0}
```

Answer-prefixed output

Final Answer: {"repair": 13, "replace": 0}

Count-only answer

The scorer also handles simple noisy count-only answers such as:

The answer is 34.

Metrics

static_json reports both aggregate metrics and key-level details.

Metric	Meaning
`strict_exact_match_accuracy`	1.0 only if the full flattened gold answer exactly matches the flattened model answer
`partial_exact_match_accuracy`	Fraction of gold keys whose values exactly match
`partial_similarity_score`	Average similarity score across gold keys, including numeric closeness
`precision`	Exact matches divided by total model keys
`recall`	Exact matches divided by total gold keys
`f1`	Harmonic mean of precision and recall
`missing_keys`	Keys present in gold but missing from model output
`extra_keys`	Keys present in model output but not in gold
`details`	Per-key comparison records

The scorer marks a scenario as passed only when the structured answer is a strict exact match. Partial correctness is still available through score, f1, partial_exact_match_accuracy, and detailed key-level results.

Example

Ground truth:

{'energy': 14, 'material': 48}

Model answer:

{"energy":14,"material":27}

The scorer flattens both answers:

gold:
answer.energy = 14
answer.material = 48

model:
answer.energy = 14
answer.material = 27

Then it compares each key:

answer.energy: expected 14, got 14 → exact
answer.material: expected 48, got 27 → mismatch

Example score:

{
  "passed": false,
  "score": 0.5,
  "details": {
    "partial_exact_match_accuracy": 0.5,
    "strict_exact_match_accuracy": 0.0,
    "partial_similarity_score": 0.5,
    "precision": 0.5,
    "recall": 0.5,
    "f1": 0.5,
    "total_gold_keys": 2,
    "total_model_keys": 2,
    "matched_keys": 2,
    "exact_value_matches": 1,
    "missing_keys": [],
    "extra_keys": []
  }
}

Reports

The evaluator writes per-run reports and an aggregate report to the directory passed with --reports-dir.

Example:

uv run evaluate \
  --trajectories traces/trajectories/direct_llm \
  --scenarios /path/to/scenarios_data \
  --scorer-default static_json \
  --reports-dir reports/static_json_direct_llm

Output:

reports/static_json_direct_llm/
  11.json
  12.json
  ...
  _aggregate.json

Each per-run report contains the scenario id, run id, model answer, score, key-level comparison details, and operational metrics.

The aggregate report summarizes the number of scored scenarios, pass rate, runner/model names, and operational metrics.

Recommended Usage

Use llm_judge for:

natural-language reasoning answers
explanations
open-ended diagnostic responses
answers with many acceptable phrasings

Use static_json for:

JSON objects
JSON arrays
Python-style dictionaries
tuple lists
nested structured outputs
count-only answers
benchmark scenarios where key/value correctness matters

Running Tests

Run the static JSON scorer tests:

uv run pytest src/evaluation/tests/test_static_json_scorer.py -v

Run loader tests for trajectory/scenario matching:

uv run pytest src/evaluation/tests/test_loader.py -v

Run evaluation tests:

uv run pytest src/evaluation/tests -v

Notes for Developers

The core scorer lives in:

src/evaluation/scorers/static_json.py

It is registered as:

static_json

and can be selected through:

--scorer-default static_json

The scorer itself only needs:

gold_answer
model_answer

The evaluation pipeline is responsible for:

trajectory file → final answer + scenario_id/run_id
scenario folder → groundtruth.txt
join trajectory and scenario
run scorer
write report

Acknowledgment

The static JSON scorer is inspired by DeepSynth-style structured-answer scoring. See Acknowledgments for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Static JSON Evaluation

Evaluation Workflow

Expected Inputs

Trajectories

Ground Truth Layout

How Matching Works

What `static_json` Handles

JSON object

JSON array

Python-style dictionary

Python-style list of tuples

Markdown fenced answer

Answer-prefixed output

Count-only answer

Metrics

Example

Reports

Recommended Usage

Running Tests

Notes for Developers

Acknowledgment

FilesExpand file tree

static-json-evaluation.md

Latest commit

History

static-json-evaluation.md

File metadata and controls

Static JSON Evaluation

Evaluation Workflow

Expected Inputs

Trajectories

Ground Truth Layout

How Matching Works

What static_json Handles

JSON object

JSON array

Python-style dictionary

Python-style list of tuples

Markdown fenced answer

Answer-prefixed output

Count-only answer

Metrics

Example

Reports

Recommended Usage

Running Tests

Notes for Developers

Acknowledgment

What `static_json` Handles