fix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape by placerda · Pull Request #195 · Azure/agentops

placerda · 2026-05-29T04:08:48Z

Why

Tutorial users hit "Threshold status: FAILED · exit code 2" on their first pass with every metric showing actual=missing in report.md, even though Foundry's cloud eval ran to completion (rows graded, status completed, real eval_id). The PR was real — the parser was the bug.

That breaks the tutorial's only contract: "first pass should be green" so the user understands what green looks like before they're asked to break it.

Root cause

pipeline/cloud_results._metric_from_result only probed {score, value, result, passed} at the top level of each output_item.results[i] entry. The real on-the-wire shape emitted by Foundry's azure_ai_evaluator grader carries more keys and the score isn't always at the top:

// Verified against Azure/azure-sdk-for-python evaluation fixture
// evaluation_util_convert_expected_output.json
{
  "type": "azure_ai_evaluator",
  "name": "violence",
  "metric": "violence",
  "score": 0,              // ← legitimate value; old code returned None on `score: 0` in some paths
  "label": "pass",
  "reason": "...",
  "threshold": 3,
  "passed": true,
  "sample": {...},         // ← some prompt-based graders nest score here
  "status": "completed"
}

Custom prompt-based graders nest the score under sample.score or details.score. Binary content-safety graders sometimes emit only label: "pass" / label: "fail" with no numeric score at all.

End result: every metric in a real Foundry cloud run came back value: null, aggregate_metrics: {} was empty, threshold table showed actual=missing for every row, and CI fired exit 2. Reproduced against PO's real failed PR run #26616081824 (eval_f786c180b1304c6a8926717cd78256f4, agent travel-agent:2, 3 rows, status completed).

What changes

Parser (`src/agentops/pipeline/cloud_results.py`)

Wider score probe. Top-level keys searched: score, value, result, metric_value, rating, grader_score, numeric_value.
passed (bool) → 1.0 / 0.0 as before, plus
label ("pass" / "fail" / etc.) → 1.0 / 0.0 for graders that only emit labels.
Descend into sample and details as a final fallback for custom prompt graders.
score: 0 treated as a legitimate value (was being conflated with None in some code paths).
When still nothing found, record a structured error pointing at the new raw-items artifact instead of silently writing null.

Observability (`src/agentops/pipeline/orchestrator.py`)

Always persist raw output_items as <output_dir>/cloud_output_items.json alongside results.json / report.md. Future parser regressions are now debuggable from the artifact bundle alone.
Emit an explicit progress warning when a cloud run returns rows but yields zero usable metric scores. The warning names the artifact path.

Tests (`tests/unit/test_cloud_results.py`)

+5 new tests:

test_extracts_score_zero_as_legitimate_value — score: 0 boundary.
test_extracts_real_foundry_azure_ai_evaluator_result_shape — the full real azure_ai_evaluator envelope with type, metric, label, threshold, passed, sample, status.
test_extracts_score_nested_in_sample_when_top_level_missing — fallback into sample.
test_extracts_score_from_label_when_no_numeric_score — label: "pass" / label: "fail" mapping.
test_records_diagnostic_reason_when_score_is_missing — diagnostic error path.

Validation

python -m pytest tests/unit/test_cloud_results.py -x -q    # 11 passed
python -m pytest tests/ -x -q                              # 789 passed, 3 skipped

No new dependencies. Tests still run without Azure credentials.

Real impact size

CHANGELOG.md                           |  25 ++++++++
src/agentops/pipeline/cloud_results.py |  79 +++++++++++++++++++++--
src/agentops/pipeline/orchestrator.py  |  29 +++++++++
tests/unit/test_cloud_results.py       | 113 +++++++++++++++++++++++++++++++++
4 files changed, 239 insertions(+), 7 deletions(-)

(The big numbers reported by git diff --stat are CRLF line-ending normalization from Windows; the real change is +239/-7. Verify with git diff --ignore-cr-at-eol --stat origin/develop..HEAD.)

Backward compatibility

Existing tests still pass (test_extracts_metric_scores_from_results — covering the {score, value, passed} shape — unchanged and green).
No CLI / output schema changes. results.json and report.md still have their stable shape; only the values change (numbers now populated where they were null).
No new flags. Behavior is strictly an improvement to the parser path that was already running.
cloud_output_items.json is a new artifact, not a renamed one — does not collide with any existing artifact in .agentops/results/<timestamp>/.

Follow-ups (out of scope)

Will cherry-pick to main as a separate small PR (same flow as the PyYAML hotfix earlier this week).

…result shape The cloud-eval parser was returning value=null for every metric in real Foundry runs even when graders completed successfully, causing the PR / deploy gate to fire 'Threshold status: FAILED' with all thresholds showing actual=missing on the very first tutorial pass. Root cause: _metric_from_result only probed {score|value|result|passed} at the top level. The real azure_ai_evaluator shape (verified against Azure/azure-sdk-for-python fixture evaluation_util_convert_expected_output.json) emits {type, name, metric, score, label, reason, threshold, passed, sample, status}, and some custom prompt-based graders nest the score under sample.score / details.score. Fix: widen the probe to (score, value, result, metric_value, rating, grader_score, numeric_value), then passed (bool), then label ('pass'/'fail'), then descend into sample/details. Treat score: 0 as a legitimate value (was being lost). When still nothing found, record a structured error pointing at the new raw-items artifact. Also: always persist the raw Foundry output_items as cloud_output_items.json next to results.json so future parser regressions are debuggable from the artifact bundle alone, and emit an explicit progress warning when a cloud run yields zero usable scores despite returning rows. Tests: +5 new tests covering the real Foundry shape, score=0 boundary, label-only fallback, nested sample.score, and the diagnostic error path. Full suite: 789 passed, 3 skipped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…result shape (#195) (#196) The cloud-eval parser was returning value=null for every metric in real Foundry runs even when graders completed successfully, causing the PR / deploy gate to fire 'Threshold status: FAILED' with all thresholds showing actual=missing on the very first tutorial pass. Root cause: _metric_from_result only probed {score|value|result|passed} at the top level. The real azure_ai_evaluator shape (verified against Azure/azure-sdk-for-python fixture evaluation_util_convert_expected_output.json) emits {type, name, metric, score, label, reason, threshold, passed, sample, status}, and some custom prompt-based graders nest the score under sample.score / details.score. Fix: widen the probe to (score, value, result, metric_value, rating, grader_score, numeric_value), then passed (bool), then label ('pass'/'fail'), then descend into sample/details. Treat score: 0 as a legitimate value (was being lost). When still nothing found, record a structured error pointing at the new raw-items artifact. Also: always persist the raw Foundry output_items as cloud_output_items.json next to results.json so future parser regressions are debuggable from the artifact bundle alone, and emit an explicit progress warning when a cloud run yields zero usable scores despite returning rows. Tests: +5 new tests covering the real Foundry shape, score=0 boundary, label-only fallback, nested sample.score, and the diagnostic error path. Full suite: 789 passed, 3 skipped. (cherry picked from commit 0fe6b00) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

placerda merged commit 0fe6b00 into develop May 29, 2026
12 checks passed

placerda deleted the feature/cloud-results-null-scores-fix branch May 29, 2026 04:12

placerda mentioned this pull request May 29, 2026

hotfix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape (cherry-pick #195) #196

Merged

placerda mentioned this pull request May 29, 2026

fix(cloud-eval): lift grader execution errors into RowMetric.error #202

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape#195

fix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape#195
placerda merged 1 commit into
developfrom
feature/cloud-results-null-scores-fix

placerda commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

placerda commented May 29, 2026

Why

Root cause

What changes

Parser (src/agentops/pipeline/cloud_results.py)

Observability (src/agentops/pipeline/orchestrator.py)

Tests (tests/unit/test_cloud_results.py)

Validation

Real impact size

Backward compatibility

Follow-ups (out of scope)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Parser (`src/agentops/pipeline/cloud_results.py`)

Observability (`src/agentops/pipeline/orchestrator.py`)

Tests (`tests/unit/test_cloud_results.py`)