fix(cloud-eval): extract scores from real Foundry azure_ai_evaluator result shape #195
Merged
…result shape
The cloud-eval parser was returning value=null for every metric in real Foundry runs even when graders completed successfully, causing the PR / deploy gate to fire 'Threshold status: FAILED' with all thresholds showing actual=missing on the very first tutorial pass.
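To make the failure chain concrete, here is a minimal sketch of how missing metric values can turn into a failed gate. The function name `check_thresholds`, the "greater than or equal to a minimum" comparison, and the `PASSED` label are assumptions for illustration; only `FAILED` and `actual=missing` come from the report described above.

```python
from typing import Optional


def check_thresholds(
    aggregate_metrics: dict[str, Optional[float]],
    thresholds: dict[str, float],
) -> tuple[str, list[dict[str, object]]]:
    """Hypothetical gate: a metric passes only if it exists and meets its minimum."""
    rows: list[dict[str, object]] = []
    for metric, minimum in thresholds.items():
        actual = aggregate_metrics.get(metric)
        rows.append(
            {
                "metric": metric,
                "minimum": minimum,
                # A metric the parser never populated is reported as "missing".
                "actual": "missing" if actual is None else actual,
                "ok": actual is not None and actual >= minimum,
            }
        )
    status = "PASSED" if all(row["ok"] for row in rows) else "FAILED"
    return status, rows


# With an empty aggregate (every value parsed as null), every row is missing
# and the gate fails, regardless of how well the graders actually scored.
status, rows = check_thresholds({}, {"relevance": 3.0, "coherence": 3.0})
```

Under these assumptions the gate cannot distinguish "the agent scored badly" from "the parser found nothing", which is why a parser bug surfaces as a threshold failure.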
Root cause: _metric_from_result only probed {score|value|result|passed} at the top level. The real azure_ai_evaluator shape (verified against Azure/azure-sdk-for-python fixture evaluation_util_convert_expected_output.json) emits {type, name, metric, score, label, reason, threshold, passed, sample, status}, and some custom prompt-based graders nest the score under sample.score / details.score.
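As an illustration of the root cause, the sketch below builds two result entries using the key names listed above and runs a deliberately naive probe over them. The values are made up, and `naive_probe` is a hypothetical stand-in rather than the project's previous code; it only shows how a top-level, truthiness-based lookup can lose both a nested score and a legitimate `0`.

```python
from typing import Any, Optional

# Hypothetical entries: key names follow the shape described above,
# values are invented for illustration.
top_level_result: dict[str, Any] = {
    "type": "azure_ai_evaluator",
    "name": "relevance",
    "metric": "relevance",
    "score": 0,
    "label": "fail",
    "reason": "answer ignores the question",
    "threshold": 3,
    "passed": False,
    "sample": None,
    "status": "completed",
}
nested_result: dict[str, Any] = {
    "type": "azure_ai_evaluator",
    "name": "custom_prompt_grader",
    "sample": {"score": 4},
    "status": "completed",
}


def naive_probe(result: dict[str, Any]) -> Optional[float]:
    """Top-level-only lookup that relies on truthiness (illustrative only)."""
    raw = result.get("score") or result.get("value") or result.get("result")
    # `0 or ...` falls through, so a real score of 0 is treated like "absent".
    return float(raw) if raw else None


lost_zero = naive_probe(top_level_result)   # None, although score is 0
lost_nested = naive_probe(nested_result)    # None, although sample.score is 4
```

Both calls return `None`, which downstream code would record as a missing value.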
Fix: widen the probe to (score, value, result, metric_value, rating, grader_score, numeric_value), then passed (bool), then label ('pass'/'fail'), then descend into sample/details. Treat score: 0 as a legitimate value (was being lost). When still nothing found, record a structured error pointing at the new raw-items artifact.
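The probe order described in this fix can be sketched roughly as follows. This is a simplified reading of the commit message, not the code in `cloud_results.py`: the function names are hypothetical, only the labels "pass" and "fail" are mapped (other labels are left unhandled because the source does not list them), and the nested descent looks only for numeric keys.

```python
from collections.abc import Mapping
from typing import Any, Optional

NUMERIC_KEYS = (
    "score", "value", "result", "metric_value",
    "rating", "grader_score", "numeric_value",
)
LABEL_SCORES = {"pass": 1.0, "fail": 0.0}


def _first_number(mapping: Mapping[str, Any]) -> Optional[float]:
    """Return the first numeric value found under NUMERIC_KEYS, keeping 0."""
    for key in NUMERIC_KEYS:
        raw = mapping.get(key)
        # bool is a subclass of int, so exclude it explicitly here.
        if isinstance(raw, (int, float)) and not isinstance(raw, bool):
            return float(raw)
    return None


def extract_score(result: Mapping[str, Any]) -> Optional[float]:
    """Probe numeric keys, then passed, then label, then sample/details."""
    number = _first_number(result)
    if number is not None:  # explicit None check so score 0 survives
        return number

    passed = result.get("passed")
    if isinstance(passed, bool):
        return 1.0 if passed else 0.0

    label = result.get("label")
    if isinstance(label, str) and label.strip().lower() in LABEL_SCORES:
        return LABEL_SCORES[label.strip().lower()]

    for container in ("sample", "details"):
        nested = result.get(container)
        if isinstance(nested, Mapping):
            number = _first_number(nested)
            if number is not None:
                return number
    return None
```

The key design point is that every step tests for presence (`is not None`, `isinstance`) rather than truthiness, so `0` and `False` are kept as real answers.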
Also: always persist the raw Foundry output_items as cloud_output_items.json next to results.json so future parser regressions are debuggable from the artifact bundle alone, and emit an explicit progress warning when a cloud run yields zero usable scores despite returning rows.
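A minimal sketch of the observability change, assuming the raw items are JSON-serializable dictionaries. The helper names, the logger name, and the warning text are hypothetical; only the artifact file name `cloud_output_items.json` comes from the description above.

```python
import json
import logging
from pathlib import Path
from typing import Any, Optional, Sequence

logger = logging.getLogger("cloud_eval_sketch")


def persist_output_items(output_dir: Path, output_items: Sequence[Any]) -> Path:
    """Write the raw items next to the other run artifacts and return the path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / "cloud_output_items.json"
    # default=str keeps the dump from failing on non-JSON types such as datetimes.
    target.write_text(
        json.dumps(list(output_items), indent=2, default=str),
        encoding="utf-8",
    )
    return target


def warn_if_no_usable_scores(
    row_count: int, scores: Sequence[Optional[float]]
) -> bool:
    """Warn when rows came back but none of them produced a score."""
    usable = [score for score in scores if score is not None]
    if row_count > 0 and not usable:
        logger.warning(
            "Cloud run returned %d row(s) but no usable scores; "
            "inspect cloud_output_items.json",
            row_count,
        )
        return True
    return False
```

Writing the raw payload unconditionally, rather than only on failure, means a later parser regression can be diagnosed from an artifact bundle that was produced before anyone knew there was a problem.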
Tests: +5 new tests covering the real Foundry shape, score=0 boundary, label-only fallback, nested sample.score, and the diagnostic error path. Full suite: 789 passed, 3 skipped.
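Two of these cases, the score=0 boundary and the diagnostic error path, could be exercised with pytest-style tests of roughly this form. `metric_from_result` below is a deliberately reduced stand-in that reads only a top-level `score`, and the fields of the error dictionary are assumptions; the real parser and its error structure may differ.

```python
from typing import Any

RAW_ITEMS_ARTIFACT = "cloud_output_items.json"


def metric_from_result(result: dict[str, Any]) -> dict[str, Any]:
    """Reduced stand-in: reads a top-level numeric score or records an error."""
    raw = result.get("score")
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        return {"name": result.get("name"), "value": float(raw), "error": None}
    return {
        "name": result.get("name"),
        "value": None,
        "error": {
            "reason": "no score found in grader result",
            "available_keys": sorted(result),
            "see_artifact": RAW_ITEMS_ARTIFACT,
        },
    }


def test_score_zero_is_a_legitimate_value() -> None:
    metric = metric_from_result({"name": "relevance", "score": 0})
    assert metric["value"] == 0.0
    assert metric["error"] is None


def test_missing_score_records_diagnostic() -> None:
    metric = metric_from_result({"name": "relevance", "status": "completed"})
    assert metric["value"] is None
    assert metric["error"]["see_artifact"] == RAW_ITEMS_ARTIFACT
    assert metric["error"]["available_keys"] == ["name", "status"]
```

Tests of this kind need no network access or credentials because they feed plain dictionaries to a pure function.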
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
placerda added a commit that referenced this pull request on May 29, 2026
…result shape (#195) (#196)
(cherry picked from commit 0fe6b00)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Why
Tutorial users hit "Threshold status: FAILED · exit code 2" on their first pass with every metric showing actual=missing in report.md, even though Foundry's cloud eval ran to completion (rows graded, status completed, real eval_id). The PR was real; the parser was the bug.
That breaks the tutorial's only contract: "first pass should be green" so the user understands what green looks like before they're asked to break it.
Root cause
pipeline/cloud_results._metric_from_result only probed {score, value, result, passed} at the top level of each output_item.results[i] entry. The real on-the-wire shape emitted by Foundry's azure_ai_evaluator grader carries more keys ({type, name, metric, score, label, reason, threshold, passed, sample, status}) and the score isn't always at the top level.
Custom prompt-based graders nest the score under sample.score or details.score. Binary content-safety graders sometimes emit only label: "pass" / label: "fail" with no numeric score at all.
End result: every metric in a real Foundry cloud run came back value: null, aggregate_metrics: {} was empty, the threshold table showed actual=missing for every row, and CI fired exit 2. Reproduced against PO's real failed PR run #26616081824 (eval_f786c180b1304c6a8926717cd78256f4, agent travel-agent:2, 3 rows, status completed).
What changes
Parser (src/agentops/pipeline/cloud_results.py)
- Top-level probe widened to: score, value, result, metric_value, rating, grader_score, numeric_value.
- Falls back to passed (bool) → 1.0 / 0.0 as before, plus label ("pass" / "fail" / etc.) → 1.0 / 0.0 for graders that only emit labels.
- Descends into sample and details as a final fallback for custom prompt graders.
- score: 0 treated as a legitimate value (was being conflated with None in some code paths).
- When no score is found, a structured error pointing at the raw-items artifact is recorded instead of a bare null.
Observability (src/agentops/pipeline/orchestrator.py)
- Always persists the raw Foundry output_items as <output_dir>/cloud_output_items.json alongside results.json / report.md. Future parser regressions are now debuggable from the artifact bundle alone.
Tests (tests/unit/test_cloud_results.py)
+5 new tests:
- test_extracts_score_zero_as_legitimate_value: score: 0 boundary.
- test_extracts_real_foundry_azure_ai_evaluator_result_shape: the full real azure_ai_evaluator envelope with type, metric, label, threshold, passed, sample, status.
- test_extracts_score_nested_in_sample_when_top_level_missing: fallback into sample.
- test_extracts_score_from_label_when_no_numeric_score: label: "pass" / label: "fail" mapping.
- test_records_diagnostic_reason_when_score_is_missing: diagnostic error path.
Validation
No new dependencies. Tests still run without Azure credentials.
Real impact size
(The big numbers reported by git diff --stat are CRLF line-ending normalization from Windows; the real change is +239/-7. Verify with git diff --ignore-cr-at-eol --stat origin/develop..HEAD.)
Backward compatibility
- Existing test test_extracts_metric_scores_from_results (covering the {score, value, passed} shape) is unchanged and green.
- results.json and report.md still have their stable shape; only the values change (numbers now populated where they were null).
- cloud_output_items.json is a new artifact, not a renamed one; it does not collide with any existing artifact in .agentops/results/<timestamp>/.
Follow-ups (out of scope)
- … main as a separate small PR (same flow as the PyYAML hotfix earlier this week).