Source
Discovered during E2E validation of #78.
Problem
bundles/conversational_agent_baseline.yaml ships with:
- evaluator: avg_latency_seconds
  criteria: "<="
  value: 10.0
However, the latency reported by Foundry cloud evaluation covers the entire judge-evaluator pipeline, not just the agent call. Values measured against FoundryAgent on gpt-5.1:
- Run 1: 22.47s/row
- Run 2: 14.04s/row
Anyone who follows the Foundry tutorial and runs the smoke dataset against a real Foundry agent will hit an immediate FAIL on the latency rule, even though scores are fine across all 4 quality evaluators.
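To make the failure concrete, here is a minimal sketch of how a `<=` threshold rule treats the two measured runs. The `rule_passes` helper and the `CRITERIA` table are hypothetical names used for illustration only; the project's actual rule evaluation may be implemented differently.

```python
import operator

# Assumed set of comparison operators a rule's `criteria` field might accept.
CRITERIA = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}


def rule_passes(observed, criteria, value):
    """Return True when `observed <criteria> value` holds (hypothetical helper)."""
    return CRITERIA[criteria](observed, value)


# Per-row latencies reported in this issue.
measured = {"Run 1": 22.47, "Run 2": 14.04}

for run, seconds in measured.items():
    verdict = "PASS" if rule_passes(seconds, "<=", 10.0) else "FAIL"
    print(f"{run}: {seconds:.2f}s vs <= 10.0s -> {verdict}")
# Run 1: 22.47s vs <= 10.0s -> FAIL
# Run 2: 14.04s vs <= 10.0s -> FAIL
```

Both runs exceed the shipped 10.0s limit, so the rule fails regardless of the quality scores.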
Options
A. Raise the default for Foundry-targeted bundles to e.g. 30s. The same value may be too lax for HTTP/local backends, though.
B. Split into two metrics:
- agent_latency_seconds: only the agent invocation time
- eval_latency_seconds: the full pipeline, including evaluators
Tighter thresholds would then apply only to the agent metric.
C. Document prominently in the Foundry tutorial that avg_latency_seconds measures the full pipeline and that users should tune the threshold per backend.
Recommendation: C as a stop-gap, B as the proper fix (also addresses #ISSUE_FAKED_LATENCY).
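As a rough illustration of option B, the sketch below times the agent call and the evaluator pass separately and reports both metrics. Only the two metric names come from this issue; `timed`, `run_rows`, `check`, and the `invoke_agent` / `run_evaluators` callables are hypothetical stand-ins, not the project's real API.

```python
import time
from statistics import mean


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_rows(rows, invoke_agent, run_evaluators):
    """Collect per-row timings and return the two averaged metrics.

    invoke_agent(row) and run_evaluators(row, response) are hypothetical
    callables standing in for the backend call and the judge evaluators.
    """
    agent_times, pipeline_times = [], []
    for row in rows:
        response, agent_s = timed(invoke_agent, row)
        _, judge_s = timed(run_evaluators, row, response)
        agent_times.append(agent_s)
        # Full pipeline = agent invocation plus evaluator time.
        pipeline_times.append(agent_s + judge_s)
    return {
        "agent_latency_seconds": mean(agent_times),
        "eval_latency_seconds": mean(pipeline_times),
    }


def check(metrics, thresholds):
    """Apply a `<=` limit to each metric named in `thresholds`."""
    return {name: metrics[name] <= limit for name, limit in thresholds.items()}
```

With this split, a bundle could keep a tight limit on agent_latency_seconds while giving eval_latency_seconds a looser limit, or none at all, so slow judge evaluators no longer fail the agent's latency rule.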
Severity
Medium (every Foundry tutorial run currently fails on this rule)