Skip to content

Default avg_latency_seconds threshold (10s) is unrealistic for Foundry cloud eval #116

@Dongbumlee

Description

@Dongbumlee

Source

Discovered during E2E validation of #78.

Problem

bundles/conversational_agent_baseline.yaml ships with:

- evaluator: avg_latency_seconds
  criteria: "<="
  value: 10.0

But Foundry cloud evaluation latency includes the entire judge-evaluator pipeline. Real values measured against FoundryAgent on gpt-5.1:

  • Run 1: 22.47s/row
  • Run 2: 14.04s/row
    Anyone who follows the Foundry tutorial and runs the smoke dataset against a real Foundry agent will hit immediate FAIL on the latency rule, even though scoring is fine across all 4 quality evaluators.

Options

A. Raise the default for Foundry-targeted bundles to e.g. 30s. Same value can be too lax for HTTP/local backends though.

B. Split into two metrics:

  • agent_latency_seconds — only the agent invocation time
  • eval_latency_seconds — full pipeline including evaluators
    …and apply tighter thresholds only to the agent metric.

C. Document prominently in the Foundry tutorial that avg_latency_seconds measures the full pipeline and users should tune the threshold per backend.

Recommendation: C as a stop-gap, B as the proper fix (also addresses #ISSUE_FAKED_LATENCY).

Severity

Medium (every Foundry tutorial run currently fails on this rule)

Metadata

Metadata

Assignees

No one assigned

    Labels

    sprint3Sprint 3 test scenarios

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions