[Epic]: Performance-phase constraints: empirically-grounded thresholds for testbed-blocked overlays #1043

@mchmarny

Goal

Every overlay shipped by AICR must carry a performance-phase constraint whose thresholds are empirically grounded in runs on a real testbed, so that aicr validate ships a meaningful runtime gate instead of a placeholder. Overlays without a constraint either get one or are explicitly exempted in recipes/overlays_validation_floor_test.go, with a reference to the tracking issue.

Parent initiative: #1041.

Success criteria

  1. Every entry in recipes/overlays_validation_floor_test.go flagged by AICR_VALIDATION_FLOOR_STRICT=1 either has the required constraint or carries an exemption pointing at the issue tracking the testbed.
  2. Each added constraint uses thresholds derived from a real run on the corresponding service (no placeholder values).
  3. The validator the constraint invokes (e.g., inference-perf, nccl-all-reduce-bw) actually executes against the workload variant the overlay deploys. If it does not, the gap is filed under the validator capability Epic instead.
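
The first criterion can be sketched as a small check. This is only an illustration of the "constraint or exemption" rule, not the contents of recipes/overlays_validation_floor_test.go: the Overlay type, its fields, the floorViolations helper, the overlay names, and the issue number are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Overlay is a hypothetical, simplified view of one overlay entry.
// The real AICR types are not shown in this issue.
type Overlay struct {
	Name              string
	HasPerfConstraint bool // overlay carries a performance-phase constraint
	ExemptionIssue    int  // tracking issue number; 0 means no exemption
}

// floorViolations returns, in sorted order, the names of overlays that have
// neither a performance-phase constraint nor an exemption that points at a
// tracking issue.
func floorViolations(overlays []Overlay) []string {
	var bad []string
	for _, o := range overlays {
		if o.HasPerfConstraint || o.ExemptionIssue > 0 {
			continue
		}
		bad = append(bad, o.Name)
	}
	sort.Strings(bad)
	return bad
}

func main() {
	// Illustrative data only; names and the issue number are made up.
	overlays := []Overlay{
		{Name: "overlay-a", HasPerfConstraint: true},
		{Name: "overlay-b", ExemptionIssue: 9999},
		{Name: "overlay-c"},
	}
	fmt.Println(floorViolations(overlays)) // prints [overlay-c]
}
```

Under this sketch, the Epic is done when the returned list is empty for every overlay the strict floor flags.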

Scope

In scope:

  • Adding nccl-* / inference-perf / future performance constraints to existing strict-floor-flagged overlays once their testbed lands.
  • Re-running the threshold derivation when a service or accelerator generation changes.
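
One possible shape for the threshold derivation, as a sketch only: this issue does not specify how thresholds are computed. The rule below (slowest observed run minus a noise tolerance), the deriveFloor name, and the sample numbers are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// deriveFloor turns measurements from repeated testbed runs (for example,
// bandwidth in GB/s) into a minimum-acceptable threshold: the slowest
// observed run, reduced by a tolerance fraction to absorb run-to-run noise.
// Both the rule and the tolerance are assumptions, not AICR behavior.
func deriveFloor(samples []float64, tolerance float64) (float64, error) {
	if len(samples) == 0 {
		return 0, errors.New("no samples: refusing to emit a placeholder threshold")
	}
	if tolerance < 0 || tolerance >= 1 {
		return 0, errors.New("tolerance must be in [0, 1)")
	}
	lowest := math.Inf(1)
	for _, s := range samples {
		if math.IsNaN(s) || s <= 0 {
			return 0, errors.New("samples must be positive numbers")
		}
		lowest = math.Min(lowest, s)
	}
	return lowest * (1 - tolerance), nil
}

func main() {
	// Made-up measurements; real values must come from a testbed run.
	floor, err := deriveFloor([]float64{400, 380, 410}, 0.05)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.1f\n", floor) // prints 361.0
}
```

Returning an error on empty input keeps the derivation from silently producing a placeholder value, and re-running it with fresh samples covers the case where the service or accelerator generation changes.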

Out of scope (other epics under #1041):

  • New overlays themselves (covered under the overlay-coverage Epic).
  • Validator changes that would make an existing skip path actually run on a NIM / new workload variant (validator capability Epic).

New testbed-blocked performance gaps should be filed as standalone issues and attached here.
