Phase 6: re-baseline or investigate the 250ms bench startup-time soft-skip #13

@thomaschristory

The startup benchmark's internal threshold has been emitting a soft pytest.skip("OVER THRESHOLD") on every sub-phase since 4c. That's now seven consecutive releases (4c, 4d, 5a, 5b, 5c, the v1.0.0 fix, v1.0.0 final) where the threshold has never been met but the bench is treated as green via the soft-skip.

Current state (HEAD = f323c78)

  • tests/benchmarks/test_startup.py:18: THRESHOLD_SECONDS = 0.25 (250ms).
  • Recent bench medians (from session context):
    • Phase 5a: 262ms
    • Phase 5c: 267ms
    • 5d T2 (this session): 263.7ms — runs 302.8 / 263.7 / 256.7
  • Project-level bench target (CHANGELOG / spec): 300ms median.

So the bench has been within the 300ms project target the entire time, but never under the 250ms internal aspirational threshold. The result: the OVER THRESHOLD skip has become folklore — a soft-warning that no longer carries information because everyone expects it to fire.
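The gap between the two limits can be checked directly from the numbers above. This is a minimal sketch, not the bench's actual code: `classify` and `PROJECT_TARGET_SECONDS` are hypothetical names, and whether the real test compares with `<` or `<=` is an assumption.

```python
from statistics import median

THRESHOLD_SECONDS = 0.25       # internal threshold from test_startup.py
PROJECT_TARGET_SECONDS = 0.30  # project-level target; constant name is hypothetical


def classify(median_seconds: float) -> str:
    """Describe where a bench median sits relative to the two limits."""
    if median_seconds <= THRESHOLD_SECONDS:
        return "under threshold"
    if median_seconds <= PROJECT_TARGET_SECONDS:
        return "over threshold, within target"
    return "over target"


# Medians reported in this issue, in seconds. The 5d T2 value is
# recomputed from its three listed runs.
reported = {
    "5a": 0.262,
    "5c": 0.267,
    "5d T2": median([0.3028, 0.2637, 0.2567]),
}
results = {phase: classify(value) for phase, value in reported.items()}
```

Every reported median lands in the middle band, which is why the skip fires each time without the project target ever being missed.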

Two paths

Path A — re-baseline (cheap, recommended):
Bump THRESHOLD_SECONDS to a value the current implementation can actually meet (e.g. 0.28). The new threshold becomes a real regression detector: if a future change pushes startup over 280ms, the bench fires for a real reason.
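A rough sketch of what a re-baselined check could look like, assuming the bench times a command in a fresh process and compares the median of a few runs. The helper names are hypothetical, and raising on an overrun (rather than skipping) is an assumption; the issue leaves that choice open.

```python
import statistics
import subprocess
import sys
import time

THRESHOLD_SECONDS = 0.28  # example re-baselined value from Path A (0.25 today)


def time_startup(argv: list[str]) -> float:
    """Wall-clock seconds for one run of argv in a fresh process."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, capture_output=True)
    return time.perf_counter() - start


def median_startup(argv: list[str], runs: int = 3) -> float:
    """Median wall-clock startup time over `runs` fresh processes."""
    return statistics.median(time_startup(argv) for _ in range(runs))


def assert_under_threshold(
    median_seconds: float, threshold: float = THRESHOLD_SECONDS
) -> None:
    """Raise AssertionError (no skip) when the median exceeds the threshold."""
    if median_seconds > threshold:
        raise AssertionError(
            f"startup median {median_seconds * 1000:.1f}ms exceeds "
            f"{threshold * 1000:.0f}ms threshold"
        )
```

For example, `assert_under_threshold(median_startup([sys.executable, "-c", "pass"]))` times a bare interpreter; the real bench would pass the project's own entry point instead.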

Path B — investigate (more work, more value):
Profile cold startup and find the import that pushed startup time from "under the threshold in some prior phase" to "never under it". Possible suspects: lazy-import unwinding in 4d's audit-redaction work, the schemas module loading the bundled OpenAPI manifest at import time, the new skill subpackage. Outcome: either the threshold gets met again (skip becomes meaningful), or the investigation surfaces a real performance bug worth fixing.
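One possible starting point for the profiling is CPython's `-X importtime` flag, which prints per-import timings to stderr. The sketch below ranks imports by cumulative time; the function names are hypothetical, and the module to import has to be supplied by the caller since the package name is not stated in this issue.

```python
import subprocess
import sys


def parse_importtime(stderr_text: str) -> list[tuple[str, int]]:
    """Parse `python -X importtime` output into (module, cumulative_us) pairs."""
    rows = []
    prefix = "import time:"
    for line in stderr_text.splitlines():
        if not line.startswith(prefix):
            continue
        parts = line[len(prefix):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative_us = int(parts[1])
        except ValueError:
            continue  # header row: "self [us] | cumulative | imported package"
        rows.append((parts[2].strip(), cumulative_us))
    return rows


def slowest_imports(module: str, top: int = 10) -> list[tuple[str, int]]:
    """Import `module` in a fresh interpreter and rank imports by cumulative time."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    ranked = sorted(parse_importtime(proc.stderr), key=lambda row: row[1], reverse=True)
    return ranked[:top]
```

Cumulative time includes nested imports, so the top entries tend to be top-level packages; walking down from those to their children narrows in on the suspects listed above.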

The 5d retro flagged the trade-off: path A pins the regression detector to today's performance and accepts that as the new normal; path B preserves the original aspiration but costs investigation time.

Why deferred from v1

The 300ms project target has been met every release. The 250ms threshold is an internal aspirational signal that has been broken for seven sub-phases: it neither fails nor informs.

Acceptance

  • Either: THRESHOLD_SECONDS bumped and a new bench run confirms the bench passes (not skips) against the new threshold.
  • Or: a profiling write-up identifies the root cause; either it's fixable (PR + restored 250ms threshold) or it's not (file follow-up + bump threshold).

Either way, OVER THRESHOLD should stop being a permanent fixture of bench output.
