fix(dashboard): raise p95 sample threshold to 10 turns with p50 fallback #96

Open
nicolotognoni wants to merge 1 commit into feat/observability-otel-attrs-0.6.1 from fix/0.6.1-dashboard-p95-sample-threshold

@nicolotognoni (Collaborator)

Summary

  • PR #82 (feat(observability): emit patter.cost.* and patter.latency.* OTel span attributes) lowered the threshold from 5 to 2 turns. That gave parity with the call-list column but produced a misleading headline on short calls. Live test: n=5 turns → p95=1977ms while p50=309ms, because at n<10 the 95th percentile collapses to "slowest single turn" and stops being a tail estimate.
  • Raised threshold back to 10 across every dashboard surface that exposes a percentile, and added a p50 fallback so short calls still show a useful number (just labelled honestly).
  • App-level "Avg latency p95" card now also gates on ≥3 qualifying calls, otherwise it renders "—" instead of an average that's effectively dominated by whichever short call happens to be in the bucket.
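
To illustrate the collapse described in the first bullet, here is a minimal sketch. It assumes a nearest-rank percentile (the dashboard's actual percentile method is not shown in this PR), and the turn latencies are made-up values chosen only so that the median and maximum match the live-test figures.

```typescript
// Nearest-rank percentile. This is an assumed method for illustration,
// not necessarily the one the dashboard uses.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile() needs at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // 1-based rank of the smallest value covering p percent of the samples.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical per-turn latencies in ms for a 5-turn call.
const turnLatenciesMs = [280, 295, 309, 340, 1977];

// p50: rank = ceil(0.50 * 5) = 3, the middle sample.
const p50 = percentile(turnLatenciesMs, 50);
// p95: rank = ceil(0.95 * 5) = 5, the last sample, i.e. the slowest turn.
const p95 = percentile(turnLatenciesMs, 95);

console.log({ p50, p95, max: Math.max(...turnLatenciesMs) });
```

With only five samples the p95 rank lands on the last element, so the "percentile" is just the maximum, while p50 is unaffected by the one slow turn.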

Implementation

  • MIN_TURNS_FOR_PERCENTILES = 10 exported from LatencyPanel.tsx and MetricsPanel.tsx; MIN_TURNS_FOR_P95_COLUMN = 10 from CallTable.tsx; MIN_TURNS_FOR_AVG_P95 = 10 and MIN_CALLS_FOR_AVG_P95 = 3 from App.tsx. Single source per surface so the threshold can't drift.
  • Detail pane (LatencyPanel, both latency view branches of MetricsPanel): below the threshold the p95 box renders p50 (n<10) instead, with a tooltip and footer line spelling out the rule. Realtime calls (single-bucket waterfall) and pipeline calls (stt/llm/tts waterfall) both covered.
  • Call list (CallTable.tsx): column renamed "p95 latency" → "Latency" (since it now reports either statistic); rows with turnCount < 10 show <ms> (p50); column header has a tooltip explaining the fallback.
  • Sparkline tooltip (Metric.tsx bucketHeadline): when no call in the bucket has ≥10 turns, the headline reads AVG LATENCY n/a (n<10 turns) rather than a fake "0 ms".
  • App headline card (App.tsx): avgP95() filters out calls with <10 turns, requires ≥3 qualifying calls, returns 0 otherwise. The card then renders "—" instead of "0 ms" so the empty state is unambiguous.
  • New test file dashboard-app/src/App.test.ts (8 cases) covers avgP95 gating and bucketHeadline fallback.
  • Re-built the SPA (vite build) and re-synced the inlined bundle into libraries/{typescript,python}/.../dashboard/ui.html via dashboard-app/scripts/sync.mjs so both SDKs ship the updated UI.
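
The per-call fallback rule used by the detail pane and call list can be sketched roughly as follows. Only the constant name and its value come from this PR; the `LatencyStats` and `Headline` shapes and the `latencyHeadline` helper are hypothetical names for illustration, not the actual component code.

```typescript
// Value taken from this PR; the real constant lives in the panel components.
const MIN_TURNS_FOR_PERCENTILES = 10;

// Hypothetical shape of the per-call latency summary.
interface LatencyStats {
  turnCount: number;
  p50Ms: number;
  p95Ms: number;
}

// Hypothetical shape of what a surface would render.
interface Headline {
  valueMs: number;
  label: string;
}

// Show p95 only when there are enough turns for it to be a tail estimate;
// otherwise fall back to p50 and say so in the label.
function latencyHeadline(stats: LatencyStats): Headline {
  if (stats.turnCount >= MIN_TURNS_FOR_PERCENTILES) {
    return { valueMs: stats.p95Ms, label: "p95" };
  }
  return {
    valueMs: stats.p50Ms,
    label: `p50 (n<${MIN_TURNS_FOR_PERCENTILES})`,
  };
}

const shortCall = latencyHeadline({ turnCount: 5, p50Ms: 309, p95Ms: 1977 });
const longCall = latencyHeadline({ turnCount: 12, p50Ms: 310, p95Ms: 820 });
console.log(shortCall, longCall);
```

Returning the label together with the value keeps the statistic and its caption from getting out of sync when a surface renders the cell.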

Breaking change?

No. The thresholds are internal to the dashboard SPA; SDK API surface is untouched. Users with long calls (≥10 turns) see no change; users with short calls now see honest p50 numbers instead of a noisy p95.

Test plan

  • dashboard-app: npm test -- --run → 16/16 pass (8 new + 8 existing)
  • dashboard-app: npm run lint (tsc --noEmit) clean
  • dashboard-app: npm run build succeeds, 208 kB bundle synced
  • libraries/typescript: npm test -- --run → 1516/1516 pass
  • libraries/typescript: npm run lint clean
  • Manual: open dashboard on a short (n<10) call and verify the detail pane shows p50 (n<10) with a tooltip
  • Manual: open dashboard on a long (n≥10) call and verify p95 surfaces as before

Docs updates

  • N/A — internal dashboard rendering change. CHANGELOG.md entry added under ## 0.6.1 (2026-05-12).

PR #82's 5 → 2 turn lowering reunited the per-call detail pane with
the call-list column, but on a live n=5 turn call the headline became
"p95=1977ms" while p50 was only 309ms — the 95th percentile collapses
to the slowest single sample at low n and stops being a tail estimate.

Threshold is now 10 turns everywhere (LatencyPanel, MetricsPanel,
CallTable, Metric tooltip, App "Avg latency p95" card). Below the
threshold every surface falls back to p50 — robust to a single
outlier — and labels the cell so the user knows why. App-level
"Avg latency p95" additionally requires ≥3 qualifying calls before
showing a number; otherwise the card renders "—" instead of a
polluted average.
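
A rough sketch of the app-level gating, under stated assumptions: the constant names, their values, the `avgP95` name and the "return 0, render a dash" behaviour come from the PR description, while the `CallSummary` shape, the function body and the `renderAvgP95Card` helper are hypothetical.

```typescript
// Values taken from the PR description.
const MIN_TURNS_FOR_AVG_P95 = 10;
const MIN_CALLS_FOR_AVG_P95 = 3;

// Hypothetical per-call summary; the real type in App.tsx may differ.
interface CallSummary {
  turnCount: number;
  p95Ms: number;
}

// Average p95 over calls with enough turns; 0 means "not enough data".
function avgP95(calls: CallSummary[]): number {
  const qualifying = calls.filter(
    (call) => call.turnCount >= MIN_TURNS_FOR_AVG_P95,
  );
  if (qualifying.length < MIN_CALLS_FOR_AVG_P95) {
    return 0;
  }
  const total = qualifying.reduce((sum, call) => sum + call.p95Ms, 0);
  return total / qualifying.length;
}

// Hypothetical render helper: an empty state shows a dash, never "0 ms".
function renderAvgP95Card(calls: CallSummary[]): string {
  const avg = avgP95(calls);
  return avg > 0 ? `${Math.round(avg)} ms` : "—";
}

const enoughCalls: CallSummary[] = [
  { turnCount: 12, p95Ms: 400 },
  { turnCount: 15, p95Ms: 500 },
  { turnCount: 20, p95Ms: 600 },
  { turnCount: 4, p95Ms: 1977 }, // short call, excluded from the average
];
console.log(renderAvgP95Card(enoughCalls));
console.log(renderAvgP95Card(enoughCalls.slice(0, 2)));
```

In this sketch the short call never enters the average, and two qualifying calls are not enough to show a number.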

The four exported MIN_TURNS_* constants are kept in lockstep, one per
surface, so the threshold cannot drift. Bundle re-synced to both SDKs
via dashboard-app/scripts/sync.mjs.

New tests: src/App.test.ts (8) covers avgP95 gating + bucketHeadline
fallback. dashboard-app vitest: 16/16 pass. libraries/typescript
vitest: 1516/1516 pass. tsc --noEmit clean on both packages.