Skip to content

feat(ws10): add /v1/metrics/pilot endpoint and pilot-latency script#112

Merged
stevei101 merged 9 commits into
developfrom
feat/ws10-metrics-pilot-endpoint
May 29, 2026
Merged

feat(ws10): add /v1/metrics/pilot endpoint and pilot-latency script#112
stevei101 merged 9 commits into
developfrom
feat/ws10-metrics-pilot-endpoint

Conversation

@principle-lgtm
Copy link
Copy Markdown
Contributor

@principle-lgtm principle-lgtm commented May 25, 2026

Closes #105.

Summary

  • GET /v1/metrics/pilot — single endpoint returning the 5 D1-backed pilot KPIs (task_completion_rate, mttr_p50_seconds, mttr_p95_seconds, context_reuse_rate, human_intervention_rate, event_throughput_per_sec) plus sample_counts. Tenant-scoped via x-tenant-id (401 if missing).
  • API latency emission + queryemit_pilot_latency writes one Analytics Engine data point per non-public request; scripts/pilot-latency.sh queries the dataset for p50/p95/p99.
  • Smoke + docsscripts/pilot-smoke.sh for reviewer / post-deploy sanity checks; full reference in docs/ws10/METRICS_ENDPOINT.md; one-liner pointer in docs/MISSION_PLAN.md.
  • Tests — 13 new in metrics::tests covering window parsing, percentile math, the MTTR state machine, CountTotal.rate edge cases, every assemble null/value branch, the genuine-zero-numerator case (must NOT be marked null), and serialized-shape invariants. 272 / 272 pass. cargo fmt and cargo clippy --lib --tests -- -D warnings clean.

Null-with-reason convention

A KPI serializes as null with an entry in a sibling null_reasons object when its denominator is zero. Zero is reserved for the genuine "numerator was actually zero" case. The null_reasons object is omitted entirely when no KPI is null.

The issue's example response didn't show null_reasons; it's added so go/no-go gates can distinguish "no data" from "0 rate". Documented in docs/ws10/METRICS_ENDPOINT.md.

Acceptance criteria checklist

  • GET /v1/metrics/pilot implemented in src/lib.rs (delegates to src/metrics.rs)
  • All six KPIs return real values, or null + reason when no data (never 0 as a stand-in)
  • Tenant-scoped — rejects requests without x-tenant-id (401)
  • Wrangler script for p50/p95/p99 from Analytics Engine (scripts/pilot-latency.sh), backed by per-request writeDataPoint emission in the Worker
  • Per-KPI unit tests (25 in metrics::tests) — pure helpers plus every assemble branch
  • One-liner in docs/MISSION_PLAN.md + full note in docs/ws10/METRICS_ENDPOINT.md

The acceptance list also mentions a "seeded D1 fixture integration test". The repo has no D1 fixture harness today and the worker::D1Database type is wasm-only, so adding one is a substantial standalone project. Filed as a WS10 follow-up; tracking in a separate issue. The assemble split makes that follow-up significantly easier when it lands — the assembly branches are already fully covered without D1.

Resolved open questions

  1. Context reuse rate — implemented as a stopgap: events_bronze rows where event_type='checkpoint_read' AND json_extract(payload, '$.hit') = 1, over total checkpoint_read rows. Docs spell out the producer contract and that this will refine before Phase 2 gates.
  2. Window aliasing24h and 1d resolve to the same window_seconds (86400). Echoed window is whatever the caller sent. Test pins this.
  3. Analytics Engine binding — the Worker now emits to it (emit_pilot_latency in src/lib.rs), so the binding is no longer dangling.

Commit walk

Commit What
795f762 Initial endpoint + module + latency-query script + docs scaffold
fdca999 Per-request writeDataPoint emission to PILOT_LATENCY
e97aac0 Split pilot into query_inputs + pure assemble; stopgap context_reuse_rate; +13 tests
c970c76 Smoke script + docs updates (stopgap definition, latency schema, smoke section)

Test plan

  • cargo test --lib — 272 pass (25 in metrics::tests)
  • cargo fmt --check clean
  • cargo clippy --lib --tests -- -D warnings clean
  • Shell scripts: bash -n clean for both scripts/pilot-smoke.sh and scripts/pilot-latency.sh
  • CF Workers Build green (wasm32 build) — local cargo check --target wasm32-unknown-unknown is broken on unmodified develop too, so wasm only runs on CI here
  • Smoke test against staging: scripts/pilot-smoke.sh --base-url https://data-fabric.stevedores.org --tenant-id <id>
  • After staging deploy, confirm scripts/pilot-latency.sh --window 1h returns non-empty rows (validates the emit path)

🤖 Generated with Claude Code

Five D1-backed KPIs (task completion, MTTR p50/p95, human intervention,
event throughput, plus deferred context-reuse) returned by a new
GET /v1/metrics/pilot. The sixth KPI (API latency) lives in Workers
Analytics Engine and is queried by scripts/pilot-latency.sh.

KPIs serialize as null with a sibling null_reasons entry when the
denominator is zero — never as 0, so pilot-week go/no-go gates don't
silently pass on missing data.

src/metrics.rs keeps the SQL-free helpers (window parsing, percentile,
MTTR delta extraction) pure-Rust so they're testable in-source; the
in-source tests cover window parsing, percentile math, and the MTTR
state machine (12 tests, all passing under cargo test --lib).

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 25, 2026

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
data-fabric-worker 479b72f Commit Preview URL May 29 2026, 01:30 PM

principle-lgtm and others added 3 commits May 25, 2026 17:36
Closes the loop on KPI #6: the fetch handler now writes one data point
per non-public request (path, method, tenant, elapsed ms, status) to
the PILOT_LATENCY binding, which scripts/pilot-latency.sh queries for
p50/p95/p99.

Best-effort emission: a missing binding (local dev) or transient sink
error swallows the failure, never turning a successful request into a
500. Public paths (/ and /health) are skipped.

Path is emitted as-is; templating high-cardinality routes (/v1/runs/:id)
is a documented follow-up.

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New CountTotal numerator/denominator type with .rate() returning
  Some(f64) when denom > 0, None otherwise.
- New PilotInputs intermediate struct holding what assemble needs.
- pilot() is now a thin compose over query_inputs (D1 IO) and assemble
  (pure). All null-vs-value branches live in assemble and have direct
  unit test coverage.

feat(ws10): implement stopgap context_reuse_rate

Definition: among events_bronze rows with event_type='checkpoint_read',
fraction with json_extract(payload, '$.hit') = 1. SQLite JSON1 returns 1
for a JSON true value. Null with reason when no checkpoint_read events
in window.

Producers don't yet contract on a "hit" field — rows without one count
as misses (conservative default). The KPI is expected to refine before
Phase 2 gates; docs/ws10/METRICS_ENDPOINT.md spells out the contract.

Adds 13 new tests (now 25 metrics tests, 272 total passing). Coverage:
CountTotal.rate edge cases, every assemble null/value branch, the
genuine zero-numerator case (must NOT be marked null), MTTR sort
robustness, serialized shape (null_reasons omitted when empty), and a
compile-time guard on query_inputs's signature.

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- scripts/pilot-smoke.sh — curl + jq sanity check for reviewers and
  post-deploy. Asserts HTTP 200 + presence of every top-level key and
  every KPI key. Exit codes: 0 ok, 2 prereqs, 3 request, 4 shape.
- docs/ws10/METRICS_ENDPOINT.md — drops the "always null" cell for
  context_reuse_rate, documents the stopgap definition and the implicit
  producer contract, adds a "Smoke test" section and an "API latency:
  how it gets into Analytics Engine" section with the data-point
  schema (index/blobs/doubles).
- scripts/pilot-latency.sh — adjusts the "no rows" hint now that the
  Worker is in fact emitting (the prior text claimed it wasn't).

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@principle-lgtm principle-lgtm marked this pull request as ready for review May 25, 2026 23:08
principle-lgtm and others added 3 commits May 25, 2026 21:43
Addresses self-review on PR #112:

#1 Concurrent D1 queries. query_inputs ran 5 sequential .await calls
   (~5 × RTT). Each helper is now an async fn polled together via
   futures_util::try_join! — Workers I/O interleaves, so the hot path
   drops to ~1 × RTT.

#2 Remove no-op signature test. query_inputs_function_exists_with_
   expected_signature defined an _assert helper but never invoked it;
   `let _ = query_inputs;` asserted nothing. Comment masquerading as a
   test. Dropped.

#4 UTF-8 safe parse_window. The previous `raw.split_at(raw.len() - 1)`
   panicked on a multi-byte last char (e.g. `?window=1🌒`). Switched to
   `chars().last()` and unit char matching; added a regression test.

#5 Cap MTTR event fetch at 50,000 rows. Unbounded fetch could OOM the
   Worker for a high-volume tenant over a long window. Constant
   MTTR_EVENT_FETCH_LIMIT documents the cap; docs note that p95 above
   the cap is approximate.

#6 Broaden context_reuse predicate. json_extract returns integer 1 for
   a JSON boolean true, but producers in the wild also emit string
   "true" with various casings. Predicate now accepts (1, 'true',
   'True', 'TRUE') — matches the conservative-default ethos.

Also drops CountTotal::zero() in favour of Default::default() now that
it's no longer reachable from the now-helper-split production path.

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
emit_pilot_latency was packing tenant_id into both index1 and blob3.
index1 already exposes tenant via AE's sampling key, so blob3 was a
wasted dimension. Now carries APP_ENV (dev / staging / production),
so a shared dashboard can split traffic by env without flipping
between three separate dataset bindings.

Reads APP_ENV via env.var("APP_ENV") before the router consumes env;
falls back to "unknown" if unset (same best-effort posture as the rest
of emission).

Docstring on emit_pilot_latency now pins the schema (index1, blob1-3,
double1-2) so it stays in sync with the Analytics Engine queries.

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#3 docs sync: blob3 now documents APP_ENV (was tenant_id duplicate).

#5 docs caveat: MTTR row cap surfaced in the KPI table — p50/p95
   approximate above 50,000 failure+recovery events in window.

#6 docs sync: context_reuse_rate predicate updated to reflect the
   broadened SQL (`IN (1, 'true', 'True', 'TRUE')`).

#7 emission scope: "one row per non-public request" was misleading —
   only requests that pass tenant authz reach the emit path; pre-router
   401 / 403 short-circuit before emission. Doc now says so.

#8 throughput caveat: event_throughput_per_sec averages over the
   *requested* window, not the active-data window. If the tenant's
   first ingest was 3d ago and the caller asks for ?window=7d, rate
   under-reports by ~3/7. Documented in the KPI table cell.

#10 pilot-smoke.sh: URL-encode --task-type via `jq -rn ... @uri`, so
    values like "ns:foo bar" produce a valid query string instead of
    a broken URL.

Refs #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevei101
Copy link
Copy Markdown
Contributor

Self-review findings addressed in 3 commits on top of c970c76. #9 is "not actionable; noted" per the review.

# Severity Change Commit
1 🚨 query_inputs now polls all 5 D1 queries concurrently via futures_util::try_join! — drops ~5×RTT → ~1×RTT on the hot path 090e2fd
2 🚨 Removed query_inputs_function_exists_with_expected_signature (was a no-op masquerading as a test) 090e2fd
3 🚨 emit_pilot_latency blob3 swapped from duplicated tenant_id to APP_ENV for cross-env filtering 2498bd4
4 nice parse_window now UTF-8 safe (chars().last()); regression test pins it 090e2fd
5 nice MTTR fetch capped at 50,000 rows (MTTR_EVENT_FETCH_LIMIT); docs caveat 090e2fd + 8afd7a6
6 nice context_reuse_rate predicate broadened: IN (1, 'true', 'True', 'TRUE') 090e2fd
7 nice Docs corrected: "per non-public request that passes tenant authz" 8afd7a6
8 nice Docs caveat: throughput averages over requested window, not active-data window 8afd7a6
9 js_sys::Date::now() 1ms resolution noted but not actionable
10 nice pilot-smoke.sh now URL-encodes --task-type via jq @uri 8afd7a6

cargo test --lib: 272 passing (added a UTF-8 regression test, removed the no-op; net same count). cargo fmt, cargo clippy --lib --tests -- -D warnings clean.

The prior build failed because the Analytics Engine binding
introduced in this PR required the feature to be enabled on the
account. Now enabled — trigger a fresh build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevei101 stevei101 mentioned this pull request May 27, 2026
3 tasks
CF Workers Build has failed on every commit of this PR (4 failures
across 2 days) despite local build + wrangler dry-run both succeeding.
Diagnosis requires CF dashboard log access; supervisor is time-pressed.

Pragmatic unblock: comment out the [[analytics_engine_datasets]] blocks
across the default + 3 env scopes. The src/lib.rs caller already does
env.analytics_engine("PILOT_LATENCY").ok() and treats the missing
binding as non-fatal — the comment there explicitly states "Missing
binding in local dev / tests is non-fatal — emission below is
best-effort."

Impact:
- /v1/metrics/pilot endpoint: unchanged (returns the 5 D1-backed KPIs)
- KPI #6 (latency p50/p95/p99 via scripts/pilot-latency.sh): becomes
  a no-op until AE is re-wired (no rows land in the dataset)
- Other 5 KPIs (task completion, MTTR p50/p95, human intervention,
  context reuse, event throughput): unaffected

A follow-up issue tracks re-enabling AE once CF account entitlement +
Workers Builds env permissions are debugged with dashboard access.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@stevei101
Copy link
Copy Markdown
Contributor

CF Workers Build fix — commit `479b72f`

Diagnosis (without CF dashboard log access):

Check Result
`cargo check --target wasm32-unknown-unknown`
`cargo clippy --target wasm32-unknown-unknown -- -D warnings`
`worker-build --release` locally ✅ (31.8kb `index.js`)
`bunx wrangler@3 deploy --dry-run` locally ✅ (all bindings recognized including AE)
CF Workers Build (cloud) — all 4 PR commits ❌ Deployment failed

Local toolchain produces a clean build with the AE binding present. The failure is specific to CF Workers Build's deploy environment — likely API token perms, AE entitlement propagation, wrangler version pin, or plan-level gating.

Fix: commented out the four `[[analytics_engine_datasets]]` blocks (default + 3 envs) in `wrangler.toml`. The worker code in `src/lib.rs` is unchanged — `env.analytics_engine("PILOT_LATENCY").ok()` already treats the missing binding as non-fatal, so the latency-emit closure silently becomes a no-op. KPI #6 (latency p50/p95/p99 via `scripts/pilot-latency.sh`) is deferred; the other 5 KPIs continue to flow.

Tracked: #115 captures the restoration work and the root-cause hypotheses to investigate via the CF dashboard.

Watching CI now.

@stevei101 stevei101 merged commit aa0cf7e into develop May 29, 2026
4 checks passed
@stevei101 stevei101 deleted the feat/ws10-metrics-pilot-endpoint branch May 29, 2026 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants