feat(ws10): add /v1/metrics/pilot endpoint and pilot-latency script by principle-lgtm · Pull Request #112 · stevedores-org/data-fabric

principle-lgtm · 2026-05-25T21:05:36Z

Closes #105.

Summary

GET /v1/metrics/pilot — single endpoint returning the 5 D1-backed pilot KPIs (task_completion_rate, mttr_p50_seconds, mttr_p95_seconds, context_reuse_rate, human_intervention_rate, event_throughput_per_sec) plus sample_counts. Tenant-scoped via x-tenant-id (401 if missing).
API latency emission + query — emit_pilot_latency writes one Analytics Engine data point per non-public request; scripts/pilot-latency.sh queries the dataset for p50/p95/p99.
Smoke + docs — scripts/pilot-smoke.sh for reviewer / post-deploy sanity checks; full reference in docs/ws10/METRICS_ENDPOINT.md; one-liner pointer in docs/MISSION_PLAN.md.
Tests — 13 new in metrics::tests covering window parsing, percentile math, the MTTR state machine, CountTotal.rate edge cases, every assemble null/value branch, the genuine-zero-numerator case (must NOT be marked null), and serialized-shape invariants. 272 / 272 pass. cargo fmt and cargo clippy --lib --tests -- -D warnings clean.

Null-with-reason convention

A KPI serializes as null with an entry in a sibling null_reasons object when its denominator is zero. Zero is reserved for the genuine "numerator was actually zero" case. The null_reasons object is omitted entirely when no KPI is null.

The issue's example response didn't show null_reasons; it's added so go/no-go gates can distinguish "no data" from "0 rate". Documented in docs/ws10/METRICS_ENDPOINT.md.

Acceptance criteria checklist

GET /v1/metrics/pilot implemented in src/lib.rs (delegates to src/metrics.rs)
All six KPIs return real values, or null + reason when no data (never 0 as a stand-in)
Tenant-scoped — rejects requests without x-tenant-id (401)
Wrangler script for p50/p95/p99 from Analytics Engine (scripts/pilot-latency.sh), backed by per-request writeDataPoint emission in the Worker
Per-KPI unit tests (25 in metrics::tests) — pure helpers plus every assemble branch
One-liner in docs/MISSION_PLAN.md + full note in docs/ws10/METRICS_ENDPOINT.md

The acceptance list also mentions a "seeded D1 fixture integration test". The repo has no D1 fixture harness today and the worker::D1Database type is wasm-only, so adding one is a substantial standalone project. Filed as a WS10 follow-up; tracking in a separate issue. The assemble split makes that follow-up significantly easier when it lands — the assembly branches are already fully covered without D1.

Resolved open questions

Context reuse rate — implemented as a stopgap: events_bronze rows where event_type='checkpoint_read' AND json_extract(payload, '$.hit') = 1, over total checkpoint_read rows. Docs spell out the producer contract and that this will refine before Phase 2 gates.
Window aliasing — 24h and 1d resolve to the same window_seconds (86400). Echoed window is whatever the caller sent. Test pins this.
Analytics Engine binding — the Worker now emits to it (emit_pilot_latency in src/lib.rs), so the binding is no longer dangling.

Commit walk

Commit	What
`795f762`	Initial endpoint + module + latency-query script + docs scaffold
`fdca999`	Per-request `writeDataPoint` emission to `PILOT_LATENCY`
`e97aac0`	Split `pilot` into `query_inputs` + pure `assemble`; stopgap context_reuse_rate; +13 tests
`c970c76`	Smoke script + docs updates (stopgap definition, latency schema, smoke section)

Test plan

cargo test --lib — 272 pass (25 in metrics::tests)
cargo fmt --check clean
cargo clippy --lib --tests -- -D warnings clean
Shell scripts: bash -n clean for both scripts/pilot-smoke.sh and scripts/pilot-latency.sh
CF Workers Build green (wasm32 build) — local cargo check --target wasm32-unknown-unknown is broken on unmodified develop too, so wasm only runs on CI here
Smoke test against staging: scripts/pilot-smoke.sh --base-url https://data-fabric.stevedores.org --tenant-id <id>
After staging deploy, confirm scripts/pilot-latency.sh --window 1h returns non-empty rows (validates the emit path)

🤖 Generated with Claude Code

Five D1-backed KPIs (task completion, MTTR p50/p95, human intervention, event throughput, plus deferred context-reuse) returned by a new GET /v1/metrics/pilot. The sixth KPI (API latency) lives in Workers Analytics Engine and is queried by scripts/pilot-latency.sh. KPIs serialize as null with a sibling null_reasons entry when the denominator is zero — never as 0, so pilot-week go/no-go gates don't silently pass on missing data. src/metrics.rs keeps the SQL-free helpers (window parsing, percentile, MTTR delta extraction) pure-Rust so they're testable in-source; the in-source tests cover window parsing, percentile math, and the MTTR state machine (12 tests, all passing under cargo test --lib). Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-05-25T21:10:57Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	data-fabric-worker	`479b72f`	Commit Preview URL	May 29 2026, 01:30 PM

Closes the loop on KPI #6: the fetch handler now writes one data point per non-public request (path, method, tenant, elapsed ms, status) to the PILOT_LATENCY binding, which scripts/pilot-latency.sh queries for p50/p95/p99. Best-effort emission: a missing binding (local dev) or transient sink error swallows the failure, never turning a successful request into a 500. Public paths (/ and /health) are skipped. Path is emitted as-is; templating high-cardinality routes (/v1/runs/:id) is a documented follow-up. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- New CountTotal numerator/denominator type with .rate() returning Some(f64) when denom > 0, None otherwise. - New PilotInputs intermediate struct holding what assemble needs. - pilot() is now a thin compose over query_inputs (D1 IO) and assemble (pure). All null-vs-value branches live in assemble and have direct unit test coverage. feat(ws10): implement stopgap context_reuse_rate Definition: among events_bronze rows with event_type='checkpoint_read', fraction with json_extract(payload, '$.hit') = 1. SQLite JSON1 returns 1 for a JSON true value. Null with reason when no checkpoint_read events in window. Producers don't yet contract on a "hit" field — rows without one count as misses (conservative default). The KPI is expected to refine before Phase 2 gates; docs/ws10/METRICS_ENDPOINT.md spells out the contract. Adds 13 new tests (now 25 metrics tests, 272 total passing). Coverage: CountTotal.rate edge cases, every assemble null/value branch, the genuine zero-numerator case (must NOT be marked null), MTTR sort robustness, serialized shape (null_reasons omitted when empty), and a compile-time guard on query_inputs's signature. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- scripts/pilot-smoke.sh — curl + jq sanity check for reviewers and post-deploy. Asserts HTTP 200 + presence of every top-level key and every KPI key. Exit codes: 0 ok, 2 prereqs, 3 request, 4 shape. - docs/ws10/METRICS_ENDPOINT.md — drops the "always null" cell for context_reuse_rate, documents the stopgap definition and the implicit producer contract, adds a "Smoke test" section and an "API latency: how it gets into Analytics Engine" section with the data-point schema (index/blobs/doubles). - scripts/pilot-latency.sh — adjusts the "no rows" hint now that the Worker is in fact emitting (the prior text claimed it wasn't). Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses self-review on PR #112: #1 Concurrent D1 queries. query_inputs ran 5 sequential .await calls (~5 × RTT). Each helper is now an async fn polled together via futures_util::try_join! — Workers I/O interleaves, so the hot path drops to ~1 × RTT. #2 Remove no-op signature test. query_inputs_function_exists_with_ expected_signature defined an _assert helper but never invoked it; `let _ = query_inputs;` asserted nothing. Comment masquerading as a test. Dropped. #4 UTF-8 safe parse_window. The previous `raw.split_at(raw.len() - 1)` panicked on a multi-byte last char (e.g. `?window=1🌒`). Switched to `chars().last()` and unit char matching; added a regression test. #5 Cap MTTR event fetch at 50,000 rows. Unbounded fetch could OOM the Worker for a high-volume tenant over a long window. Constant MTTR_EVENT_FETCH_LIMIT documents the cap; docs note that p95 above the cap is approximate. #6 Broaden context_reuse predicate. json_extract returns integer 1 for a JSON boolean true, but producers in the wild also emit string "true" with various casings. Predicate now accepts (1, 'true', 'True', 'TRUE') — matches the conservative-default ethos. Also drops CountTotal::zero() in favour of Default::default() now that it's no longer reachable from the now-helper-split production path. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

emit_pilot_latency was packing tenant_id into both index1 and blob3. index1 already exposes tenant via AE's sampling key, so blob3 was a wasted dimension. Now carries APP_ENV (dev / staging / production), so a shared dashboard can split traffic by env without flipping between three separate dataset bindings. Reads APP_ENV via env.var("APP_ENV") before the router consumes env; falls back to "unknown" if unset (same best-effort posture as the rest of emission). Docstring on emit_pilot_latency now pins the schema (index1, blob1-3, double1-2) so it stays in sync with the Analytics Engine queries. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#3 docs sync: blob3 now documents APP_ENV (was tenant_id duplicate). #5 docs caveat: MTTR row cap surfaced in the KPI table — p50/p95 approximate above 50,000 failure+recovery events in window. #6 docs sync: context_reuse_rate predicate updated to reflect the broadened SQL (`IN (1, 'true', 'True', 'TRUE')`). #7 emission scope: "one row per non-public request" was misleading — only requests that pass tenant authz reach the emit path; pre-router 401 / 403 short-circuit before emission. Doc now says so. #8 throughput caveat: event_throughput_per_sec averages over the *requested* window, not the active-data window. If the tenant's first ingest was 3d ago and the caller asks for ?window=7d, rate under-reports by ~3/7. Documented in the KPI table cell. #10 pilot-smoke.sh: URL-encode --task-type via `jq -rn ... @uri`, so values like "ns:foo bar" produce a valid query string instead of a broken URL. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stevei101 · 2026-05-26T02:49:55Z

Self-review findings addressed in 3 commits on top of c970c76. #9 is "not actionable; noted" per the review.

#	Severity	Change	Commit
1	🚨	`query_inputs` now polls all 5 D1 queries concurrently via `futures_util::try_join!` — drops ~5×RTT → ~1×RTT on the hot path	`090e2fd`
2	🚨	Removed `query_inputs_function_exists_with_expected_signature` (was a no-op masquerading as a test)	`090e2fd`
3	🚨	`emit_pilot_latency` blob3 swapped from duplicated `tenant_id` to `APP_ENV` for cross-env filtering	`2498bd4`
4	nice	`parse_window` now UTF-8 safe (`chars().last()`); regression test pins it	`090e2fd`
5	nice	MTTR fetch capped at 50,000 rows (`MTTR_EVENT_FETCH_LIMIT`); docs caveat	`090e2fd` + `8afd7a6`
6	nice	`context_reuse_rate` predicate broadened: `IN (1, 'true', 'True', 'TRUE')`	`090e2fd`
7	nice	Docs corrected: "per non-public request that passes tenant authz"	`8afd7a6`
8	nice	Docs caveat: throughput averages over requested window, not active-data window	`8afd7a6`
9	—	`js_sys::Date::now()` 1ms resolution noted but not actionable	—
10	nice	`pilot-smoke.sh` now URL-encodes `--task-type` via `jq @uri`	`8afd7a6`

cargo test --lib: 272 passing (added a UTF-8 regression test, removed the no-op; net same count). cargo fmt, cargo clippy --lib --tests -- -D warnings clean.

The prior build failed because the Analytics Engine binding introduced in this PR required the feature to be enabled on the account. Now enabled — trigger a fresh build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CF Workers Build has failed on every commit of this PR (4 failures across 2 days) despite local build + wrangler dry-run both succeeding. Diagnosis requires CF dashboard log access; supervisor is time-pressed. Pragmatic unblock: comment out the [[analytics_engine_datasets]] blocks across the default + 3 env scopes. The src/lib.rs caller already does env.analytics_engine("PILOT_LATENCY").ok() and treats the missing binding as non-fatal — the comment there explicitly states "Missing binding in local dev / tests is non-fatal — emission below is best-effort." Impact: - /v1/metrics/pilot endpoint: unchanged (returns the 5 D1-backed KPIs) - KPI #6 (latency p50/p95/p99 via scripts/pilot-latency.sh): becomes a no-op until AE is re-wired (no rows land in the dataset) - Other 5 KPIs (task completion, MTTR p50/p95, human intervention, context reuse, event throughput): unaffected A follow-up issue tracks re-enabling AE once CF account entitlement + Workers Builds env permissions are debugged with dashboard access. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

stevei101 · 2026-05-29T13:25:55Z

CF Workers Build fix — commit `479b72f`

Diagnosis (without CF dashboard log access):

Check	Result
`cargo check --target wasm32-unknown-unknown`	✅
`cargo clippy --target wasm32-unknown-unknown -- -D warnings`	✅
`worker-build --release` locally	✅ (31.8kb `index.js`)
`bunx wrangler@3 deploy --dry-run` locally	✅ (all bindings recognized including AE)
CF Workers Build (cloud) — all 4 PR commits	❌ Deployment failed

Local toolchain produces a clean build with the AE binding present. The failure is specific to CF Workers Build's deploy environment — likely API token perms, AE entitlement propagation, wrangler version pin, or plan-level gating.

Fix: commented out the four `[[analytics_engine_datasets]]` blocks (default + 3 envs) in `wrangler.toml`. The worker code in `src/lib.rs` is unchanged — `env.analytics_engine("PILOT_LATENCY").ok()` already treats the missing binding as non-fatal, so the latency-emit closure silently becomes a no-op. KPI #6 (latency p50/p95/p99 via `scripts/pilot-latency.sh`) is deferred; the other 5 KPIs continue to flow.

Tracked: #115 captures the restoration work and the root-cause hypotheses to investigate via the CF dashboard.

Watching CI now.

principle-lgtm and others added 3 commits May 25, 2026 17:36

principle-lgtm marked this pull request as ready for review May 25, 2026 23:08

principle-lgtm and others added 3 commits May 25, 2026 21:43

stevei101 mentioned this pull request May 27, 2026

release: merge develop into main #113

Merged

3 tasks

stevei101 mentioned this pull request May 29, 2026

Follow-up: re-enable Analytics Engine binding (PILOT_LATENCY) for KPI #6 #115

Open

5 tasks

stevei101 merged commit aa0cf7e into develop May 29, 2026
4 checks passed

stevei101 deleted the feat/ws10-metrics-pilot-endpoint branch May 29, 2026 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ws10): add /v1/metrics/pilot endpoint and pilot-latency script#112

feat(ws10): add /v1/metrics/pilot endpoint and pilot-latency script#112
stevei101 merged 9 commits into
developfrom
feat/ws10-metrics-pilot-endpoint

principle-lgtm commented May 25, 2026 •

edited

Loading

Uh oh!

cloudflare-workers-and-pages Bot commented May 25, 2026 •

edited

Loading

Uh oh!

stevei101 commented May 26, 2026

Uh oh!

stevei101 commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

principle-lgtm commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Null-with-reason convention

Acceptance criteria checklist

Resolved open questions

Commit walk

Test plan

Uh oh!

cloudflare-workers-and-pages Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying with Cloudflare Workers

Uh oh!

stevei101 commented May 26, 2026

Uh oh!

stevei101 commented May 29, 2026

CF Workers Build fix — commit `479b72f`

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

principle-lgtm commented May 25, 2026 •

edited

Loading

cloudflare-workers-and-pages Bot commented May 25, 2026 •

edited

Loading