feat(ws10): add /v1/metrics/pilot endpoint and pilot-latency script#112
Conversation
Five D1-backed KPIs (task completion, MTTR p50/p95, human intervention, event throughput, plus deferred context-reuse) returned by a new GET /v1/metrics/pilot. The sixth KPI (API latency) lives in Workers Analytics Engine and is queried by scripts/pilot-latency.sh. KPIs serialize as null with a sibling null_reasons entry when the denominator is zero — never as 0, so pilot-week go/no-go gates don't silently pass on missing data. src/metrics.rs keeps the SQL-free helpers (window parsing, percentile, MTTR delta extraction) pure-Rust so they're testable in-source; the in-source tests cover window parsing, percentile math, and the MTTR state machine (12 tests, all passing under cargo test --lib). Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
data-fabric-worker | 479b72f | Commit Preview URL | May 29 2026, 01:30 PM |
Closes the loop on KPI #6: the fetch handler now writes one data point per non-public request (path, method, tenant, elapsed ms, status) to the PILOT_LATENCY binding, which scripts/pilot-latency.sh queries for p50/p95/p99. Best-effort emission: a missing binding (local dev) or transient sink error swallows the failure, never turning a successful request into a 500. Public paths (/ and /health) are skipped. Path is emitted as-is; templating high-cardinality routes (/v1/runs/:id) is a documented follow-up. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New CountTotal numerator/denominator type with .rate() returning Some(f64) when denom > 0, None otherwise. - New PilotInputs intermediate struct holding what assemble needs. - pilot() is now a thin compose over query_inputs (D1 IO) and assemble (pure). All null-vs-value branches live in assemble and have direct unit test coverage. feat(ws10): implement stopgap context_reuse_rate Definition: among events_bronze rows with event_type='checkpoint_read', fraction with json_extract(payload, '$.hit') = 1. SQLite JSON1 returns 1 for a JSON true value. Null with reason when no checkpoint_read events in window. Producers don't yet contract on a "hit" field — rows without one count as misses (conservative default). The KPI is expected to refine before Phase 2 gates; docs/ws10/METRICS_ENDPOINT.md spells out the contract. Adds 13 new tests (now 25 metrics tests, 272 total passing). Coverage: CountTotal.rate edge cases, every assemble null/value branch, the genuine zero-numerator case (must NOT be marked null), MTTR sort robustness, serialized shape (null_reasons omitted when empty), and a compile-time guard on query_inputs's signature. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- scripts/pilot-smoke.sh — curl + jq sanity check for reviewers and post-deploy. Asserts HTTP 200 + presence of every top-level key and every KPI key. Exit codes: 0 ok, 2 prereqs, 3 request, 4 shape. - docs/ws10/METRICS_ENDPOINT.md — drops the "always null" cell for context_reuse_rate, documents the stopgap definition and the implicit producer contract, adds a "Smoke test" section and an "API latency: how it gets into Analytics Engine" section with the data-point schema (index/blobs/doubles). - scripts/pilot-latency.sh — adjusts the "no rows" hint now that the Worker is in fact emitting (the prior text claimed it wasn't). Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses self-review on PR #112: #1 Concurrent D1 queries. query_inputs ran 5 sequential .await calls (~5 × RTT). Each helper is now an async fn polled together via futures_util::try_join! — Workers I/O interleaves, so the hot path drops to ~1 × RTT. #2 Remove no-op signature test. query_inputs_function_exists_with_ expected_signature defined an _assert helper but never invoked it; `let _ = query_inputs;` asserted nothing. Comment masquerading as a test. Dropped. #4 UTF-8 safe parse_window. The previous `raw.split_at(raw.len() - 1)` panicked on a multi-byte last char (e.g. `?window=1🌒`). Switched to `chars().last()` and unit char matching; added a regression test. #5 Cap MTTR event fetch at 50,000 rows. Unbounded fetch could OOM the Worker for a high-volume tenant over a long window. Constant MTTR_EVENT_FETCH_LIMIT documents the cap; docs note that p95 above the cap is approximate. #6 Broaden context_reuse predicate. json_extract returns integer 1 for a JSON boolean true, but producers in the wild also emit string "true" with various casings. Predicate now accepts (1, 'true', 'True', 'TRUE') — matches the conservative-default ethos. Also drops CountTotal::zero() in favour of Default::default() now that it's no longer reachable from the now-helper-split production path. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
emit_pilot_latency was packing tenant_id into both index1 and blob3.
index1 already exposes tenant via AE's sampling key, so blob3 was a
wasted dimension. Now carries APP_ENV (dev / staging / production),
so a shared dashboard can split traffic by env without flipping
between three separate dataset bindings.
Reads APP_ENV via env.var("APP_ENV") before the router consumes env;
falls back to "unknown" if unset (same best-effort posture as the rest
of emission).
Docstring on emit_pilot_latency now pins the schema (index1, blob1-3,
double1-2) so it stays in sync with the Analytics Engine queries.
Refs #105
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#3 docs sync: blob3 now documents APP_ENV (was tenant_id duplicate). #5 docs caveat: MTTR row cap surfaced in the KPI table — p50/p95 approximate above 50,000 failure+recovery events in window. #6 docs sync: context_reuse_rate predicate updated to reflect the broadened SQL (`IN (1, 'true', 'True', 'TRUE')`). #7 emission scope: "one row per non-public request" was misleading — only requests that pass tenant authz reach the emit path; pre-router 401 / 403 short-circuit before emission. Doc now says so. #8 throughput caveat: event_throughput_per_sec averages over the *requested* window, not the active-data window. If the tenant's first ingest was 3d ago and the caller asks for ?window=7d, rate under-reports by ~3/7. Documented in the KPI table cell. #10 pilot-smoke.sh: URL-encode --task-type via `jq -rn ... @uri`, so values like "ns:foo bar" produce a valid query string instead of a broken URL. Refs #105 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Self-review findings addressed in 3 commits on top of
|
The prior build failed because the Analytics Engine binding introduced in this PR required the feature to be enabled on the account. Now enabled — trigger a fresh build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CF Workers Build has failed on every commit of this PR (4 failures
across 2 days) despite local build + wrangler dry-run both succeeding.
Diagnosis requires CF dashboard log access; supervisor is time-pressed.
Pragmatic unblock: comment out the [[analytics_engine_datasets]] blocks
across the default + 3 env scopes. The src/lib.rs caller already does
env.analytics_engine("PILOT_LATENCY").ok() and treats the missing
binding as non-fatal — the comment there explicitly states "Missing
binding in local dev / tests is non-fatal — emission below is
best-effort."
Impact:
- /v1/metrics/pilot endpoint: unchanged (returns the 5 D1-backed KPIs)
- KPI #6 (latency p50/p95/p99 via scripts/pilot-latency.sh): becomes
a no-op until AE is re-wired (no rows land in the dataset)
- Other 5 KPIs (task completion, MTTR p50/p95, human intervention,
context reuse, event throughput): unaffected
A follow-up issue tracks re-enabling AE once CF account entitlement +
Workers Builds env permissions are debugged with dashboard access.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CF Workers Build fix — commit `479b72f`Diagnosis (without CF dashboard log access):
Local toolchain produces a clean build with the AE binding present. The failure is specific to CF Workers Build's deploy environment — likely API token perms, AE entitlement propagation, wrangler version pin, or plan-level gating. Fix: commented out the four `[[analytics_engine_datasets]]` blocks (default + 3 envs) in `wrangler.toml`. The worker code in `src/lib.rs` is unchanged — `env.analytics_engine("PILOT_LATENCY").ok()` already treats the missing binding as non-fatal, so the latency-emit closure silently becomes a no-op. KPI #6 (latency p50/p95/p99 via `scripts/pilot-latency.sh`) is deferred; the other 5 KPIs continue to flow. Tracked: #115 captures the restoration work and the root-cause hypotheses to investigate via the CF dashboard. Watching CI now. |
Closes #105.
Summary
GET /v1/metrics/pilot— single endpoint returning the 5 D1-backed pilot KPIs (task_completion_rate,mttr_p50_seconds,mttr_p95_seconds,context_reuse_rate,human_intervention_rate,event_throughput_per_sec) plussample_counts. Tenant-scoped viax-tenant-id(401 if missing).emit_pilot_latencywrites one Analytics Engine data point per non-public request;scripts/pilot-latency.shqueries the dataset for p50/p95/p99.scripts/pilot-smoke.shfor reviewer / post-deploy sanity checks; full reference indocs/ws10/METRICS_ENDPOINT.md; one-liner pointer indocs/MISSION_PLAN.md.metrics::testscovering window parsing, percentile math, the MTTR state machine,CountTotal.rateedge cases, everyassemblenull/value branch, the genuine-zero-numerator case (must NOT be marked null), and serialized-shape invariants. 272 / 272 pass.cargo fmtandcargo clippy --lib --tests -- -D warningsclean.Null-with-reason convention
A KPI serializes as
nullwith an entry in a siblingnull_reasonsobject when its denominator is zero. Zero is reserved for the genuine "numerator was actually zero" case. Thenull_reasonsobject is omitted entirely when no KPI is null.The issue's example response didn't show
null_reasons; it's added so go/no-go gates can distinguish "no data" from "0 rate". Documented indocs/ws10/METRICS_ENDPOINT.md.Acceptance criteria checklist
GET /v1/metrics/pilotimplemented insrc/lib.rs(delegates tosrc/metrics.rs)x-tenant-id(401)scripts/pilot-latency.sh), backed by per-requestwriteDataPointemission in the Workermetrics::tests) — pure helpers plus everyassemblebranchdocs/MISSION_PLAN.md+ full note indocs/ws10/METRICS_ENDPOINT.mdThe acceptance list also mentions a "seeded D1 fixture integration test". The repo has no D1 fixture harness today and the
worker::D1Databasetype is wasm-only, so adding one is a substantial standalone project. Filed as a WS10 follow-up; tracking in a separate issue. Theassemblesplit makes that follow-up significantly easier when it lands — the assembly branches are already fully covered without D1.Resolved open questions
events_bronzerows whereevent_type='checkpoint_read'ANDjson_extract(payload, '$.hit') = 1, over totalcheckpoint_readrows. Docs spell out the producer contract and that this will refine before Phase 2 gates.24hand1dresolve to the samewindow_seconds(86400). Echoedwindowis whatever the caller sent. Test pins this.emit_pilot_latencyinsrc/lib.rs), so the binding is no longer dangling.Commit walk
795f762fdca999writeDataPointemission toPILOT_LATENCYe97aac0pilotintoquery_inputs+ pureassemble; stopgap context_reuse_rate; +13 testsc970c76Test plan
cargo test --lib— 272 pass (25 inmetrics::tests)cargo fmt --checkcleancargo clippy --lib --tests -- -D warningscleanbash -nclean for bothscripts/pilot-smoke.shandscripts/pilot-latency.shcargo check --target wasm32-unknown-unknownis broken on unmodifieddeveloptoo, so wasm only runs on CI herescripts/pilot-smoke.sh --base-url https://data-fabric.stevedores.org --tenant-id <id>scripts/pilot-latency.sh --window 1hreturns non-empty rows (validates the emit path)🤖 Generated with Claude Code