34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.24.0] - 2026-04-23

### Added

- **Payload-shape telemetry: `bytes_in`, `bytes_out`, `tokens_in`, `tokens_out` (PR #134)**. Four new doubles in the `oddkit_telemetry` Analytics Engine schema (`double3`–`double6`), measured per MCP request and written from a fire-and-forget `waitUntil` callback so user-facing latency is unchanged. Bytes are UTF-8 wire length via `TextEncoder`; tokens are cl100k_base counts via `gpt-tokenizer/encoding/cl100k_base` (chosen over `@anthropic-ai/tokenizer` after a 5-minute Node bench: ~6× faster median, dramatically better p95, ~432 KB gzipped via subpath import — see `workers/test/tokenize.test.mjs`). The schema grows from 2 doubles to 6 (full doubles array: `[count, duration_ms, bytes_in, bytes_out, tokens_in, tokens_out]`). The tokenizer is a module-level singleton, lazy-loaded via dynamic import and cached across requests within an isolate. The cold call parses the encoder once; warm-call cost is sub-millisecond on Node, the same V8 the Workers runtime uses. The bench-vs-prod comparison was validated via the fifth Managed Agent smoke at session `sesn_011CaMNujMg9pymcz18JFPp8` (`tokenization-smoke-managed` consumer label): `oddkit_catalog` → 21,437 bytes_out / 5,856 tokens_out; `oddkit_time` → 178 bytes_out / 71 tokens_out; the chars-per-token ratio (~3.7–4.5) is consistent with the bench's prediction across all observed payload sizes.
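  The byte half of that measurement needs no tokenizer — it is plain UTF-8 wire length via `TextEncoder`, as the entry states. A minimal sketch (the `byteLength` helper name is illustrative, not the shipped API):

  ```typescript
  // UTF-8 wire length, as used for bytes_in / bytes_out. ASCII characters
  // cost 1 byte each; accented and non-Latin characters cost more.
  const byteLength = (text: string): number =>
    new TextEncoder().encode(text).length;

  console.log(byteLength("hello")); // 5
  console.log(byteLength("héllo")); // 6 — "é" encodes to 2 bytes in UTF-8
  ```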

- **Telemetry write helpers in `workers/src/tokenize.ts` (PR #134)**. New `measurePayloadShape(requestText, responseText)` returns `PayloadShape` (the 4-field struct above) given two body strings. `countTokensSafe(text)` wraps the encoder in a try/catch and returns `null` on failure so the telemetry path never throws. The call site in `workers/src/index.ts` clones the response synchronously before `return response`, then reads + measures inside `ctx.waitUntil` — clone must be synchronous because the body is a one-shot stream that the runtime drains as soon as the handler returns.
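  The helper pair described above can be sketched as follows. The real encoder comes from `gpt-tokenizer/encoding/cl100k_base`; the whitespace-splitting `encode` stub here is a stand-in so the shape of the API is visible without the dependency:

  ```typescript
  interface PayloadShape {
    bytes_in: number;
    bytes_out: number;
    tokens_in: number | null;
    tokens_out: number | null;
  }

  // Stand-in for the cl100k_base encoder (the real code imports gpt-tokenizer).
  const encode = (text: string): string[] => text.split(/\s+/).filter(Boolean);

  // Never throws — the telemetry path must not break requests.
  function countTokensSafe(text: string): number | null {
    try {
      return encode(text).length;
    } catch {
      return null;
    }
  }

  function measurePayloadShape(requestText: string, responseText: string): PayloadShape {
    const bytes = (t: string) => new TextEncoder().encode(t).length;
    return {
      bytes_in: bytes(requestText),
      bytes_out: bytes(responseText),
      tokens_in: countTokensSafe(requestText),
      tokens_out: countTokensSafe(responseText),
    };
  }
  ```

  With the real encoder swapped in, token counts land in the ~3.7–4.5 chars-per-token band the smoke session observed.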

### Changed

- **No Content-Type filter on the response body (PR #134)**. The first iteration of payload-shape telemetry skipped any response whose Content-Type was not `application/json`, on the assumption that MCP responses would always be JSON. They are not — MCP's Streamable HTTP transport returns `text/event-stream` for tool calls, and the filter caused 100% of tool_call responses to record `bytes_out=0, tokens_out=0`. The filter was removed; the response body is now read regardless of Content-Type. SSE protocol overhead (~10 bytes per event) is negligible against the actual payload size, and oddkit's responses are bounded single-event streams that drain quickly. Telemetry is wrapped in a try/catch to preserve the non-breaking invariant for any future response that might fail to clone.
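  The clone-then-read pattern, independent of Content-Type, can be sketched with the standard Fetch API (`measureResponseBytes` is an illustrative name, and the SSE payload is made up):

  ```typescript
  async function measureResponseBytes(res: Response): Promise<number> {
    // Clone before anything consumes the one-shot body stream; the
    // original Response stays readable for the actual caller.
    const clone = res.clone();
    try {
      const body = await clone.text();
      return new TextEncoder().encode(body).length;
    } catch {
      return 0; // telemetry must never throw
    }
  }

  // SSE is measured exactly like JSON — no Content-Type check.
  const sse = new Response('data: {"ok":true}\n\n', {
    headers: { "content-type": "text/event-stream" },
  });
  measureResponseBytes(sse).then((n) => console.log(n)); // 19
  ```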

### Removed

- **`tokenize_ms` (formerly `double7`) — Workers runtime cannot measure it (PR #134)**. A previous iteration of the schema shipped a `tokenize_ms` field intended to capture the wall-clock cost of tokenization for bench-vs-prod comparison. Live smoke against the preview confirmed it always reads `0` in production. The cause is structural, not a bug: Cloudflare Workers freezes both `performance.now()` and `Date.now()` between network I/O events as a timing-side-channel mitigation (documented at `developers.cloudflare.com/workers/runtime-apis/web-standards/`). Tokenization is pure CPU work, so any sub-request timing of it from inside a Worker request handler is unmeasurable. The field was dropped from `PayloadShape`, the `writeDataPoint` doubles array, and the `telemetry-governance` canon doc. The bench at `workers/test/tokenize.test.mjs` characterized the cost curve once (cl100k handles 50 KB in ~1.3 ms on Node v22, the same V8 the Workers runtime uses); future per-call cost is predictable from observed `bytes_out` / `tokens_out` against that curve. See `klappy://canon/constraints/telemetry-governance` § "Why no tokenize_ms" for the published rationale.
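  What the frozen timers mean in practice: the two-timestamp measurement that is meaningless inside a Worker handler works fine on Node, which is where the bench ran. A sketch with a stand-in tokenizer (`tokenizeStub` is hypothetical; the real bench uses gpt-tokenizer):

  ```typescript
  // Node-side micro-timing — valid because Node does not freeze
  // performance.now() between I/O events the way Workers does.
  const tokenizeStub = (text: string): number[] =>
    Array.from(text, (c) => c.charCodeAt(0)); // stand-in for cl100k encode

  const input = "x".repeat(50_000); // ~50 KB, the top of the bench sweep
  const t0 = performance.now();
  tokenizeStub(input);
  const elapsed = performance.now() - t0;
  // elapsed is a real duration here. Inside a Worker request handler both
  // performance.now() and Date.now() return the same value before and
  // after pure CPU work, so the subtraction always yields 0.
  console.log(elapsed >= 0); // true
  ```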

### Fixed

- **Root `package-lock.json` version drift back-fill (this PR)**. Pre-bump state showed root `package-lock.json` at `0.23.0` while `workers/package-lock.json` was at `0.23.1` — root drifted one release behind. Both lockfiles are now bumped to `0.24.0` (top-level `version` and `packages[""].version`). The pre-commit hook enforces sync between `package.json` and `workers/package.json`; both `package-lock.json` files still require manual sync per current tooling.

### Refs

- PR (code): [klappy/oddkit#134](https://github.com/klappy/oddkit/pull/134)
- PR (canon): [klappy/klappy.dev#134](https://github.com/klappy/klappy.dev/pull/134) — telemetry-governance schema update, two new constraints (`measure-before-you-object`, `performed-prudence-anti-pattern`)
- Five Managed Agent smoke sessions (forensic record):
- `sesn_011CaMJdyWpUAm8n7YgRyLLG` — caught Content-Type filter dropping all SSE responses
- `sesn_011CaMKDLhT5zvUAUJ2HUvfW` — caught `clone()` inside `waitUntil` producing empty reader
- `sesn_011CaMLronGtL22J6R7fAPMs` — caught `performance.now()` frozen during synchronous CPU work
- `sesn_011CaMMf7tirAh2v5YoZHkxA` — caught `Date.now()` frozen too (both timers under deterministic-timing mitigation)
- `sesn_011CaMNujMg9pymcz18JFPp8` — **PASS** after dropping `tokenize_ms`; verified `bytes_in`/`bytes_out`/`tokens_in`/`tokens_out` populate with realistic varied values across tools
- Agent: `agent_011CaMJd8jvMj5CJMiQ11TdM`. Environment: `env_016RffZyqSdHeb5s3Z6UABw8`. Sonnet 4.6 throughout per `klappy://canon/constraints/release-validation-gate`.
- Canon basis: `klappy://canon/constraints/release-validation-gate`, `klappy://canon/constraints/telemetry-governance`, `klappy://canon/constraints/measure-before-you-object`, `klappy://canon/observations/performed-prudence-anti-pattern`.
- Tests: 7/7 unit (`workers/test/tokenize.test.mjs`), 6/6 integration (`workers/test/telemetry-integration.test.mjs`). Typecheck clean. Bench artifact at `workers/test/tokenize.test.mjs` (cl100k vs anthropic comparison, 200B–50KB sweep).

## [0.23.1] - 2026-04-21

### Fixed
2 changes: 2 additions & 0 deletions odd/ledger/learnings.jsonl
@@ -38,3 +38,5 @@
{"id":"learn-20260412-0001","timestamp":"2026-04-12T00:52:00Z","summary":"Standalone Worker tools (telemetry, time) bypass orchestrate pipeline — they share oddkit_ MCP prefix but register directly in createServer with their own handler. CLI parity requires adding to TOOLS array (auto-cascades) plus explicit param threading in cli.js and server.js","trigger":"architecture","impact":"New standalone tools need 5 files touched: index.ts (Worker registration), tool-registry.js (TOOLS entry), actions.js (handler), server.js (param threading), cli.js (param threading). The TOOLS auto-derivation handles enum/listing but not param plumbing.","confidence":0.95,"sources":["workers/src/index.ts","src/core/tool-registry.js","src/core/actions.js","src/mcp/server.js","src/cli.js"],"evidence":[{"type":"artifact","ref":"PR #87 — oddkit_time implementation across 5 files"}],"candidate_targets":[],"proposed_escalation":"none"}
{"id":"L39","timestamp":"2026-04-13T11:12:00Z","type":"learning","summary":"raw.githubusercontent.com URL parsing must rejoin all path segments after owner/repo to support branch names with slashes — parts[2] truncates multi-segment refs like publish/four-essays-and-skill to just publish","context":"extractBranchRef() and getZipUrl() in zip-baseline-fetcher.ts both used parts[2] which only captured the first segment of a slash-containing branch name, causing 404s on both SHA resolution and ZIP download","resolution":"Changed to parts.slice(2).join(\"/\") in both functions — minimal 2-line fix"}
{"type":"D","summary":"E0008 challenge governance refactor: replaced hardcoded detectClaimType logic in runChallengeAction with four governance-driven fetch functions (discoverChallengeTypes, fetchBasePrerequisites, fetchNormativeVocabulary, fetchStakesCalibration). Voice-dump suppression invariant is load-bearing — questionTiers.length === 0 short-circuits all output. Four new caches cleared in runCleanupStorage. tsc clean. PR #100.","rationale":"Hardcoded challenge logic cannot evolve with governance articles; governance-driven extraction means challenge behavior updates when articles update, no code change required. Mirrors PR #96 encode precedent exactly.","context":"workers/src/orchestrate.ts, branch feat/e0008-challenge-governance-driven, commit aa4445c","date":"2026-04-17"}
{"date": "2026-04-24", "epoch": "E0008", "task": "feat/telemetry-semantic-names", "summary": "TypeScript bundler moduleResolution omits .js extensions on local imports in compiled output \u2014 Node.js ESM resolver requires explicit .js suffix. When compiling telemetry.ts for integration tests, all compiled .js files in the build dir must be post-processed to add .js to extensionless relative imports. Patch all files in the build dir, not just telemetry.js.", "detail": "telemetry.ts now imports KnowledgeBaseFetcher (a value import, not just a type import) from zip-baseline-fetcher.ts. The existing integration test only compiled tokenize.ts and telemetry.ts. Adding zip-baseline-fetcher.ts to the tsconfig include list is necessary but insufficient \u2014 the compiled JS has extensionless imports (./zip-baseline-fetcher, ./tracing) that Node ESM cannot resolve. Must patch all .js files in the build dir with a regex replace of from \"./foo\" -> from \"./foo.js\".", "pr": "https://github.com/klappy/oddkit/pull/137"}
{"date": "2026-04-24", "epoch": "E0008", "task": "feat/telemetry-semantic-names", "summary": "JSDoc block comments must not contain */ sequences \u2014 they terminate the comment prematurely. Patterns like blob*/double* in a JSDoc comment cause TypeScript parse errors. Use blob1..9/double1..6 or similar notation instead.", "detail": "detectRawSlotNames JSDoc had blob*/double* which the TypeScript parser reads as end-of-comment at the first */. tsc reported TS1109 (Expression expected) at line 459. The fix is trivial but the error message is cryptic \u2014 the real cause is invisible until you stare at the raw characters.", "pr": "https://github.com/klappy/oddkit/pull/137"}
4 changes: 2 additions & 2 deletions package-lock.json


2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "oddkit",
- "version": "0.23.1",
+ "version": "0.24.0",
"description": "Agent-first CLI for ODD-governed repos. Epistemic terrain rendering with portable baseline.",
"type": "module",
"bin": {
11 changes: 9 additions & 2 deletions workers/package-lock.json


5 changes: 3 additions & 2 deletions workers/package.json
@@ -1,6 +1,6 @@
{
"name": "oddkit-mcp-worker",
- "version": "0.23.1",
+ "version": "0.24.0",
"private": true,
"type": "module",
"scripts": {
@@ -12,7 +12,8 @@
"dependencies": {
"agents": "^0.4.1",
"fflate": "^0.8.2",
- "zod": "^4.3.6"
+ "zod": "^4.3.6",
+ "gpt-tokenizer": "^3.0.0"
},
"devDependencies": {
"@cloudflare/workers-types": "^4.20250124.0",
76 changes: 62 additions & 14 deletions workers/src/index.ts
@@ -425,20 +425,31 @@ Use when:

Dataset: oddkit_telemetry (Cloudflare Analytics Engine)
Schema:
blob1 — event_type "mcp_request" | "tool_call"
blob2 — method JSON-RPC method (e.g. "tools/call")
blob3 — tool_name oddkit action (e.g. "orient", "search")
blob4 — consumer_label best-effort caller identity
blob5 — consumer_source how label was resolved (e.g. "user-agent")
blob6 — knowledge_base_url which knowledge base is being served
blob7 — document_uri for get calls, the klappy:// URI requested
blob8 — worker_version oddkit version string
double1 — count always 1
double2 — duration_ms request processing time
index1 — sampling_key consumer label
event_type — "mcp_request" | "tool_call"
method — JSON-RPC method (e.g. "tools/call")
tool_name — oddkit action (e.g. "orient", "search")
consumer_label — best-effort caller identity
consumer_source — how label was resolved (e.g. "user-agent")
knowledge_base_url — which knowledge base is being served
document_uri — for get calls, the klappy:// URI requested
worker_version — oddkit version string
cache_tier — which storage tier served the index
count — always 1 (use SUM for aggregation)
duration_ms — request processing time (full wall-clock at worker edge)
bytes_in — UTF-8 byte length of the request body
bytes_out — UTF-8 byte length of the response body (SSE bodies measured too)
tokens_in — cl100k_base token count of the request body
tokens_out — cl100k_base token count of the response body
index1 — sampling key (consumer label)

Use SUM(_sample_interval) instead of COUNT(*) to account for Analytics Engine sampling.
Time filter example: WHERE timestamp > NOW() - INTERVAL '30' DAY

Example — tool leaderboard:
SELECT tool_name, SUM(_sample_interval) AS calls FROM oddkit_telemetry WHERE timestamp > NOW() - INTERVAL '30' DAY GROUP BY tool_name ORDER BY calls DESC LIMIT 10

Example — payload shape by tool:
SELECT tool_name, AVG(tokens_out) AS avg_tokens_out FROM oddkit_telemetry WHERE timestamp > NOW() - INTERVAL '7' DAY GROUP BY tool_name ORDER BY avg_tokens_out DESC`,
{
sql: z.string().describe("Analytics Engine SQL query against the oddkit_telemetry dataset."),
},
@@ -958,14 +969,51 @@ export default {

// Phase 1 telemetry — non-blocking, fire-and-forget (E0008)
// Phase 1.5: cache_tier from tracer feeds blob9 (E0008.1)
// Phase 2: payload shape (bytes_in/out, tokens_in/out) feeds doubles 3-6.
// tokenize_ms was tried and dropped — Workers freezes both
// performance.now() and Date.now() during synchronous CPU work, making
// sub-request timing of pure-CPU tokenization unmeasurable. The response
// body is measured regardless of Content-Type — MCP's Streamable HTTP
// transport returns SSE, not JSON, so a Content-Type filter would (and
// did) drop almost every response. The helper handles clone failures safely.
if (telemetryClone) {
const durationMs = Date.now() - startTime;
// NOTE: Do NOT read tracer.indexSource here. The MCP handler returns
// a streaming Response — `await handler(...)` resolves with the
// Response object before the tool handler closure has finished
// running, so the tracer has not yet recorded the `index` span at
// this point. Reading here yields "none" for every tool. The tracer
// is only fully populated once the response body has been consumed
// (which forces the streaming tool handler to complete). The read
// therefore happens inside the waitUntil callback below, after
// `await responseClone.text()` resolves.
// Clone the response synchronously before returning so the body is
// still available to read inside the deferred waitUntil callback.
const responseClone = response.clone();

ctx.waitUntil(
(async () => {
try {
const requestText = await telemetryClone.text();

const { measurePayloadShape } = await import("./tokenize");
const { recordTelemetry } = await import("./telemetry");

let responseText = "";
try {
responseText = await responseClone.text();
} catch {
// Fall through with empty string; bytes_out / tokens_out will be 0.
}
// Read tracer.indexSource AFTER the response body has been
// consumed. By this point the streaming tool handler has
// completed and any "index" / "index-build" spans have been
// recorded. Reading earlier (e.g. immediately after `await
// handler()` returned) was the streaming-race bug that caused
// every tool call to record cache_tier="none" in production.
const cacheTier = tracer.indexSource;
const shape = await measurePayloadShape(requestText, responseText);
recordTelemetry(request, requestText, env, durationMs, cacheTier, shape);
} catch {
// Telemetry must never break MCP requests
}