From 6c5f921185ba1aec57644ac6a9313e3562583eb7 Mon Sep 17 00:00:00 2001
From: Sokratis Vidros
Date: Thu, 21 May 2026 07:42:00 +0300
Subject: [PATCH 01/21] docs: add OpenTelemetry instrumentation design (issue #34)

Records the v1 design decisions for adding OTel tracing support: a
first-party plugin shipped from pg-workflows with an optional peer dep
on @opentelemetry/api, a new wrap hook on WorkflowPlugin, per-execution
span lifetime, and cache-hit suppression for replayed steps. Metrics,
cross-execution context propagation, and DLQ spans are explicitly
deferred.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-05-21-otel-instrumentation-design.md | 202 ++++++++++++++++++
 1 file changed, 202 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md

diff --git a/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md b/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md
new file mode 100644
index 0000000..4fc6d31
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md
@@ -0,0 +1,202 @@
+# OpenTelemetry Instrumentation — Design
+
+- **Issue:** [#34](https://github.com/SokratisVidros/pg-workflows/issues/34)
+- **Status:** Approved for implementation
+- **Date:** 2026-05-21
+
+## Goal
+
+Allow pg-workflows users to emit OpenTelemetry traces for workflow and step execution, with zero runtime cost when unused.
+
+## Scope (v1)
+
+**In scope:**
+
+- A first-party plugin, `otelPlugin`, shipped from the `pg-workflows` package.
+- A `workflow.run` span per worker execution of a workflow run, with child spans for each step kind (`step.run`, `step.wait_for`, `step.delay`, `step.wait_until`, `step.pause`, `step.poll`, `step.invoke_child_workflow`).
+- Hierarchical traces via OpenTelemetry's AsyncLocalStorage active context (no manual context plumbing in user workflows).
+- Suppression of spans for cache-hit step replays.
+- Optional peer dependency on `@opentelemetry/api`. Non-users pay zero cost.
+
+**Out of scope for v1** (see [Out of scope](#out-of-scope-for-v1) below for rationale and deferral notes).
+
+## Decisions
+
+| Decision | Choice |
+| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| Distribution | First-party plugin in `pg-workflows`. Optional peer dep on `@opentelemetry/api`. |
+| Scope | Step spans + a parent `workflow.run` span (hierarchical traces). Metrics deferred. |
+| Span lifetime | One span per worker execution of the run. A long-paused workflow produces multiple traces, stitched via `workflow.id` / `workflow.run_id` attributes. |
+| Plugin hook shape | A new optional `wrap(context, next)` hook on `WorkflowPlugin`. Composes as middleware. Better fit for `tracer.startActiveSpan` than a before/after pair. |
+| Cache-hit replay handling | Skip spans for cache-hit step calls. Detected via `context.timeline[stepId]?.output !== undefined` (plus the invoke-child binding key for that step kind). |
+
+## Architecture
+
+### Plugin interface extension (`src/types.ts`)
+
+```ts
+export interface WorkflowPlugin {
+  name: string;
+  methods: (step: TStepBase, context: WorkflowContext) => TStepExt;
+  wrap?: (context: WorkflowContext, next: () => Promise) => Promise;
+}
+```
+
+`methods` gains a `context` argument so plugins can inspect the timeline for cache-hit detection. The change is additive — existing plugins that ignore the new arg compile unchanged.
+
+`wrap` is optional. When present, the engine inserts it into a middleware chain around the workflow handler invocation.
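A minimal sketch of why the extra `context` parameter can be additive, under the assumption of simplified stand-in types: `Ctx`, `Step`, and `Plugin` below are hypothetical and are not the real pg-workflows types. It relies only on the TypeScript rule that a callback may declare fewer parameters than the signature it is assigned to.

```typescript
// Hypothetical, simplified stand-ins for the real pg-workflows types.
type Timeline = Record<string, unknown>
interface Ctx {
  timeline: Timeline
}
interface Step {
  run: (id: string) => string
}
interface Plugin<TExt> {
  name: string
  methods: (step: Step, context: Ctx) => TExt
}

// Old-style plugin: declares only `step`. It is still assignable because
// TypeScript allows a function with fewer parameters than its target type.
const legacy: Plugin<{ shout: (id: string) => string }> = {
  name: 'legacy',
  methods: (step) => ({ shout: (id) => step.run(id).toUpperCase() }),
}

// New-style plugin: reads the timeline through the added `context` argument.
const aware: Plugin<{ seen: (id: string) => boolean }> = {
  name: 'aware',
  methods: (_step, context) => ({ seen: (id) => id in context.timeline }),
}

const baseStep: Step = { run: (id) => `ran:${id}` }
const ctx: Ctx = { timeline: { a: { output: 1 } } }

// The caller always passes both arguments; the legacy plugin ignores the second.
const legacyResult = legacy.methods(baseStep, ctx).shout('a')
const awareResult = aware.methods(baseStep, ctx).seen('a')
const missingResult = aware.methods(baseStep, ctx).seen('b')
```

Both plugins are invoked the same way, so the engine needs no per-plugin branching when it starts passing `context`.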
+
+### Engine wiring (`src/engine.ts`)
+
+Inside `handleWorkflowRun`, after composing `step` via `plugin.methods(step, context)`, the handler call site changes from:
+
+```ts
+const result = await workflow.handler(context);
+```
+
+to:
+
+```ts
+let next: () => Promise = () => workflow.handler(context);
+for (const plugin of [...plugins].reverse()) {
+  if (plugin.wrap) {
+    const inner = next;
+    next = () => plugin.wrap!(context, inner);
+  }
+}
+const result = await next();
+```
+
+Order rules: the first plugin passed to `.use()` is the outermost wrap. Multiple plugins compose as standard middleware.
+
+### OTel plugin (`src/plugins/otel.ts`)
+
+Exported from the package's main entry as `otelPlugin`.
+
+**Public API:**
+
+```ts
+import { otelPlugin } from 'pg-workflows';
+import { trace } from '@opentelemetry/api';
+
+const tracedWorkflow = workflow.use(otelPlugin({
+  tracer: trace.getTracer('my-app'), // optional; default: trace.getTracer('pg-workflows', VERSION)
+  spanNamePrefix: 'pg_workflows', // optional; default shown
+  attributes: (ctx) => ({ tenant: ctx.input.tenantId }), // optional; merged onto workflow.run span
+}));
+```
+
+**Behaviour:**
+
+- `wrap` opens a `${spanNamePrefix}.workflow.run` active span around `next()`. On thrown error: `span.recordException(err)`, `setStatus({ code: ERROR })`, re-throw. On clean return: `setStatus({ code: OK })`. Span ends in `finally`.
+- `methods` returns a step API where every method is wrapped to open `${spanNamePrefix}.step.` spans, but only when the corresponding timeline slot is empty.
+- All spans share parent context via `tracer.startActiveSpan`. AsyncLocalStorage handles propagation through `await` boundaries automatically.
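The ordering rule for the wrap chain can be checked in isolation. The following is a sketch only: `MiniPlugin`, `composeWraps`, and `logging` are hypothetical names introduced for illustration, and the context is reduced to a plain object. It uses the same reverse-iteration loop as the engine snippet and skips plugins that define no `wrap`.

```typescript
// Hypothetical, reduced plugin shape: only the parts needed for composition.
type Next = () => Promise<unknown>
interface MiniPlugin {
  name: string
  wrap?: (ctx: object, next: Next) => Promise<unknown>
}

// Build the chain from the inside out so the first plugin ends up outermost.
function composeWraps(plugins: MiniPlugin[], ctx: object, handler: Next): Next {
  let next = handler
  for (const plugin of [...plugins].reverse()) {
    const wrap = plugin.wrap
    if (!wrap) continue // plugins without a wrap hook are skipped
    const inner = next
    next = () => wrap(ctx, inner)
  }
  return next
}

const order: string[] = []

// A wrap that records when it is entered and left.
const logging = (name: string): MiniPlugin => ({
  name,
  wrap: async (_ctx, next) => {
    order.push(`${name}:before`)
    try {
      return await next()
    } finally {
      order.push(`${name}:after`)
    }
  },
})

const composed = composeWraps(
  [logging('outer'), { name: 'no-wrap' }, logging('inner')],
  {},
  async () => {
    order.push('handler')
    return 'done'
  },
)

// Resolves to the recorded call order once the chain has run.
const finished: Promise<string[]> = composed().then(() => order)
```

Capturing `plugin.wrap` in a local constant before building the closure avoids the non-null assertion and keeps the narrowing valid inside the arrow function.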
+
+### Span hierarchy and attributes
+
+```
+pg_workflows.workflow.run
+├── pg_workflows.step.run
+├── pg_workflows.step.wait_for
+├── pg_workflows.step.delay
+├── pg_workflows.step.wait_until
+├── pg_workflows.step.pause
+├── pg_workflows.step.poll
+└── pg_workflows.step.invoke_child_workflow
+```
+
+| Span | Attributes |
+| ------------------------------------ | --------------------------------------------------------------------------------------------------- |
+| `workflow.run` | `workflow.id`, `workflow.run_id`, `workflow.resource_id` (if present), `workflow.attempt` (= `run.retryCount`), plus anything from the user's `attributes(ctx)` callback |
+| `step.` (all kinds) | `step.id`, `step.type` (matches the `StepType` enum value) |
+| `step.invoke_child_workflow` | Plus `child.workflow_id`, `child.run_id` once the child run has been created |
+| Any span on error | `recordException(err)`, `setStatus({ code: ERROR, message })` |
+
+### Cache-hit suppression
+
+Before opening a span, each wrapped step method checks:
+
+```ts
+function isCachedHit(ctx: WorkflowContext, stepId: string, kind: StepType): boolean {
+  const entry = ctx.timeline[stepId];
+  if (entry && typeof entry === 'object' && 'output' in entry && (entry as any).output !== undefined) {
+    return true;
+  }
+  if (kind === StepType.INVOKE_CHILD_WORKFLOW) {
+    const binding = ctx.timeline[`__invokeChildWorkflow:${stepId}`];
+    if (binding) return true; // in-flight resume; will produce no new work this execution
+  }
+  return false;
+}
+```
+
+When cached, the wrapper passes through to the base step method without opening a span. The timeline snapshot is taken at handler entry, so steps completed during the *current* execution are still spanned correctly.
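The pass-through behaviour can be sketched without an OpenTelemetry dependency. Everything below is an illustrative assumption rather than the plugin's code: `instrumentRun`, `openSpan`, and `baseRun` are hypothetical, steps are synchronous for brevity (the real step methods are async), and a plain array stands in for the span exporter.

```typescript
// Hypothetical timeline shape: a step is cached once it has an `output`.
type Timeline = Record<string, { output?: unknown } | undefined>

function isCached(timeline: Timeline, stepId: string): boolean {
  return timeline[stepId]?.output !== undefined
}

// Wrap a base `run` so a span is opened only for steps that do real work.
function instrumentRun(
  timeline: Timeline,
  baseRun: (stepId: string, fn: () => string) => string,
  openSpan: (name: string, stepId: string) => () => void,
) {
  return (stepId: string, fn: () => string): string => {
    // Replay: delegate straight to the base method, no span.
    if (isCached(timeline, stepId)) return baseRun(stepId, fn)
    const end = openSpan('pg_workflows.step.run', stepId)
    try {
      return baseRun(stepId, fn)
    } finally {
      end()
    }
  }
}

// Snapshot taken at handler entry: `first` completed in an earlier execution.
const snapshot: Timeline = { first: { output: 'cached' } }
const spans: string[] = []

// Stand-in for the engine's step.run: returns the cached output when present.
const baseRun = (stepId: string, fn: () => string): string => {
  const hit = snapshot[stepId]?.output
  return typeof hit === 'string' ? hit : fn()
}

const run = instrumentRun(snapshot, baseRun, (name, stepId) => {
  return () => {
    spans.push(`${name}:${stepId}`)
  }
})

const replayed = run('first', () => 'recomputed')
const fresh = run('second', () => 'new')
```

Because the check reads the snapshot and not live state, a step that completes during the current execution is still recorded, which matches the behaviour described above.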
+
+### Packaging
+
+In `package.json`:
+
+```json
+"peerDependencies": {
+  "pg": "^8.0.0",
+  "@opentelemetry/api": "^1.9.0"
+},
+"peerDependenciesMeta": {
+  "@opentelemetry/api": { "optional": true }
+}
+```
+
+The OTel plugin file imports `@opentelemetry/api` directly. Users who never import `otelPlugin` never load this module, so the optional peer never resolves.
+
+Devs add `@opentelemetry/sdk-trace-base` to `devDependencies` for tests.
+
+## Testing
+
+Lives in `src/plugins/otel.test.ts`, runs in the existing unit suite (PGlite-backed).
+
+Test setup registers a `BasicTracerProvider` with an `InMemorySpanExporter` once per test, asserts against `exporter.getFinishedSpans()`.
+
+Cases:
+
+1. **Single-step happy path** — one `step.run` produces exactly 2 spans: `workflow.run` parent + `step.run` child. Attributes match. Both `OK`.
+2. **Multi-step with pause** — workflow runs `step1.run` → `step2.waitFor`. First execution emits `workflow.run` + `step1.run` + `step2.wait_for`. `triggerEvent` resumes; second execution emits a new `workflow.run` trace containing only the post-pause work (cached `step1` and the resumed `step2` emit no spans).
+3. **Step throws** — `step.run`'s handler throws. The `step.run` span has `ERROR` status with a recorded exception. The error propagates so `run.error` is persisted and pg-boss retry semantics are unchanged.
+4. **`invokeChildWorkflow` cache replay** — parent's `step.invoke_child_workflow` span is emitted on the pause execution. On the resume execution, the binding key is present and the cached output completes, so no span is emitted.
+5. **Plugin composition order** — register a trivial second wrap plugin alongside `otelPlugin` (in both orders) and assert wraps compose in `.use()` registration order.
+6. **Cache-hit predicate unit test** — direct test of the `isCachedHit` predicate against the timeline shapes produced by each step kind.
+
+## Documentation
+
+- New "Observability with OpenTelemetry" section in `README.md` with a ~10-line quickstart: register provider → `.use(otelPlugin())` → done.
+- JSDoc on `otelPlugin` listing all options and defaults.
+- Bullet under "Core API" in `AGENTS.md` pointing to the plugin.
+
+## Out of scope for v1
+
+These items appear in the original issue but are deferred. Documented here so they aren't lost.
+
+### Metrics
+
+The issue proposes `pg_workflows.workflow.started`, `pg_workflows.workflow.completed`, `pg_workflows.step.duration`, `pg_workflows.queue.depth`. These use OTel's metrics API (the `metrics` export of `@opentelemetry/api`), a separate surface from traces. They can layer onto the same plugin hooks added in v1, so the v1 plugin interface remains forward-compatible.
+
+`queue.depth` is harder than the rest — pg-boss does not expose a synchronous queue-size primitive; implementing it requires either polling `pgboss.job` or a counter maintained at enqueue/dequeue time. Defer until there is concrete demand.
+
+### Cross-execution trace context propagation
+
+When a workflow pauses and resumes, the resume execution gets a fresh root span — there is no link to the previous execution's trace beyond shared `workflow.run_id` attributes. Linking them would require persisting the trace context (`traceparent` header value) somewhere durable, e.g. in `workflow_runs.timeline` or a dedicated column.
+
+Same for `step.invoke_child_workflow`: child runs currently start a fresh root span rather than continuing the parent's trace.
+
+Both deferred together because they share the persistence design question.
+
+### `engine.startWorkflow` caller context propagation
+
+When an HTTP request invokes `engine.startWorkflow`, the request's incoming trace context is not propagated into the workflow run. Same persistence question as above; deferred together.
+
+### DLQ span emission
+
+`handleWorkflowRunDlq` runs outside the workflow's plugin chain (no handler invocation, no `context` object). DLQ-induced FAILED states therefore produce no `workflow.run` span. This is acceptable for v1 because the precipitating error is already recorded on the last per-execution `workflow.run` span via the catch path. Revisit if users report missing visibility on final-failure reconciliation.
+
+### Sampling, head-based vs tail-based decisions
+
+The plugin defers to the user's configured `TracerProvider` for sampling. No plugin-level sampling controls in v1.

From bd62d419dbd67605f0d882abbcb0a00733fb5bc8 Mon Sep 17 00:00:00 2001
From: Sokratis Vidros
Date: Thu, 21 May 2026 08:02:25 +0300
Subject: [PATCH 02/21] docs: add OTel instrumentation implementation plan (issue #34)

Step-by-step TDD plan for the design committed in be52240. 15 bite-sized
tasks covering: package wiring, plugin interface extension, engine wrap
chain, OTel plugin (workflow.run + step.* spans, cache-hit suppression,
error path), tests with InMemorySpanExporter, README and AGENTS docs.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../plans/2026-05-21-otel-instrumentation.md | 1577 +++++++++++++++++
 1 file changed, 1577 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-05-21-otel-instrumentation.md

diff --git a/docs/superpowers/plans/2026-05-21-otel-instrumentation.md b/docs/superpowers/plans/2026-05-21-otel-instrumentation.md
new file mode 100644
index 0000000..8ffdbcd
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-21-otel-instrumentation.md
@@ -0,0 +1,1577 @@
+# OpenTelemetry Instrumentation Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Ship a first-party `otelPlugin` that emits OpenTelemetry spans for workflow and step execution, with zero cost when not used.
+
+**Architecture:** Add an optional `wrap(context, next)` hook to `WorkflowPlugin` and pass `context` into `methods(step, context)`. The engine composes plugin wraps as middleware around the workflow handler. The OTel plugin opens one `workflow.run` span per execution via `wrap` and wraps every step method to open a child span, suppressing spans for cache-hit replays by inspecting `context.timeline`.
+
+**Tech Stack:** TypeScript ESM/CJS, Vitest (unit suite uses PGlite), Biome (no semicolons, single quotes), `@opentelemetry/api` (optional peer), `@opentelemetry/sdk-trace-base` + `@opentelemetry/context-async-hooks` (devDeps for tests).
+
+**Spec:** `docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md`
+
+---
+
+## File Map
+
+- **Create:** `src/plugins/otel.ts` — the plugin (~120 LOC).
+- **Create:** `src/plugins/otel.test.ts` — full test coverage (~300 LOC).
+- **Create:** `src/plugins/otel-test-helpers.ts` — tracer/exporter bootstrap shared by tests.
+- **Modify:** `src/types.ts` — extend `WorkflowPlugin` with `wrap?` and add `context` param to `methods`.
+- **Modify:** `src/engine.ts` — pass `context` to `plugin.methods`, compose `plugin.wrap` chain around handler call.
+- **Modify:** `src/index.ts` — export `otelPlugin`.
+- **Modify:** `package.json` — `@opentelemetry/api` optional peer dep, plus devDeps for testing.
+- **Modify:** `README.md` — add Observability section.
+- **Modify:** `AGENTS.md` — bullet under Core API.
+
+---
+
+## Task 1: Add OpenTelemetry dependencies
+
+**Files:**
+- Modify: `package.json`
+
+- [ ] **Step 1: Add `@opentelemetry/api` as optional peer dep and add devDeps**
+
+Edit `package.json`. Add to `peerDependencies`:
+
+```json
+"peerDependencies": {
+  "pg": "^8.0.0",
+  "@opentelemetry/api": "^1.9.0"
+}
+```
+
+Add new top-level `peerDependenciesMeta`:
+
+```json
+"peerDependenciesMeta": {
+  "@opentelemetry/api": { "optional": true }
+}
+```
+
+Add to `devDependencies` (keep alphabetical order):
+
+```json
+"@opentelemetry/api": "^1.9.0",
+"@opentelemetry/context-async-hooks": "^1.27.0",
+"@opentelemetry/sdk-trace-base": "^1.27.0"
+```
+
+- [ ] **Step 2: Install**
+
+Run: `npm install`
+Expected: lockfile updates; no errors.
+
+- [ ] **Step 3: Verify the rest of the build still works**
+
+Run: `npm run build`
+Expected: exits 0.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add package.json package-lock.json
+git commit -m "build: add OpenTelemetry deps for otelPlugin"
+```
+
+---
+
+## Task 2: Extend `WorkflowPlugin` interface in types.ts
+
+**Files:**
+- Modify: `src/types.ts:90-96`
+
+- [ ] **Step 1: Update `WorkflowPlugin` interface**
+
+In `src/types.ts`, replace the `WorkflowPlugin` interface:
+
+```ts
+/**
+ * Plugin that extends the workflow step API with extra methods.
+ * @template TStepBase - The step type this plugin receives (base + previous plugins).
+ * @template TStepExt - The extra methods this plugin adds to step.
+ */
+export interface WorkflowPlugin {
+  name: string
+  methods: (step: TStepBase, context: WorkflowContext) => TStepExt
+  /**
+   * Optional middleware around the workflow handler call. Composes in
+   * registration order — the first plugin passed to `.use()` wraps everything
+   * inside. Implementations MUST call `next()` exactly once.
+   */
+  wrap?: (context: WorkflowContext, next: () => Promise) => Promise
+}
+```
+
+- [ ] **Step 2: Run typecheck**
+
+Run: `npx tsc --noEmit`
+Expected: existing plugin tests in `src/engine.test.ts` still compile (their `methods: (step) => ({...})` is assignable to `(step, context) => ({...})` because TS allows passing fewer params).
+
+- [ ] **Step 3: Run unit suite**
+
+Run: `npm run test:unit`
+Expected: all existing tests pass; no behavioural change yet.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add src/types.ts
+git commit -m "feat(types): add wrap hook and context arg to WorkflowPlugin"
+```
+
+---
+
+## Task 3: Wire engine to pass context and compose wrap chain
+
+**Files:**
+- Modify: `src/engine.ts:1124-1140`
+- Modify: `src/engine.test.ts` (add wrap composition test)
+
+- [ ] **Step 1: Write the failing test for wrap composition**
+
+Append to `src/engine.test.ts` inside the `describe('workflow.use(plugin)', () => { ... })` block:
+
+```ts
+it('should call plugin.wrap around the handler and compose multiple wraps in registration order', async () => {
+  const calls: string[] = []
+
+  const outerPlugin: WorkflowPlugin = {
+    name: 'outer',
+    methods: () => ({}),
+    wrap: async (_ctx, next) => {
+      calls.push('outer:before')
+      const result = await next()
+      calls.push('outer:after')
+      return result
+    },
+  }
+
+  const innerPlugin: WorkflowPlugin = {
+    name: 'inner',
+    methods: () => ({}),
+    wrap: async (_ctx, next) => {
+      calls.push('inner:before')
+      const result = await next()
+      calls.push('inner:after')
+      return result
+    },
+  }
+
+  const engine = new WorkflowEngine({ workflows: [], pool: testPool, boss: testBoss })
+  await engine.start()
+
+  const wrapped = workflow
+    .use(outerPlugin)
+    .use(innerPlugin)('wrap-order-workflow', async ({ step }) => {
+      calls.push('handler')
+      await step.run('only-step', async () => 'ok')
+      return 'done'
+    })
+
+  await engine.registerWorkflow(wrapped)
+  const run = await engine.startWorkflow({ workflowId: 'wrap-order-workflow', input: {} })
+
+  await expect
+    .poll(async () => await engine.getRun({ runId: run.id }))
+    .toMatchObject({ status: WorkflowStatus.COMPLETED })
+
+  expect(calls).toEqual([
+    'outer:before',
+    'inner:before',
+    'handler',
+    'inner:after',
+    'outer:after',
+  ])
+
+  await engine.stop()
+})
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `npm run test:unit -- engine.test.ts -t "compose multiple wraps"`
+Expected: FAIL (wrap is not invoked yet).
+
+- [ ] **Step 3: Modify `handleWorkflowRun` to pass context and compose wraps**
+
+In `src/engine.ts`, locate the block that currently reads (around lines 1124–1140):
+
+```ts
+      let step = { ...baseStep };
+      const plugins = workflow.plugins ?? [];
+      for (const plugin of plugins) {
+        const extra = plugin.methods(step);
+        step = { ...step, ...extra };
+      }
+
+      const context: WorkflowContext = {
+        input: run.input as InferInputParameters,
+        workflowId: run.workflowId,
+        runId: run.id,
+        timeline: run.timeline,
+        logger: this.logger,
+        step,
+      };
+
+      const result = await workflow.handler(context);
+```
+
+Replace it with:
+
+```ts
+      const plugins = workflow.plugins ?? [];
+
+      const context: WorkflowContext = {
+        input: run.input as InferInputParameters,
+        workflowId: run.workflowId,
+        runId: run.id,
+        timeline: run.timeline,
+        logger: this.logger,
+        // step is populated below once plugins.methods has run
+        step: baseStep as WorkflowContext['step'],
+      };
+
+      let step = { ...baseStep };
+      for (const plugin of plugins) {
+        const extra = plugin.methods(step, context);
+        step = { ...step, ...extra };
+      }
+      context.step = step as WorkflowContext['step'];
+
+      let next: () => Promise = () => workflow.handler(context);
+      for (const plugin of [...plugins].reverse()) {
+        if (plugin.wrap) {
+          const inner = next;
+          const wrap = plugin.wrap;
+          next = () => wrap(context, inner);
+        }
+      }
+
+      const result = await next();
+```
+
+Rationale:
+- `context` is constructed before `plugin.methods` runs so methods can read `context.timeline` for cache-hit detection.
+- `context.step` is assigned the composed step API afterward (the same object the handler sees).
+- The wrap chain is built bottom-up: the last plugin's wrap is innermost, the first plugin's wrap is outermost. Plugins without `wrap` are skipped.
+
+- [ ] **Step 4: Run the failing test to verify it now passes**
+
+Run: `npm run test:unit -- engine.test.ts -t "compose multiple wraps"`
+Expected: PASS, with `calls` in the exact order asserted.
+
+- [ ] **Step 5: Run the full unit suite to confirm no regressions**
+
+Run: `npm run test:unit`
+Expected: all pass.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add src/engine.ts src/engine.test.ts
+git commit -m "feat(engine): compose plugin.wrap middleware and pass context to methods"
+```
+
+---
+
+## Task 4: Create OTel test helpers
+
+**Files:**
+- Create: `src/plugins/otel-test-helpers.ts`
+
+- [ ] **Step 1: Create the helper module**
+
+Create `src/plugins/otel-test-helpers.ts`:
+
+```ts
+import { AsyncHooksContextManager } from '@opentelemetry/context-async-hooks'
+import { context, type Tracer, trace } from '@opentelemetry/api'
+import {
+  BasicTracerProvider,
+  InMemorySpanExporter,
+  type ReadableSpan,
+  SimpleSpanProcessor,
+} from '@opentelemetry/sdk-trace-base'
+
+/**
+ * Build a fresh tracer + in-memory exporter for a single test.
+ * Callers MUST invoke `teardown()` in `afterEach`.
+ */
+export function setupOtel(): {
+  tracer: Tracer
+  getSpans: () => ReadableSpan[]
+  getSpansByName: (name: string) => ReadableSpan[]
+  teardown: () => Promise<void>
+} {
+  const exporter = new InMemorySpanExporter()
+  const provider = new BasicTracerProvider({
+    spanProcessors: [new SimpleSpanProcessor(exporter)],
+  })
+
+  // AsyncHooks context manager is required for nested step spans to attach
+  // to the workflow.run span across `await` boundaries. We register it
+  // globally because OTel's context API reads from the global manager.
+  const contextManager = new AsyncHooksContextManager().enable()
+  context.setGlobalContextManager(contextManager)
+
+  const tracer = provider.getTracer('pg-workflows-test')
+
+  return {
+    tracer,
+    getSpans: () => exporter.getFinishedSpans(),
+    getSpansByName: (name: string) =>
+      exporter.getFinishedSpans().filter((s) => s.name === name),
+    teardown: async () => {
+      await provider.shutdown()
+      contextManager.disable()
+      context.disable()
+      trace.disable()
+    },
+  }
+}
+```
+
+- [ ] **Step 2: Run typecheck**
+
+Run: `npx tsc --noEmit`
+Expected: clean.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add src/plugins/otel-test-helpers.ts
+git commit -m "test: add OTel test bootstrap helper"
+```
+
+---
+
+## Task 5: Create plugin skeleton
+
+**Files:**
+- Create: `src/plugins/otel.ts`
+- Create: `src/plugins/otel.test.ts`
+
+- [ ] **Step 1: Write the failing test — plugin registers and a workflow completes**
+
+Create `src/plugins/otel.test.ts`:
+
+```ts
+import type pg from 'pg'
+import type { PgBoss } from 'pg-boss'
+import { afterAll, afterEach, beforeAll, beforeEach, describe, expect, it } from 'vitest'
+import { workflow } from '../definition'
+import { WorkflowEngine } from '../engine'
+import { getBoss } from '../tests/pgboss'
+import { closeTestDatabase, createTestDatabase } from '../tests/test-db'
+import { WorkflowStatus } from '../types'
+import { otelPlugin } from './otel'
+import { setupOtel } from './otel-test-helpers'
+
+let testBoss: PgBoss
+let testPool: pg.Pool
+
+beforeAll(async () => {
+  testPool = await createTestDatabase()
+  testBoss = await getBoss(testPool)
+})
+
+afterAll(async () => {
+  await closeTestDatabase()
+})
+
+describe('otelPlugin', () => {
+  let otel: ReturnType<typeof setupOtel>
+  let engine: WorkflowEngine
+
+  beforeEach(async () => {
+    otel = setupOtel()
+    engine = new WorkflowEngine({ workflows: [], pool: testPool, boss: testBoss })
+    await engine.start()
+  })
+
+  afterEach(async () => {
+    await engine.stop()
+    await otel.teardown()
+  })
+
+  it('registers and lets a workflow complete', async () => {
+    const w = workflow.use(otelPlugin({ tracer: otel.tracer }))(
+      'otel-smoke',
+      async ({ step }) => {
+        return await step.run('only', async () => 'ok')
+      },
+    )
+    await engine.registerWorkflow(w)
+    const run = await engine.startWorkflow({ workflowId: 'otel-smoke', input: {} })
+    await expect
+      .poll(async () => await engine.getRun({ runId: run.id }))
+      .toMatchObject({ status: WorkflowStatus.COMPLETED, output: 'ok' })
+  })
+})
+```
+
+- [ ] **Step 2: Run test — should fail because `./otel` does not exist**
+
+Run: `npm run test:unit -- otel.test.ts`
+Expected: FAIL with module-not-found.
+
+- [ ] **Step 3: Create skeleton plugin**
+
+Create `src/plugins/otel.ts`:
+
+```ts
+import type { Tracer } from '@opentelemetry/api'
+import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types'
+
+export type OtelPluginOptions = {
+  /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */
+  tracer?: Tracer
+  /** Prefix for all span names. Defaults to `pg_workflows`. */
+  spanNamePrefix?: string
+  /** Extra attributes merged onto the workflow.run span. */
+  attributes?: (context: WorkflowContext) => Record
+}
+
+const DEFAULT_PREFIX = 'pg_workflows'
+
+export function otelPlugin(
+  _options: OtelPluginOptions = {},
+): WorkflowPlugin {
+  return {
+    name: 'opentelemetry',
+    methods: () => ({}),
+  }
+}
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `npm run test:unit -- otel.test.ts`
+Expected: PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/plugins/otel.ts src/plugins/otel.test.ts
+git commit -m "feat(otel): plugin skeleton"
+```
+
+---
+
+## Task 6: Expose `resourceId` and `attempt` on `WorkflowContext`
+
+**Files:**
+- Modify: `src/types.ts`
+- Modify: `src/engine.ts`
+
+The `workflow.run` span (next task) needs `workflow.resource_id` and `workflow.attempt`. The current `WorkflowContext` doesn't expose them. Add the fields and populate from `run` in the engine. This is a pure-additive refactor — no new behaviour yet.
+
+- [ ] **Step 1: Extend `WorkflowContext` type**
+
+In `src/types.ts`, find the `WorkflowContext` type (around line 102) and add two new fields:
+
+```ts
+export type WorkflowContext<
+  TInput extends InputParameters = InputParameters,
+  TStep extends StepBaseContext = StepBaseContext,
+> = {
+  input: InferInputParameters<TInput>
+  step: TStep
+  workflowId: string
+  runId: string
+  /** Tenant/scope identifier set when the run was started, if any. */
+  resourceId?: string
+  /** Zero-based retry attempt number (= `run.retryCount`). */
+  attempt: number
+  timeline: Record
+  logger: WorkflowLogger
+}
+```
+
+- [ ] **Step 2: Populate the new fields in the engine**
+
+In `src/engine.ts`, in the `context` construction inside `handleWorkflowRun` (added in Task 3), change to:
+
+```ts
+      const context: WorkflowContext = {
+        input: run.input as InferInputParameters,
+        workflowId: run.workflowId,
+        runId: run.id,
+        resourceId: run.resourceId ?? undefined,
+        attempt: run.retryCount,
+        timeline: run.timeline,
+        logger: this.logger,
+        step: baseStep as WorkflowContext['step'],
+      };
+```
+
+- [ ] **Step 3: Run typecheck and unit suite**
+
+Run: `npx tsc --noEmit && npm run test:unit`
+Expected: all pass; no behaviour change yet.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add src/types.ts src/engine.ts
+git commit -m "feat(types): expose resourceId and attempt on WorkflowContext"
+```
+
+---
+
+## Task 7: Implement `workflow.run` span (happy path)
+
+**Files:**
+- Modify: `src/plugins/otel.ts`
+- Modify: `src/plugins/otel.test.ts`
+
+- [ ] **Step 1: Add failing test for workflow.run span**
+
+Append to the `describe('otelPlugin', ...)` block in `src/plugins/otel.test.ts`:
+
+```ts
+it('emits a workflow.run span on successful completion', async () => {
+  const w = workflow.use(otelPlugin({ tracer: otel.tracer }))(
+    'otel-wf-span',
+    async () => 'done',
+  )
+  await engine.registerWorkflow(w)
+  const run = await engine.startWorkflow({
+    resourceId: 'tenant-1',
+    workflowId: 'otel-wf-span',
+    input: {},
+  })
+  await expect
+    .poll(async () => await engine.getRun({ runId: run.id, resourceId: 'tenant-1' }))
+    .toMatchObject({ status: WorkflowStatus.COMPLETED })
+
+  const spans = otel.getSpansByName('pg_workflows.workflow.run')
+  expect(spans).toHaveLength(1)
+  expect(spans[0].attributes).toMatchObject({
+    'workflow.id': 'otel-wf-span',
+    'workflow.run_id': run.id,
+    'workflow.resource_id': 'tenant-1',
+    'workflow.attempt': 0,
+  })
+  expect(spans[0].status.code).toBe(1) // SpanStatusCode.OK
+})
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `npm run test:unit -- otel.test.ts -t "workflow.run span on successful"`
+Expected: FAIL — span list is empty (plugin still has no `wrap`).
+
+- [ ] **Step 3: Implement `wrap` in the plugin**
+
+Replace the contents of `src/plugins/otel.ts` with:
+
+```ts
+import {
+  type AttributeValue,
+  SpanStatusCode,
+  type Tracer,
+  trace,
+} from '@opentelemetry/api'
+import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types'
+
+export type OtelPluginOptions = {
+  /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */
+  tracer?: Tracer
+  /** Prefix for all span names. Defaults to `pg_workflows`. */
+  spanNamePrefix?: string
+  /** Extra attributes merged onto the workflow.run span. */
+  attributes?: (context: WorkflowContext) => Record<string, AttributeValue>
+}
+
+const DEFAULT_PREFIX = 'pg_workflows'
+
+export function otelPlugin(
+  options: OtelPluginOptions = {},
+): WorkflowPlugin {
+  const tracer = options.tracer ?? trace.getTracer('pg-workflows')
+  const prefix = options.spanNamePrefix ?? DEFAULT_PREFIX
+  const extraAttrs = options.attributes
+
+  return {
+    name: 'opentelemetry',
+
+    methods: () => ({}),
+
+    wrap: (context, next) =>
+      tracer.startActiveSpan(
+        `${prefix}.workflow.run`,
+        {
+          attributes: {
+            'workflow.id': context.workflowId,
+            'workflow.run_id': context.runId,
+            'workflow.attempt': context.attempt,
+            ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}),
+            ...(extraAttrs ? extraAttrs(context) : {}),
+          },
+        },
+        async (span) => {
+          try {
+            const result = await next()
+            span.setStatus({ code: SpanStatusCode.OK })
+            return result
+          } finally {
+            span.end()
+          }
+        },
+      ),
+  }
+}
+```
+
+- [ ] **Step 4: Run the test**
+
+Run: `npm run test:unit -- otel.test.ts -t "workflow.run span on successful"`
+Expected: PASS.
+
+- [ ] **Step 5: Run full unit suite — confirm no regressions**
+
+Run: `npm run test:unit`
+Expected: all pass.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add src/plugins/otel.ts src/plugins/otel.test.ts
+git commit -m "feat(otel): emit workflow.run span via wrap hook"
+```
+
+---
+
+## Task 8: `workflow.run` span error path
+
+**Files:**
+- Modify: `src/plugins/otel.ts`
+- Modify: `src/plugins/otel.test.ts`
+
+- [ ] **Step 1: Write failing test**
+
+Append to `describe('otelPlugin', ...)` in `src/plugins/otel.test.ts`:
+
+```ts
+it('records exception and ERROR status on workflow.run when handler throws', async () => {
+  const w = workflow.use(otelPlugin({ tracer: otel.tracer }))(
+    'otel-wf-throw',
+    async ({ step }) => {
+      await step.run('boom', async () => {
+        throw new Error('kaboom')
+      })
+    },
+    { retries: 0 },
+  )
+  await engine.registerWorkflow(w)
+  const run = await engine.startWorkflow({ workflowId: 'otel-wf-throw', input: {} })
+  await expect
+    .poll(async () => await engine.getRun({ runId: run.id }))
+    .toMatchObject({ status: WorkflowStatus.FAILED })
+
+  const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0]
+  expect(wfSpan.status.code).toBe(2) // SpanStatusCode.ERROR
+  expect(wfSpan.status.message).toBe('kaboom')
+  expect(wfSpan.events.some((e) => e.name === 'exception')).toBe(true)
+})
+```
+
+- [ ] **Step 2: Run to confirm failure**
+
+Run: `npm run test:unit -- otel.test.ts -t "ERROR status on workflow.run"`
+Expected: FAIL — current `wrap` does not catch.
+
+- [ ] **Step 3: Update `wrap` to record exceptions**
+
+In `src/plugins/otel.ts`, replace the `wrap` arrow body:
+
+```ts
+    wrap: (context, next) =>
+      tracer.startActiveSpan(
+        `${prefix}.workflow.run`,
+        {
+          attributes: {
+            'workflow.id': context.workflowId,
+            'workflow.run_id': context.runId,
+            'workflow.attempt': context.attempt,
+            ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}),
+            ...(extraAttrs ? extraAttrs(context) : {}),
+          },
+        },
+        async (span) => {
+          try {
+            const result = await next()
+            span.setStatus({ code: SpanStatusCode.OK })
+            return result
+          } catch (err) {
+            const error = err instanceof Error ? err : new Error(String(err))
+            span.recordException(error)
+            span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
+            throw err
+          } finally {
+            span.end()
+          }
+        },
+      ),
+```
+
+- [ ] **Step 4: Run test — should pass**
+
+Run: `npm run test:unit -- otel.test.ts -t "ERROR status on workflow.run"`
+Expected: PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/plugins/otel.ts src/plugins/otel.test.ts
+git commit -m "feat(otel): record exception on workflow.run span on failure"
+```
+
+---
+
+## Task 9: `step.run` span with cache-hit suppression and error handling
+
+**Files:**
+- Modify: `src/plugins/otel.ts`
+- Modify: `src/plugins/otel.test.ts`
+
+- [ ] **Step 1: Write three failing tests**
+
+Append to `src/plugins/otel.test.ts`:
+
+```ts
+it('emits step.run span as a child of workflow.run', async () => {
+  const w = workflow.use(otelPlugin({ tracer: otel.tracer }))(
+    'otel-step-run-child',
+    async ({ step }) => {
+      return await step.run('foo', async () => 'bar')
+    },
+  )
+  await engine.registerWorkflow(w)
+  const run = await engine.startWorkflow({ workflowId: 'otel-step-run-child', input: {} })
+  await expect
+    .poll(async () => await engine.getRun({ runId: run.id }))
+    .toMatchObject({ status: WorkflowStatus.COMPLETED })
+
+  const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0]
+  const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0]
+  expect(stepSpan).toBeDefined()
+  expect(stepSpan.attributes).toMatchObject({ 'step.id': 'foo', 'step.type': 'run' })
+  // With the 1.x SDK pinned in Task 1, ReadableSpan exposes `parentSpanId`.
+  expect(stepSpan.parentSpanId).toBe(wfSpan.spanContext().spanId)
+})
+
+it('skips step.run span on cache-hit replay', async () => {
+  const w = workflow.use(otelPlugin({ tracer: otel.tracer }))(
+    'otel-cache-skip',
+    async ({ step }) => {
+      const a
= await step.run('first', async () => 'A') + await step.waitFor('gate', { eventName: 'go' }) + const b = await step.run('second', async () => 'B') + return { a, b } + }, + ) + await engine.registerWorkflow(w) + const run = await engine.startWorkflow({ workflowId: 'otel-cache-skip', input: {} }) + + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.PAUSED }) + + // First execution: workflow.run + step.run('first') + step.waitFor('gate') + expect(otel.getSpansByName('pg_workflows.step.run').map((s) => s.attributes['step.id'])).toEqual([ + 'first', + ]) + + await engine.triggerEvent({ runId: run.id, eventName: 'go' }) + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }) + + // Second execution: NEW workflow.run + step.run('second') only. + // 'first' is a cache hit and emits no span. + const stepRunSpans = otel.getSpansByName('pg_workflows.step.run') + const ids = stepRunSpans.map((s) => s.attributes['step.id']) + expect(ids).toEqual(['first', 'second']) + expect(otel.getSpansByName('pg_workflows.workflow.run')).toHaveLength(2) +}) + +it('records exception and ERROR status on step.run when handler throws', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-step-throw', + async ({ step }) => { + await step.run('explode', async () => { + throw new Error('nope') + }) + }, + { retries: 0 }, + ) + await engine.registerWorkflow(w) + const run = await engine.startWorkflow({ workflowId: 'otel-step-throw', input: {} }) + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.FAILED }) + + const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0] + expect(stepSpan.status.code).toBe(2) + expect(stepSpan.status.message).toBe('nope') + expect(stepSpan.events.some((e) => e.name === 'exception')).toBe(true) +}) +``` + +- [ ] **Step 2: Run tests 
to confirm they fail** + +Run: `npm run test:unit -- otel.test.ts -t "step.run"` +Expected: all three FAIL, because `methods` is still `() => ({})`. + +- [ ] **Step 3: Add a cache-hit predicate and step.run wrapper** + +In `src/plugins/otel.ts`, replace the file with this complete version: + +```ts +import { + type AttributeValue, + SpanStatusCode, + type Tracer, + trace, +} from '@opentelemetry/api' +import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types' + +export type OtelPluginOptions = { + /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */ + tracer?: Tracer + /** Prefix for all span names. Defaults to `pg_workflows`. */ + spanNamePrefix?: string + /** Extra attributes merged onto the workflow.run span. */ + attributes?: (context: WorkflowContext) => Record<string, AttributeValue> +} + +const DEFAULT_PREFIX = 'pg_workflows' + +function isCachedHit(timeline: Record<string, unknown>, stepId: string): boolean { + const entry = timeline[stepId] + if ( + entry && + typeof entry === 'object' && + 'output' in entry && + (entry as { output: unknown }).output !== undefined + ) { + return true + } + return false +} + +async function traceStep<T>( + tracer: Tracer, + name: string, + attrs: Record<string, AttributeValue>, + fn: () => Promise<T>, +): Promise<T> { + return tracer.startActiveSpan(name, { attributes: attrs }, async (span) => { + try { + const result = await fn() + span.setStatus({ code: SpanStatusCode.OK }) + return result + } catch (err) { + const error = err instanceof Error ? err : new Error(String(err)) + span.recordException(error) + span.setStatus({ code: SpanStatusCode.ERROR, message: error.message }) + throw err + } finally { + span.end() + } + }) +} + +export function otelPlugin( + options: OtelPluginOptions = {}, +): WorkflowPlugin { + const tracer = options.tracer ?? trace.getTracer('pg-workflows') + const prefix = options.spanNamePrefix ??
DEFAULT_PREFIX + const extraAttrs = options.attributes + + return { + name: 'opentelemetry', + + methods: (step, context) => ({ + run: async (stepId: string, handler: () => Promise) => { + if (isCachedHit(context.timeline, stepId)) { + return step.run(stepId, handler) + } + return traceStep( + tracer, + `${prefix}.step.run`, + { 'step.id': stepId, 'step.type': 'run' }, + () => step.run(stepId, handler), + ) + }, + }), + + wrap: (context, next) => + tracer.startActiveSpan( + `${prefix}.workflow.run`, + { + attributes: { + 'workflow.id': context.workflowId, + 'workflow.run_id': context.runId, + 'workflow.attempt': context.attempt, + ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}), + ...(extraAttrs ? extraAttrs(context) : {}), + }, + }, + async (span) => { + try { + const result = await next() + span.setStatus({ code: SpanStatusCode.OK }) + return result + } catch (err) { + const error = err instanceof Error ? err : new Error(String(err)) + span.recordException(error) + span.setStatus({ code: SpanStatusCode.ERROR, message: error.message }) + throw err + } finally { + span.end() + } + }, + ), + } +} +``` + +Note: `methods` overrides `run` only — `step.run` returns the existing base method otherwise (which lives on the `step` object passed in). The other base methods (`waitFor`, `pause`, etc.) are still accessible because the engine merges `extra` over `step` (see `src/engine.ts:1128-1129`); overriding `run` shadows only that one method. + +- [ ] **Step 4: Run all three step.run tests — they should pass** + +Run: `npm run test:unit -- otel.test.ts -t "step.run"` +Expected: PASS. + +- [ ] **Step 5: Run full unit suite — confirm no regressions** + +Run: `npm run test:unit` +Expected: all pass. 
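The `try`/`catch`/`finally` shape in `traceStep` is what ensures each span is ended exactly once and that the original thrown value, not the normalized `Error`, is what propagates to the engine. The sketch below shows the same bookkeeping against a hand-rolled span so it runs without `@opentelemetry/api`. `FakeSpan` and `withSpan` are hypothetical names, and the sketch is synchronous for brevity whereas the real helper is async.

```typescript
// Hypothetical in-memory span; the real plugin uses an OpenTelemetry Span.
class FakeSpan {
  status: 'UNSET' | 'OK' | 'ERROR' = 'UNSET'
  message: string | undefined = undefined
  exceptions: Error[] = []
  endCount = 0
  end(): void {
    this.endCount += 1
  }
}

// Synchronous analogue of traceStep's bookkeeping.
function withSpan<T>(span: FakeSpan, fn: () => T): T {
  try {
    const result = fn()
    span.status = 'OK'
    return result
  } catch (err) {
    // Normalize non-Error throws (strings, plain objects) before recording.
    const error = err instanceof Error ? err : new Error(String(err))
    span.exceptions.push(error)
    span.status = 'ERROR'
    span.message = error.message
    // Rethrow the original value so callers see exactly what was thrown.
    throw err
  } finally {
    // Runs on both paths, so the span is ended once either way.
    span.end()
  }
}

const okSpan = new FakeSpan()
const okResult = withSpan(okSpan, () => 'bar')

const errSpan = new FakeSpan()
let caught: unknown
try {
  withSpan(errSpan, (): string => {
    throw 'nope' // deliberately a non-Error throw
  })
} catch (err) {
  caught = err
}
```

Rethrowing `err` rather than `error` matters for retry logic that inspects the thrown value: the span records a normalized copy, while the engine still receives what the handler actually threw.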
+ +- [ ] **Step 6: Commit** + +```bash +git add src/plugins/otel.ts src/plugins/otel.test.ts +git commit -m "feat(otel): wrap step.run with span, cache-hit suppression, error path" +``` + +--- + +## Task 10: Spans for `waitFor`, `delay`, `waitUntil`, `pause` + +**Files:** +- Modify: `src/plugins/otel.ts` +- Modify: `src/plugins/otel.test.ts` + +- [ ] **Step 1: Write failing test** + +Append to `src/plugins/otel.test.ts`: + +```ts +it('emits spans for waitFor, delay, waitUntil, pause', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-other-steps', + async ({ step }) => { + await step.waitFor('wf', { eventName: 'evt' }) + await step.delay('d', '1ms') + await step.waitUntil('wu', new Date(Date.now() + 1)) + await step.pause('p') + return 'ok' + }, + ) + await engine.registerWorkflow(w) + const run = await engine.startWorkflow({ workflowId: 'otel-other-steps', input: {} }) + + // Workflow pauses immediately on first waitFor; resume it through completion. + const drive = async () => { + for (let i = 0; i < 20; i++) { + const r = await engine.getRun({ runId: run.id }) + if (r.status === WorkflowStatus.PAUSED) break + await new Promise((res) => setTimeout(res, 25)) + } + } + await drive() + await engine.triggerEvent({ runId: run.id, eventName: 'evt' }) + await drive() + // delay + waitUntil resolve themselves; pause needs an explicit resume + await engine.resumeWorkflow({ runId: run.id }) + await expect + .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) + .toMatchObject({ status: WorkflowStatus.COMPLETED }) + + const stepNames = otel + .getSpans() + .map((s) => s.name) + .filter((n) => n.startsWith('pg_workflows.step.')) + expect(stepNames).toEqual( + expect.arrayContaining([ + 'pg_workflows.step.waitFor', + 'pg_workflows.step.delay', + 'pg_workflows.step.waitUntil', + 'pg_workflows.step.pause', + ]), + ) + const waitForSpan = otel.getSpansByName('pg_workflows.step.waitFor')[0] + 
expect(waitForSpan.attributes).toMatchObject({ 'step.id': 'wf', 'step.type': 'waitFor' }) +}) +``` + +- [ ] **Step 2: Run test — should fail** + +Run: `npm run test:unit -- otel.test.ts -t "spans for waitFor"` +Expected: FAIL. + +- [ ] **Step 3: Extend `methods` with the four new wrappers** + +In `src/plugins/otel.ts`, replace the `methods` field of the returned plugin with: + +```ts + methods: (step, context) => ({ + run: async (stepId: string, handler: () => Promise) => { + if (isCachedHit(context.timeline, stepId)) { + return step.run(stepId, handler) + } + return traceStep( + tracer, + `${prefix}.step.run`, + { 'step.id': stepId, 'step.type': 'run' }, + () => step.run(stepId, handler), + ) + }, + waitFor: ((stepId: string, opts: Parameters[1]) => { + if (isCachedHit(context.timeline, stepId)) { + return step.waitFor(stepId, opts) + } + return traceStep( + tracer, + `${prefix}.step.waitFor`, + { 'step.id': stepId, 'step.type': 'waitFor' }, + () => step.waitFor(stepId, opts) as Promise, + ) + }) as StepBaseContext['waitFor'], + delay: async (stepId: string, duration: Parameters[1]) => { + if (isCachedHit(context.timeline, stepId)) { + return step.delay(stepId, duration) + } + await traceStep( + tracer, + `${prefix}.step.delay`, + { 'step.id': stepId, 'step.type': 'delay' }, + () => step.delay(stepId, duration), + ) + }, + waitUntil: ((stepId: string, dateOrOptions: Parameters[1]) => { + if (isCachedHit(context.timeline, stepId)) { + return step.waitUntil(stepId, dateOrOptions) + } + return traceStep( + tracer, + `${prefix}.step.waitUntil`, + { 'step.id': stepId, 'step.type': 'waitUntil' }, + () => step.waitUntil(stepId, dateOrOptions), + ) + }) as StepBaseContext['waitUntil'], + pause: async (stepId: string) => { + if (isCachedHit(context.timeline, stepId)) { + return step.pause(stepId) + } + await traceStep( + tracer, + `${prefix}.step.pause`, + { 'step.id': stepId, 'step.type': 'pause' }, + () => step.pause(stepId), + ) + }, + }), +``` + +The `as 
StepBaseContext['waitFor']` / `as StepBaseContext['waitUntil']` casts are required because both methods are overloaded — TypeScript can't infer the overload union from the implementation alone. + +- [ ] **Step 4: Run test** + +Run: `npm run test:unit -- otel.test.ts -t "spans for waitFor"` +Expected: PASS. + +- [ ] **Step 5: Run full unit suite** + +Run: `npm run test:unit` +Expected: all pass. + +- [ ] **Step 6: Commit** + +```bash +git add src/plugins/otel.ts src/plugins/otel.test.ts +git commit -m "feat(otel): wrap waitFor, delay, waitUntil, pause with spans" +``` + +--- + +## Task 11: `step.poll` span + +**Files:** +- Modify: `src/plugins/otel.ts` +- Modify: `src/plugins/otel.test.ts` + +- [ ] **Step 1: Write failing test** + +Append to `src/plugins/otel.test.ts`: + +```ts +it('emits step.poll span on each poll attempt', async () => { + let attempt = 0 + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-poll', + async ({ step }) => { + const result = await step.poll( + 'poller', + async () => { + attempt += 1 + return attempt >= 2 ? 
{ value: attempt } : false + }, + { interval: '30s', timeout: '60s' }, + ) + return result + }, + ) + await engine.registerWorkflow(w) + const run = await engine.startWorkflow({ workflowId: 'otel-poll', input: {} }) + + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.PAUSED }) + + // First execution emitted exactly one step.poll span + const firstPolls = otel.getSpansByName('pg_workflows.step.poll') + expect(firstPolls).toHaveLength(1) + expect(firstPolls[0].attributes).toMatchObject({ 'step.id': 'poller', 'step.type': 'poll' }) + + // Simulate the poll-interval re-fire via fastForwardWorkflow + await engine.fastForwardWorkflow({ runId: run.id }) + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }) + + // Second execution emits a new poll span (the previous one is not a cache hit + // because the step's *output* is not yet in timeline, only a poll-state entry) + expect(otel.getSpansByName('pg_workflows.step.poll').length).toBeGreaterThanOrEqual(2) +}) +``` + +- [ ] **Step 2: Run — should fail** + +Run: `npm run test:unit -- otel.test.ts -t "step.poll"` +Expected: FAIL. + +- [ ] **Step 3: Add `poll` wrapper to `methods`** + +In `src/plugins/otel.ts`, inside the `methods` returned object (Task 10), add after `pause`: + +```ts + poll: (async ( + stepId: string, + conditionFn: () => Promise, + pollOptions?: Parameters[2], + ) => { + if (isCachedHit(context.timeline, stepId)) { + return step.poll(stepId, conditionFn, pollOptions) + } + return traceStep( + tracer, + `${prefix}.step.poll`, + { 'step.id': stepId, 'step.type': 'poll' }, + () => step.poll(stepId, conditionFn, pollOptions), + ) + }) as StepBaseContext['poll'], +``` + +- [ ] **Step 4: Run tests** + +Run: `npm run test:unit -- otel.test.ts -t "step.poll"` +Expected: PASS. 
+ +- [ ] **Step 5: Commit** + +```bash +git add src/plugins/otel.ts src/plugins/otel.test.ts +git commit -m "feat(otel): wrap step.poll with span" +``` + +--- + +## Task 12: `step.invokeChildWorkflow` span with binding-key cache check + +**Files:** +- Modify: `src/plugins/otel.ts` +- Modify: `src/plugins/otel.test.ts` + +The cache-hit detection for `invokeChildWorkflow` is different: an in-flight child resume has a binding entry (`__invokeChildWorkflow:`) but no `[stepId].output` yet. We must skip the span in that case too. + +- [ ] **Step 1: Write failing test** + +Append to `src/plugins/otel.test.ts`: + +```ts +it('emits invokeChildWorkflow span on creation and skips on cache-hit resume', async () => { + const child = workflow('otel-child', async () => 'child-done') + await engine.registerWorkflow(child) + + const parent = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-parent', + async ({ step }) => { + const r = await step.invokeChildWorkflow('call-child', child) + return r + }, + ) + await engine.registerWorkflow(parent) + const run = await engine.startWorkflow({ workflowId: 'otel-parent', input: {} }) + + await expect + .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) + .toMatchObject({ status: WorkflowStatus.COMPLETED }) + + const invokeSpans = otel.getSpansByName('pg_workflows.step.invokeChildWorkflow') + expect(invokeSpans).toHaveLength(1) + expect(invokeSpans[0].attributes).toMatchObject({ + 'step.id': 'call-child', + 'step.type': 'invokeChildWorkflow', + }) +}) +``` + +The single-span assertion proves both behaviors: a span is emitted on the create-and-pause execution, and on the resume execution the cached binding (plus eventual cached output) prevents a duplicate span. + +- [ ] **Step 2: Run — should fail** + +Run: `npm run test:unit -- otel.test.ts -t "invokeChildWorkflow"` +Expected: FAIL. 
+ +- [ ] **Step 3: Import the binding-key helper and extend cache predicate** + +In `src/plugins/otel.ts`, add an import at the top: + +```ts +import { invokeChildWorkflowTimelineKey } from '../constants' +``` + +Replace `isCachedHit` with a kind-aware version: + +```ts +function isCachedHit( + timeline: Record, + stepId: string, + kind: 'run' | 'waitFor' | 'delay' | 'waitUntil' | 'pause' | 'poll' | 'invokeChildWorkflow', +): boolean { + const entry = timeline[stepId] + if ( + entry && + typeof entry === 'object' && + 'output' in entry && + (entry as { output: unknown }).output !== undefined + ) { + return true + } + if (kind === 'invokeChildWorkflow' && timeline[invokeChildWorkflowTimelineKey(stepId)]) { + return true + } + return false +} +``` + +Update every existing caller in `methods` to pass the new `kind` arg. Example for `run`: + +```ts + if (isCachedHit(context.timeline, stepId, 'run')) { + return step.run(stepId, handler) + } +``` + +Apply the same pattern to `waitFor` ('waitFor'), `delay` ('delay'), `waitUntil` ('waitUntil'), `pause` ('pause'), `poll` ('poll'). + +- [ ] **Step 4: Add the `invokeChildWorkflow` wrapper to `methods`** + +Inside the `methods` returned object, after `poll`, add: + +```ts + invokeChildWorkflow: (async ( + stepId: string, + refOrParams: Parameters[1], + inputArg?: unknown, + optionsArg?: unknown, + ) => { + if (isCachedHit(context.timeline, stepId, 'invokeChildWorkflow')) { + return (step.invokeChildWorkflow as ( + ...args: unknown[] + ) => Promise)(stepId, refOrParams, inputArg, optionsArg) + } + return traceStep( + tracer, + `${prefix}.step.invokeChildWorkflow`, + { 'step.id': stepId, 'step.type': 'invokeChildWorkflow' }, + () => + (step.invokeChildWorkflow as ( + ...args: unknown[] + ) => Promise)(stepId, refOrParams, inputArg, optionsArg), + ) + }) as StepBaseContext['invokeChildWorkflow'], +``` + +- [ ] **Step 5: Run test** + +Run: `npm run test:unit -- otel.test.ts -t "invokeChildWorkflow"` +Expected: PASS. 
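Why the binding key matters is easiest to see by walking the timeline through the successive worker executions of a parent run. The sketch below re-implements the kind-aware predicate standalone and evaluates it against each state. The `bindingKey` helper, its key format, and the `childRunId` field are assumptions for illustration only; the real key comes from `invokeChildWorkflowTimelineKey` in `src/constants`, whose exact format is not shown in this plan.

```typescript
type StepKind =
  | 'run'
  | 'waitFor'
  | 'delay'
  | 'waitUntil'
  | 'pause'
  | 'poll'
  | 'invokeChildWorkflow'
type Timeline = Record<string, unknown>

// Assumed key format, for illustration only.
const bindingKey = (stepId: string): string => `__invokeChildWorkflow:${stepId}`

function isCachedHit(timeline: Timeline, stepId: string, kind: StepKind): boolean {
  const entry = timeline[stepId]
  if (
    typeof entry === 'object' &&
    entry !== null &&
    'output' in entry &&
    (entry as { output: unknown }).output !== undefined
  ) {
    return true
  }
  // An in-flight child has a binding entry but no recorded output yet.
  return kind === 'invokeChildWorkflow' && Boolean(timeline[bindingKey(stepId)])
}

// Timeline as seen by each worker execution of the parent run.
const firstExecution: Timeline = {}
const resumeWhileChildRuns: Timeline = { [bindingKey('call-child')]: { childRunId: 'c1' } }
const resumeAfterChildDone: Timeline = {
  [bindingKey('call-child')]: { childRunId: 'c1' },
  'call-child': { output: 'child-done' },
}

// 'span' means the wrapper opens a span; 'skip' means it delegates untraced.
const spanDecisions = [firstExecution, resumeWhileChildRuns, resumeAfterChildDone].map(
  (timeline) => (isCachedHit(timeline, 'call-child', 'invokeChildWorkflow') ? 'skip' : 'span'),
)
```

Only the first execution yields `'span'`, which is what the single-span assertion in the test relies on. The same walk explains the `poll` behavior in Task 11: until the step's output is recorded, the predicate stays false and each execution opens a new span.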
+ +- [ ] **Step 6: Run full unit suite** + +Run: `npm run test:unit` +Expected: all pass. + +- [ ] **Step 7: Commit** + +```bash +git add src/plugins/otel.ts src/plugins/otel.test.ts +git commit -m "feat(otel): wrap step.invokeChildWorkflow with binding-aware cache check" +``` + +--- + +## Task 13: Cache-hit predicate unit test + +**Files:** +- Modify: `src/plugins/otel.test.ts` +- Modify: `src/plugins/otel.ts` (export `isCachedHit`) + +- [ ] **Step 1: Export `isCachedHit` from the plugin module** + +In `src/plugins/otel.ts`, change `function isCachedHit` to `export function isCachedHit`. + +- [ ] **Step 2: Write the unit test** + +Append to `src/plugins/otel.test.ts` *outside* the existing `describe('otelPlugin', ...)` block (top level inside the file): + +```ts +import { invokeChildWorkflowTimelineKey } from '../constants' +import { isCachedHit } from './otel' + +describe('isCachedHit', () => { + it('returns true when output is recorded for stepId', () => { + expect(isCachedHit({ s: { output: 'x', timestamp: new Date() } }, 's', 'run')).toBe(true) + }) + + it('returns false when output is undefined', () => { + expect(isCachedHit({ s: { output: undefined, timestamp: new Date() } }, 's', 'run')).toBe( + false, + ) + }) + + it('returns false when timeline has no entry for stepId', () => { + expect(isCachedHit({}, 's', 'run')).toBe(false) + }) + + it('returns false for non-object entry', () => { + expect(isCachedHit({ s: 'not-an-object' }, 's', 'run')).toBe(false) + }) + + it('returns true for invokeChildWorkflow when only the binding key is present', () => { + const timeline = { [invokeChildWorkflowTimelineKey('s')]: { invokeChildWorkflow: {} } } + expect(isCachedHit(timeline, 's', 'invokeChildWorkflow')).toBe(true) + expect(isCachedHit(timeline, 's', 'run')).toBe(false) + }) +}) +``` + +- [ ] **Step 3: Run tests** + +Run: `npm run test:unit -- otel.test.ts -t "isCachedHit"` +Expected: PASS (all 5 cases). 
+ +- [ ] **Step 4: Commit** + +```bash +git add src/plugins/otel.ts src/plugins/otel.test.ts +git commit -m "test(otel): direct coverage for isCachedHit predicate" +``` + +--- + +## Task 14: Plugin composition order with otelPlugin + +**Files:** +- Modify: `src/plugins/otel.test.ts` + +- [ ] **Step 1: Add composition test** + +Append to the `describe('otelPlugin', ...)` block in `src/plugins/otel.test.ts`: + +```ts +it('composes wrap with another plugin in registration order', async () => { + const calls: string[] = [] + const trackerPlugin: WorkflowPlugin = { + name: 'tracker', + methods: () => ({}), + wrap: async (_ctx, next) => { + calls.push('tracker:before') + const r = await next() + calls.push('tracker:after') + return r + }, + } + + const w = workflow + .use(trackerPlugin) + .use(otelPlugin({ tracer: otel.tracer }))('otel-compose', async () => 'ok') + await engine.registerWorkflow(w) + const run = await engine.startWorkflow({ workflowId: 'otel-compose', input: {} }) + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }) + + // tracker registered first, so its wrap is outermost — its before runs + // before the workflow.run span opens, and its after runs after the span ends. + const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0] + expect(wfSpan).toBeDefined() + expect(calls).toEqual(['tracker:before', 'tracker:after']) +}) +``` + +Add `import type { StepBaseContext, WorkflowPlugin } from '../types'` to the top of the file if not already present. + +- [ ] **Step 2: Run test** + +Run: `npm run test:unit -- otel.test.ts -t "composes wrap"` +Expected: PASS. 
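The ordering this test relies on follows from how the engine folds the `wrap` hooks: it iterates the plugins in reverse so that the first-registered hook ends up outermost. The following is a minimal sketch of that fold. `Wrap` and `composeWraps` are hypothetical names, the context argument is omitted, and the hooks are synchronous here for brevity while the engine's return promises.

```typescript
// Hypothetical simplified hook: receives `next` and returns its result.
type Wrap<R> = (next: () => R) => R

// Fold in reverse so that wraps[0] becomes the outermost layer.
function composeWraps<R>(wraps: Array<Wrap<R> | undefined>, handler: () => R): () => R {
  let next = handler
  for (const wrap of [...wraps].reverse()) {
    if (wrap) {
      const inner = next
      next = () => wrap(inner)
    }
  }
  return next
}

const calls: string[] = []
const layer =
  (name: string): Wrap<string> =>
  (next) => {
    calls.push(`${name}:before`)
    const result = next()
    calls.push(`${name}:after`)
    return result
  }

// The undefined entry stands for a plugin that defines no wrap hook.
const run = composeWraps([layer('tracker'), undefined, layer('otel')], () => {
  calls.push('handler')
  return 'ok'
})
const output = run()
```

Because `R` is generic, the same fold works unchanged when each hook returns a promise, which is the shape the plugin interface uses.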
+ +- [ ] **Step 3: Commit** + +```bash +git add src/plugins/otel.test.ts +git commit -m "test(otel): verify plugin composition order with another wrap" +``` + +--- + +## Task 15: Export `otelPlugin` and document + +**Files:** +- Modify: `src/index.ts` +- Modify: `README.md` +- Modify: `AGENTS.md` + +- [ ] **Step 1: Re-export from the main entry** + +In `src/index.ts`, add: + +```ts +export { otelPlugin, type OtelPluginOptions } from './plugins/otel' +``` + +- [ ] **Step 2: Add Observability section to README.md** + +In `README.md`, add a new top-level section near the existing API documentation (preserve the project's tone and heading level — `##`): + +````markdown +## Observability with OpenTelemetry + +pg-workflows ships a first-party plugin that emits OTel spans for workflow and step execution. `@opentelemetry/api` is an optional peer dependency — install it only if you want tracing. + +```bash +npm install @opentelemetry/api @opentelemetry/sdk-node +``` + +```ts +import { NodeSDK } from '@opentelemetry/sdk-node' +import { trace } from '@opentelemetry/api' +import { workflow, otelPlugin } from 'pg-workflows' + +// Initialize your OTel SDK however you normally do — for Node apps the +// NodeSDK registers an AsyncHooks context manager, which is required for +// hierarchical (parent/child) spans across async boundaries. +new NodeSDK({ /* exporters, resource, ... */ }).start() + +const tracedWorkflow = workflow.use(otelPlugin()) + +const myWorkflow = tracedWorkflow('checkout', async ({ step }) => { + await step.run('charge', async () => { /* ... */ }) + await step.waitFor('await-shipment', { eventName: 'shipped' }) +}) +``` + +The plugin emits a `pg_workflows.workflow.run` span per worker execution (one per resume cycle), with child spans per step kind (`pg_workflows.step.run`, `pg_workflows.step.waitFor`, etc.). Spans carry `workflow.id`, `workflow.run_id`, `workflow.attempt` and, where set, `workflow.resource_id`. 
Steps replayed from cache after a pause emit no spans. + +**Options:** + +```ts +otelPlugin({ + tracer: trace.getTracer('my-app'), // default: trace.getTracer('pg-workflows') + spanNamePrefix: 'pg_workflows', // default shown + attributes: (ctx) => ({ tenant: ctx.resourceId }), // extra attrs on workflow.run, computed per execution +}) +``` + +Metrics, distributed trace context propagation across child workflows, and HTTP-caller context propagation are not in v1; see [the design doc](docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md) for the deferral rationale. +```` + +- [ ] **Step 3: Add a subsection to AGENTS.md under Core API** + +In `AGENTS.md` (which is also `CLAUDE.md`), find the `## Core API` section. Add a new subsection after the existing `WorkflowEngine` block: + +````markdown +### `otelPlugin(options?)` - OpenTelemetry tracing + +```typescript +import { workflow, otelPlugin } from 'pg-workflows'; + +// Optional peer dep: install `@opentelemetry/api` and an OTel SDK (e.g. NodeSDK). +// One `pg_workflows.workflow.run` span per worker execution, with child spans +// per step kind. Steps replayed from cache after a pause emit no spans. +const tracedWorkflow = workflow.use(otelPlugin({ + // tracer?: Tracer // default: trace.getTracer('pg-workflows') + // spanNamePrefix?: string // default: 'pg_workflows' + // attributes?: (ctx) => Record<string, AttributeValue> +})); +``` +```` + +- [ ] **Step 4: Run full unit suite and build** + +Run: `npm run test:unit` +Expected: all pass. + +Run: `npm run build` +Expected: exits 0. + +Run: `npm run lint` +Expected: exits 0 (or run `npm run lint:fix` and re-stage if Biome flags formatting). + +- [ ] **Step 5: Commit** + +```bash +git add src/index.ts README.md AGENTS.md +git commit -m "feat(otel): export otelPlugin and document usage" +``` + +--- + +## Verification before declaring done + +- [ ] **Step 1: Full test suite passes** + +Run: `npm test` +Expected: unit + integration both green.
If integration requires a Postgres URL the user hasn't provided, run only `npm run test:unit` and note the gap. + +- [ ] **Step 2: Build cleanly** + +Run: `npm run clean && npm run build` +Expected: exits 0. `dist/` contains the plugin output. + +- [ ] **Step 3: Lint** + +Run: `npm run lint` +Expected: clean. Otherwise `npm run lint:fix` and re-stage anything modified. + +- [ ] **Step 4: Spec coverage walk-through** + +Open `docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md` and confirm every "In scope" bullet has a matching task. Confirm every "Out of scope for v1" bullet is documented in the README's deferral pointer. From f930809e58f5aeb317fb6e0f384a48e3337cbfe0 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 08:10:53 +0300 Subject: [PATCH 03/21] build: add OpenTelemetry deps for otelPlugin --- package-lock.json | 93 +++++++++++++++++++++++++++++++++++++++++++++++ package.json | 9 +++++ 2 files changed, 102 insertions(+) diff --git a/package-lock.json b/package-lock.json index be8a2c6..4d4faa3 100644 --- a/package-lock.json +++ b/package-lock.json @@ -19,6 +19,9 @@ "devDependencies": { "@biomejs/biome": "^2.3.10", "@electric-sql/pglite": "^0.3.14", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/context-async-hooks": "^1.27.0", + "@opentelemetry/sdk-trace-base": "^1.27.0", "@types/node": "^22.10.2", "@types/pg": "^8.11.10", "bunup": "^0.16.11", @@ -31,7 +34,13 @@ "node": ">=18.0.0" }, "peerDependencies": { + "@opentelemetry/api": "^1.9.0", "pg": "^8.0.0" + }, + "peerDependenciesMeta": { + "@opentelemetry/api": { + "optional": true + } } }, "node_modules/@babel/helper-string-parser": { @@ -360,6 +369,90 @@ "@emnapi/runtime": "^1.7.1" } }, + "node_modules/@opentelemetry/api": { + "version": "1.9.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/api/-/api-1.9.1.tgz", + "integrity": "sha512-gLyJlPHPZYdAk1JENA9LeHejZe1Ti77/pTeFm/nMXmQH/HFZlcS/O2XJB+L8fkbrNSqhdtlvjBVjxwUYanNH5Q==", + "dev": true, + "license": 
"Apache-2.0", + "engines": { + "node": ">=8.0.0" + } + }, + "node_modules/@opentelemetry/context-async-hooks": { + "version": "1.30.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/context-async-hooks/-/context-async-hooks-1.30.1.tgz", + "integrity": "sha512-s5vvxXPVdjqS3kTLKMeBMvop9hbWkwzBpu+mUO2M7sZtlkyDJGwFe33wRKnbaYDo8ExRVBIIdwIGrqpxHuKttA==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": ">=14" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/core": { + "version": "1.30.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-1.30.1.tgz", + "integrity": "sha512-OOCM2C/QIURhJMuKaekP3TRBxBKxG/TWWA0TL2J6nXUtDnuCtccy49LUJF8xPFXMX+0LMcxFpCo8M9cGY1W6rQ==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/semantic-conventions": "1.28.0" + }, + "engines": { + "node": ">=14" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/resources": { + "version": "1.30.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-1.30.1.tgz", + "integrity": "sha512-5UxZqiAgLYGFjS4s9qm5mBVo433u+dSPUFWVWXmLAD4wB65oMCoXaJP1KJa9DIYYMeHu3z4BZcStG3LC593cWA==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "1.30.1", + "@opentelemetry/semantic-conventions": "1.28.0" + }, + "engines": { + "node": ">=14" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-trace-base": { + "version": "1.30.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-trace-base/-/sdk-trace-base-1.30.1.tgz", + "integrity": "sha512-jVPgBbH1gCy2Lb7X0AVQ8XAfgg0pJ4nvl8/IiQA6nxOsPvS+0zMJaFSs2ltXe0J6C8dqjcnpyqINDJmU30+uOg==", + "dev": true, + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "1.30.1", + "@opentelemetry/resources": "1.30.1", + 
"@opentelemetry/semantic-conventions": "1.28.0" + }, + "engines": { + "node": ">=14" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/semantic-conventions": { + "version": "1.28.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/semantic-conventions/-/semantic-conventions-1.28.0.tgz", + "integrity": "sha512-lp4qAiMTD4sNWW4DbKLBkfiMZ4jbAboJIGOQr5DvciMRI494OapieI9qiODpOt0XBr1LjIDy1xAGAnVs5supTA==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": ">=14" + } + }, "node_modules/@oxc-minify/binding-android-arm64": { "version": "0.93.0", "resolved": "https://registry.npmjs.org/@oxc-minify/binding-android-arm64/-/binding-android-arm64-0.93.0.tgz", diff --git a/package.json b/package.json index d5183aa..cee1f6f 100644 --- a/package.json +++ b/package.json @@ -85,11 +85,20 @@ "typescript": "^5.9.3" }, "peerDependencies": { + "@opentelemetry/api": "^1.9.0", "pg": "^8.0.0" }, + "peerDependenciesMeta": { + "@opentelemetry/api": { + "optional": true + } + }, "devDependencies": { "@biomejs/biome": "^2.3.10", "@electric-sql/pglite": "^0.3.14", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/context-async-hooks": "^1.27.0", + "@opentelemetry/sdk-trace-base": "^1.27.0", "@types/node": "^22.10.2", "@types/pg": "^8.11.10", "bunup": "^0.16.11", From 3cb3becfb205bee9087563868273391297dac5c0 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 08:31:32 +0300 Subject: [PATCH 04/21] feat(types): add wrap hook and context arg to WorkflowPlugin --- src/engine.ts | 10 ++++++---- src/types.ts | 10 ++++++++-- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/src/engine.ts b/src/engine.ts index 2c06058..992faa4 100644 --- a/src/engine.ts +++ b/src/engine.ts @@ -1123,10 +1123,6 @@ export class WorkflowEngine { let step = { ...baseStep }; const plugins = workflow.plugins ?? 
[]; - for (const plugin of plugins) { - const extra = plugin.methods(step); - step = { ...step, ...extra }; - } const context: WorkflowContext = { input: run.input as InferInputParameters, @@ -1141,6 +1137,12 @@ export class WorkflowEngine { step, }; + for (const plugin of plugins) { + const extra = plugin.methods(step, context); + step = { ...step, ...extra }; + context.step = step; + } + const result = await workflow.handler(context); run = await this.getRun({ runId, resourceId: scopedResourceId }); diff --git a/src/types.ts b/src/types.ts index 618aa4b..0c85210 100644 --- a/src/types.ts +++ b/src/types.ts @@ -95,8 +95,14 @@ export type StepBaseContext = { * @template TStepExt - The extra methods this plugin adds to step. */ export interface WorkflowPlugin { - name: string; - methods: (step: TStepBase) => TStepExt; + name: string + methods: (step: TStepBase, context: WorkflowContext) => TStepExt + /** + * Optional middleware around the workflow handler call. Composes in + * registration order — the first plugin passed to `.use()` wraps everything + * inside. Implementations MUST call `next()` exactly once. + */ + wrap?: (context: WorkflowContext, next: () => Promise) => Promise } export type WorkflowContext< From 1c54bdc8de898d3996d0b3ca947986d5415eecbe Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 09:07:53 +0300 Subject: [PATCH 05/21] feat(engine): compose plugin.wrap middleware around handler Build a wrap chain from each plugin's optional wrap field in reverse registration order so that the first-registered plugin is outermost. Add a TDD test asserting the exact before/after call order. 
Co-Authored-By: Claude Sonnet 4.6 --- src/engine.test.ts | 55 ++++++++++++++++++++++++++++++++++++++++++++++ src/engine.ts | 11 +++++++++- 2 files changed, 65 insertions(+), 1 deletion(-) diff --git a/src/engine.test.ts b/src/engine.test.ts index f0c78b2..ed06663 100644 --- a/src/engine.test.ts +++ b/src/engine.test.ts @@ -279,6 +279,61 @@ describe('WorkflowEngine', () => { await engine.stop(); }); + + it('should call plugin.wrap around the handler and compose multiple wraps in registration order', async () => { + const calls: string[] = []; + + const outerPlugin: WorkflowPlugin = { + name: 'outer', + methods: () => ({}), + wrap: async (_ctx, next) => { + calls.push('outer:before'); + const result = await next(); + calls.push('outer:after'); + return result; + }, + }; + + const innerPlugin: WorkflowPlugin = { + name: 'inner', + methods: () => ({}), + wrap: async (_ctx, next) => { + calls.push('inner:before'); + const result = await next(); + calls.push('inner:after'); + return result; + }, + }; + + const engine = new WorkflowEngine({ workflows: [], pool: testPool, boss: testBoss }); + await engine.start(); + + const wrapped = workflow.use(outerPlugin).use(innerPlugin)( + 'wrap-order-workflow', + async ({ step }) => { + calls.push('handler'); + await step.run('only-step', async () => 'ok'); + return 'done'; + }, + ); + + await engine.registerWorkflow(wrapped); + const run = await engine.startWorkflow({ workflowId: 'wrap-order-workflow', input: {} }); + + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + expect(calls).toEqual([ + 'outer:before', + 'inner:before', + 'handler', + 'inner:after', + 'outer:after', + ]); + + await engine.stop(); + }); }); describe('unregisterWorkflow()', () => { diff --git a/src/engine.ts b/src/engine.ts index 992faa4..f443e2e 100644 --- a/src/engine.ts +++ b/src/engine.ts @@ -1143,7 +1143,16 @@ export class WorkflowEngine { context.step = step; } - const 
result = await workflow.handler(context); + let next: () => Promise = () => workflow.handler(context); + for (const plugin of [...plugins].reverse()) { + if (plugin.wrap) { + const inner = next; + const wrap = plugin.wrap; + next = () => wrap(context, inner); + } + } + + const result = await next(); run = await this.getRun({ runId, resourceId: scopedResourceId }); From 1d62ef03dc0d2c4df57e3290ea2ad05ddefe35e3 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 15:15:46 +0300 Subject: [PATCH 06/21] test: add OTel test bootstrap helper --- src/plugins/otel-test-helpers.ts | 44 ++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) create mode 100644 src/plugins/otel-test-helpers.ts diff --git a/src/plugins/otel-test-helpers.ts b/src/plugins/otel-test-helpers.ts new file mode 100644 index 0000000..1524543 --- /dev/null +++ b/src/plugins/otel-test-helpers.ts @@ -0,0 +1,44 @@ +import { context, type Tracer, trace } from '@opentelemetry/api'; +import { AsyncHooksContextManager } from '@opentelemetry/context-async-hooks'; +import { + BasicTracerProvider, + InMemorySpanExporter, + type ReadableSpan, + SimpleSpanProcessor, +} from '@opentelemetry/sdk-trace-base'; + +/** + * Build a fresh tracer + in-memory exporter for a single test. + * Callers MUST invoke `teardown()` in `afterEach`. + */ +export function setupOtel(): { + tracer: Tracer; + getSpans: () => ReadableSpan[]; + getSpansByName: (name: string) => ReadableSpan[]; + teardown: () => Promise; +} { + const exporter = new InMemorySpanExporter(); + const provider = new BasicTracerProvider({ + spanProcessors: [new SimpleSpanProcessor(exporter)], + }); + + // AsyncHooks context manager is required for nested step spans to attach + // to the workflow.run span across `await` boundaries. We register it + // globally because OTel's context API reads from the global manager. 
+ const contextManager = new AsyncHooksContextManager().enable(); + context.setGlobalContextManager(contextManager); + + const tracer = provider.getTracer('pg-workflows-test'); + + return { + tracer, + getSpans: () => exporter.getFinishedSpans(), + getSpansByName: (name: string) => exporter.getFinishedSpans().filter((s) => s.name === name), + teardown: async () => { + await provider.shutdown(); + contextManager.disable(); + context.disable(); + trace.disable(); + }, + }; +} From a97b13b2bcf200e1940d941337f22dc6f9d4f3ef Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 15:44:15 +0300 Subject: [PATCH 07/21] feat(otel): plugin skeleton --- src/plugins/otel.test.ts | 49 ++++++++++++++++++++++++++++++++++++++++ src/plugins/otel.ts | 22 ++++++++++++++++++ 2 files changed, 71 insertions(+) create mode 100644 src/plugins/otel.test.ts create mode 100644 src/plugins/otel.ts diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts new file mode 100644 index 0000000..330cd7d --- /dev/null +++ b/src/plugins/otel.test.ts @@ -0,0 +1,49 @@ +import type pg from 'pg'; +import type { PgBoss } from 'pg-boss'; +import { afterAll, afterEach, beforeAll, beforeEach, describe, expect, it } from 'vitest'; +import { workflow } from '../definition'; +import { WorkflowEngine } from '../engine'; +import { getBoss } from '../tests/pgboss'; +import { closeTestDatabase, createTestDatabase } from '../tests/test-db'; +import { WorkflowStatus } from '../types'; +import { otelPlugin } from './otel'; +import { setupOtel } from './otel-test-helpers'; + +let testBoss: PgBoss; +let testPool: pg.Pool; + +beforeAll(async () => { + testPool = await createTestDatabase(); + testBoss = await getBoss(testPool); +}); + +afterAll(async () => { + await closeTestDatabase(); +}); + +describe('otelPlugin', () => { + let otel: ReturnType<typeof setupOtel>; + let engine: WorkflowEngine; + + beforeEach(async () => { + otel = setupOtel(); + engine = new WorkflowEngine({ workflows: [], pool: testPool, boss: 
testBoss }); + await engine.start(); + }); + + afterEach(async () => { + await engine.stop(); + await otel.teardown(); + }); + + it('registers and lets a workflow complete', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))('otel-smoke', async ({ step }) => { + return await step.run('only', async () => 'ok'); + }); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-smoke', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED, output: 'ok' }); + }); +}); diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts new file mode 100644 index 0000000..00049a1 --- /dev/null +++ b/src/plugins/otel.ts @@ -0,0 +1,22 @@ +import type { Tracer } from '@opentelemetry/api'; +import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types'; + +export type OtelPluginOptions = { + /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */ + tracer?: Tracer; + /** Prefix for all span names. Defaults to `pg_workflows`. */ + spanNamePrefix?: string; + /** Extra attributes merged onto the workflow.run span. 
*/ + attributes?: (context: WorkflowContext) => Record; +}; + +const DEFAULT_PREFIX = 'pg_workflows'; + +export function otelPlugin( + _options: OtelPluginOptions = {}, +): WorkflowPlugin { + return { + name: 'opentelemetry', + methods: () => ({}), + }; +} From 6c50c3e7ae7135d9d965f6a1b85c4ec42467e604 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 15:48:49 +0300 Subject: [PATCH 08/21] feat(types): expose resourceId and attempt on WorkflowContext --- src/engine.ts | 2 ++ src/types.ts | 10 +++++++--- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/src/engine.ts b/src/engine.ts index f443e2e..8c3f14a 100644 --- a/src/engine.ts +++ b/src/engine.ts @@ -1128,6 +1128,8 @@ export class WorkflowEngine { input: run.input as InferInputParameters, workflowId: run.workflowId, runId: run.id, + resourceId: run.resourceId ?? undefined, + attempt: run.retryCount, get timeline() { // Read through to the live run so callers see entries written by // previously completed steps within the same handler invocation. diff --git a/src/types.ts b/src/types.ts index 0c85210..48452ae 100644 --- a/src/types.ts +++ b/src/types.ts @@ -95,14 +95,14 @@ export type StepBaseContext = { * @template TStepExt - The extra methods this plugin adds to step. */ export interface WorkflowPlugin { - name: string - methods: (step: TStepBase, context: WorkflowContext) => TStepExt + name: string; + methods: (step: TStepBase, context: WorkflowContext) => TStepExt; /** * Optional middleware around the workflow handler call. Composes in * registration order — the first plugin passed to `.use()` wraps everything * inside. Implementations MUST call `next()` exactly once. 
*/ - wrap?: (context: WorkflowContext, next: () => Promise) => Promise + wrap?: (context: WorkflowContext, next: () => Promise) => Promise; } export type WorkflowContext< @@ -113,6 +113,10 @@ export type WorkflowContext< step: TStep; workflowId: string; runId: string; + /** Tenant/scope identifier set when the run was started, if any. */ + resourceId?: string; + /** Zero-based retry attempt number (= `run.retryCount`). */ + attempt: number; timeline: Record; logger: WorkflowLogger; }; From aae27e581c78740892d45091bbdabac674f2725d Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 15:58:47 +0300 Subject: [PATCH 09/21] feat(otel): emit workflow.run span via wrap hook --- src/plugins/otel.test.ts | 23 +++++++++++++++++++++++ src/plugins/otel.ts | 34 +++++++++++++++++++++++++++++++--- 2 files changed, 54 insertions(+), 3 deletions(-) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 330cd7d..4b9e383 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -46,4 +46,27 @@ describe('otelPlugin', () => { .poll(async () => await engine.getRun({ runId: run.id })) .toMatchObject({ status: WorkflowStatus.COMPLETED, output: 'ok' }); }); + + it('emits a workflow.run span on successful completion', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))('otel-wf-span', async () => 'done'); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ + resourceId: 'tenant-1', + workflowId: 'otel-wf-span', + input: {}, + }); + await expect + .poll(async () => await engine.getRun({ runId: run.id, resourceId: 'tenant-1' })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + const spans = otel.getSpansByName('pg_workflows.workflow.run'); + expect(spans).toHaveLength(1); + expect(spans[0].attributes).toMatchObject({ + 'workflow.id': 'otel-wf-span', + 'workflow.run_id': run.id, + 'workflow.resource_id': 'tenant-1', + 'workflow.attempt': 0, + }); + expect(spans[0].status.code).toBe(1); 
// SpanStatusCode.OK + }); }); diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index 00049a1..62adba8 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -1,4 +1,4 @@ -import type { Tracer } from '@opentelemetry/api'; +import { type AttributeValue, SpanStatusCode, type Tracer, trace } from '@opentelemetry/api'; import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types'; export type OtelPluginOptions = { @@ -7,16 +7,44 @@ export type OtelPluginOptions = { /** Prefix for all span names. Defaults to `pg_workflows`. */ spanNamePrefix?: string; /** Extra attributes merged onto the workflow.run span. */ - attributes?: (context: WorkflowContext) => Record; + attributes?: (context: WorkflowContext) => Record; }; const DEFAULT_PREFIX = 'pg_workflows'; export function otelPlugin( - _options: OtelPluginOptions = {}, + options: OtelPluginOptions = {}, ): WorkflowPlugin { + const tracer = options.tracer ?? trace.getTracer('pg-workflows'); + const prefix = options.spanNamePrefix ?? DEFAULT_PREFIX; + const extraAttrs = options.attributes; + return { name: 'opentelemetry', + methods: () => ({}), + + wrap: (context, next) => + tracer.startActiveSpan( + `${prefix}.workflow.run`, + { + attributes: { + 'workflow.id': context.workflowId, + 'workflow.run_id': context.runId, + 'workflow.attempt': context.attempt, + ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}), + ...(extraAttrs ? 
extraAttrs(context) : {}), + }, + }, + async (span) => { + try { + const result = await next(); + span.setStatus({ code: SpanStatusCode.OK }); + return result; + } finally { + span.end(); + } + }, + ), }; } From 33eeb34acc7edeb7f10ab810354c35891fce0c6c Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 16:24:27 +0300 Subject: [PATCH 10/21] feat(otel): record exception on workflow.run span on failure --- src/plugins/otel.test.ts | 22 ++++++++++++++++++++++ src/plugins/otel.ts | 5 +++++ 2 files changed, 27 insertions(+) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 4b9e383..932b525 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -69,4 +69,26 @@ describe('otelPlugin', () => { }); expect(spans[0].status.code).toBe(1); // SpanStatusCode.OK }); + + it('records exception and ERROR status on workflow.run when handler throws', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-wf-throw', + async ({ step }) => { + await step.run('boom', async () => { + throw new Error('kaboom'); + }); + }, + { retries: 0 }, + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-wf-throw', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.FAILED }); + + const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0]; + expect(wfSpan.status.code).toBe(2); // SpanStatusCode.ERROR + expect(wfSpan.status.message).toBe('kaboom'); + expect(wfSpan.events.some((e) => e.name === 'exception')).toBe(true); + }); }); diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index 62adba8..58e37fa 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -41,6 +41,11 @@ export function otelPlugin( const result = await next(); span.setStatus({ code: SpanStatusCode.OK }); return result; + } catch (err) { + const error = err instanceof Error ? 
err : new Error(String(err)); + span.recordException(error); + span.setStatus({ code: SpanStatusCode.ERROR, message: error.message }); + throw err; } finally { span.end(); } From 98f24938e2eba9bfeae77147601a5b575921b222 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 16:37:26 +0300 Subject: [PATCH 11/21] feat(otel): wrap step.run with span, cache-hit suppression, error path Co-Authored-By: Claude Sonnet 4.6 --- src/plugins/otel.test.ts | 77 ++++++++++++++++++++++++++++++++++++++++ src/plugins/otel.ts | 66 ++++++++++++++++++++++++++++++++-- 2 files changed, 141 insertions(+), 2 deletions(-) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 932b525..ce15bff 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -91,4 +91,81 @@ describe('otelPlugin', () => { expect(wfSpan.status.message).toBe('kaboom'); expect(wfSpan.events.some((e) => e.name === 'exception')).toBe(true); }); + + it('emits step.run span as a child of workflow.run', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-step-run-child', + async ({ step }) => { + return await step.run('foo', async () => 'bar'); + }, + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-step-run-child', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0]; + const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0]; + expect(stepSpan).toBeDefined(); + expect(stepSpan.attributes).toMatchObject({ 'step.id': 'foo', 'step.type': 'run' }); + expect(stepSpan.parentSpanId).toBe(wfSpan.spanContext().spanId); + }); + + it('skips step.run span on cache-hit replay', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-cache-skip', + async ({ step }) => { + const a = await step.run('first', async () 
=> 'A'); + await step.waitFor('gate', { eventName: 'go' }); + const b = await step.run('second', async () => 'B'); + return { a, b }; + }, + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-cache-skip', input: {} }); + + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.PAUSED }); + + // First execution: workflow.run + step.run('first') + step.waitFor('gate') + expect( + otel.getSpansByName('pg_workflows.step.run').map((s) => s.attributes['step.id']), + ).toEqual(['first']); + + await engine.triggerEvent({ runId: run.id, eventName: 'go' }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + // Second execution: NEW workflow.run + step.run('second') only. + // 'first' is a cache hit and emits no span. + const stepRunSpans = otel.getSpansByName('pg_workflows.step.run'); + const ids = stepRunSpans.map((s) => s.attributes['step.id']); + expect(ids).toEqual(['first', 'second']); + expect(otel.getSpansByName('pg_workflows.workflow.run')).toHaveLength(2); + }); + + it('records exception and ERROR status on step.run when handler throws', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-step-throw', + async ({ step }) => { + await step.run('explode', async () => { + throw new Error('nope'); + }); + }, + { retries: 0 }, + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-step-throw', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.FAILED }); + + const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0]; + expect(stepSpan.status.code).toBe(2); + expect(stepSpan.status.message).toBe('nope'); + expect(stepSpan.events.some((e) => e.name === 'exception')).toBe(true); + }); }); diff --git a/src/plugins/otel.ts 
b/src/plugins/otel.ts index 58e37fa..e13070f 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -1,4 +1,10 @@ -import { type AttributeValue, SpanStatusCode, type Tracer, trace } from '@opentelemetry/api'; +import { + type AttributeValue, + context as otelContext, + SpanStatusCode, + type Tracer, + trace, +} from '@opentelemetry/api'; import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types'; export type OtelPluginOptions = { @@ -12,6 +18,19 @@ export type OtelPluginOptions = { const DEFAULT_PREFIX = 'pg_workflows'; +function isCachedHit(timeline: Record, stepId: string): boolean { + const entry = timeline[stepId]; + if ( + entry && + typeof entry === 'object' && + 'output' in entry && + (entry as { output: unknown }).output !== undefined + ) { + return true; + } + return false; +} + export function otelPlugin( options: OtelPluginOptions = {}, ): WorkflowPlugin { @@ -22,7 +41,50 @@ export function otelPlugin( return { name: 'opentelemetry', - methods: () => ({}), + methods: (step, context) => ({ + run: async (stepId: string, handler: () => Promise) => { + if (isCachedHit(context.timeline, stepId)) { + return step.run(stepId, handler); + } + + // Capture the active context (workflow.run span) before the async step runs. + // We emit the span only if the step actually ran (result !== undefined). + // If the base step skips execution (workflow paused/cancelled), it returns + // undefined and we suppress the span to avoid noise on replay paths. + const capturedCtx = otelContext.active(); + let result: T | undefined; + let thrownError: Error | undefined; + + try { + result = await step.run(stepId, handler); + } catch (err) { + thrownError = err instanceof Error ? err : new Error(String(err)); + } + + if (result === undefined && !thrownError) { + // Step was skipped (workflow is paused/cancelled/failed) — no span. + return undefined as T; + } + + // Step ran or threw — emit a span with correct parent. 
+ const span = tracer.startSpan( + `${prefix}.step.run`, + { attributes: { 'step.id': stepId, 'step.type': 'run' } }, + capturedCtx, + ); + + if (thrownError) { + span.recordException(thrownError); + span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); + span.end(); + throw thrownError; + } + + span.setStatus({ code: SpanStatusCode.OK }); + span.end(); + return result as T; + }, + }), wrap: (context, next) => tracer.startActiveSpan( From 57b291cbbe1ab5c184b9b5ef404a8ee14d70edec Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 16:44:21 +0300 Subject: [PATCH 12/21] fix(otel): preserve step.run span duration and original error throw - Capture startTime before awaiting step.run so spans reflect actual step execution time instead of near-zero post-completion duration. - Save originalErr and re-throw it (not the coerced Error), matching the wrap hook pattern and preserving non-Error throw values. - Add test asserting step.run span duration >= 30ms for a 50ms handler. 
Co-Authored-By: Claude Sonnet 4.6 --- src/plugins/otel.test.ts | 26 ++++++++++++++++++++++++++ src/plugins/otel.ts | 21 +++++++++++++-------- 2 files changed, 39 insertions(+), 8 deletions(-) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index ce15bff..d4db6ac 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -168,4 +168,30 @@ describe('otelPlugin', () => { expect(stepSpan.status.message).toBe('nope'); expect(stepSpan.events.some((e) => e.name === 'exception')).toBe(true); }); + + it('step.run span has non-zero duration matching the step handler runtime', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-step-duration', + async ({ step }) => { + return await step.run('slow', async () => { + await new Promise((resolve) => setTimeout(resolve, 50)); + return 'done'; + }); + }, + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-step-duration', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0]; + expect(stepSpan).toBeDefined(); + // Span duration = endTime - startTime in nanoseconds. With a 50ms sleep + // inside the handler, we expect at least ~30ms (allow generous margin). + const startNs = stepSpan.startTime[0] * 1_000_000_000 + stepSpan.startTime[1]; + const endNs = stepSpan.endTime[0] * 1_000_000_000 + stepSpan.endTime[1]; + const durationMs = (endNs - startNs) / 1_000_000; + expect(durationMs).toBeGreaterThan(30); + }); }); diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index e13070f..9a421be 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -47,29 +47,34 @@ export function otelPlugin( return step.run(stepId, handler); } - // Capture the active context (workflow.run span) before the async step runs. 
- // We emit the span only if the step actually ran (result !== undefined). - // If the base step skips execution (workflow paused/cancelled), it returns - // undefined and we suppress the span to avoid noise on replay paths. + // Capture the active context (workflow.run span) and the start time + // BEFORE running the step, so the emitted span has correct timing. + // We materialise the span only if the step actually ran or threw — + // skipped steps (engine short-circuit on paused/cancelled runs) return + // undefined and produce no span. const capturedCtx = otelContext.active(); + const startTime = new Date(); let result: T | undefined; + let originalErr: unknown; let thrownError: Error | undefined; try { result = await step.run(stepId, handler); } catch (err) { + originalErr = err; thrownError = err instanceof Error ? err : new Error(String(err)); } if (result === undefined && !thrownError) { - // Step was skipped (workflow is paused/cancelled/failed) — no span. return undefined as T; } - // Step ran or threw — emit a span with correct parent. 
const span = tracer.startSpan( `${prefix}.step.run`, - { attributes: { 'step.id': stepId, 'step.type': 'run' } }, + { + startTime, + attributes: { 'step.id': stepId, 'step.type': 'run' }, + }, capturedCtx, ); @@ -77,7 +82,7 @@ export function otelPlugin( span.recordException(thrownError); span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); span.end(); - throw thrownError; + throw originalErr; } span.setStatus({ code: SpanStatusCode.OK }); From 3111e167da8368501d631336503a94f9727a22b2 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 17:39:22 +0300 Subject: [PATCH 13/21] feat(otel): wrap waitFor, delay, waitUntil, pause with spans Co-Authored-By: Claude Sonnet 4.6 --- src/plugins/otel.test.ts | 47 ++++++++++++++ src/plugins/otel.ts | 135 ++++++++++++++++++++++++++------------- 2 files changed, 139 insertions(+), 43 deletions(-) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index d4db6ac..63c7282 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -194,4 +194,51 @@ describe('otelPlugin', () => { const durationMs = (endNs - startNs) / 1_000_000; expect(durationMs).toBeGreaterThan(30); }); + + it('emits spans for waitFor, delay, waitUntil, pause', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-other-steps', + async ({ step }) => { + await step.waitFor('wf', { eventName: 'evt' }); + await step.delay('d', '1ms'); + await step.waitUntil('wu', new Date(Date.now() + 1)); + await step.pause('p'); + return 'ok'; + }, + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-other-steps', input: {} }); + + // Workflow pauses immediately on first waitFor; drive it through completion. 
+ const drive = async (stepId?: string) => { + for (let i = 0; i < 40; i++) { + const r = await engine.getRun({ runId: run.id }); + if (r.status === WorkflowStatus.PAUSED && (!stepId || r.currentStepId === stepId)) break; + await new Promise((res) => setTimeout(res, 50)); + } + }; + await drive('wf'); + await engine.triggerEvent({ runId: run.id, eventName: 'evt' }); + // delay + waitUntil resolve themselves; wait until paused at the explicit pause step. + await drive('p'); + await engine.resumeWorkflow({ runId: run.id }); + await expect + .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + const stepNames = otel + .getSpans() + .map((s) => s.name) + .filter((n) => n.startsWith('pg_workflows.step.')); + expect(stepNames).toEqual( + expect.arrayContaining([ + 'pg_workflows.step.waitFor', + 'pg_workflows.step.delay', + 'pg_workflows.step.waitUntil', + 'pg_workflows.step.pause', + ]), + ); + const waitForSpan = otel.getSpansByName('pg_workflows.step.waitFor')[0]; + expect(waitForSpan.attributes).toMatchObject({ 'step.id': 'wf', 'step.type': 'waitFor' }); + }); }); diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index 9a421be..bed9f6d 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -41,55 +41,104 @@ export function otelPlugin( return { name: 'opentelemetry', - methods: (step, context) => ({ - run: async (stepId: string, handler: () => Promise) => { - if (isCachedHit(context.timeline, stepId)) { - return step.run(stepId, handler); - } + methods: (step, context) => { + const wrapVoidish = ( + kind: 'waitFor' | 'delay' | 'waitUntil' | 'pause', + base: (stepId: string, ...args: Args) => Promise, + ) => { + return async (stepId: string, ...args: Args): Promise => { + if (isCachedHit(context.timeline, stepId)) { + return base(stepId, ...args); + } + const capturedCtx = otelContext.active(); + const startTime = new Date(); + let result: R; + let originalErr: unknown; + let 
thrownError: Error | undefined; + try { + result = await base(stepId, ...args); + } catch (err) { + originalErr = err; + thrownError = err instanceof Error ? err : new Error(String(err)); + } + const span = tracer.startSpan( + `${prefix}.step.${kind}`, + { + startTime, + attributes: { 'step.id': stepId, 'step.type': kind }, + }, + capturedCtx, + ); + if (thrownError) { + span.recordException(thrownError); + span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); + span.end(); + throw originalErr; + } + span.setStatus({ code: SpanStatusCode.OK }); + span.end(); + // biome-ignore lint/style/noNonNullAssertion: result is assigned in try when not thrown + return result!; + }; + }; - // Capture the active context (workflow.run span) and the start time - // BEFORE running the step, so the emitted span has correct timing. - // We materialise the span only if the step actually ran or threw — - // skipped steps (engine short-circuit on paused/cancelled runs) return - // undefined and produce no span. - const capturedCtx = otelContext.active(); - const startTime = new Date(); - let result: T | undefined; - let originalErr: unknown; - let thrownError: Error | undefined; + return { + run: async (stepId: string, handler: () => Promise) => { + if (isCachedHit(context.timeline, stepId)) { + return step.run(stepId, handler); + } - try { - result = await step.run(stepId, handler); - } catch (err) { - originalErr = err; - thrownError = err instanceof Error ? err : new Error(String(err)); - } + // Capture the active context (workflow.run span) and the start time + // BEFORE running the step, so the emitted span has correct timing. + // We materialise the span only if the step actually ran or threw — + // skipped steps (engine short-circuit on paused/cancelled runs) return + // undefined and produce no span. 
+ const capturedCtx = otelContext.active(); + const startTime = new Date(); + let result: T | undefined; + let originalErr: unknown; + let thrownError: Error | undefined; - if (result === undefined && !thrownError) { - return undefined as T; - } + try { + result = await step.run(stepId, handler); + } catch (err) { + originalErr = err; + thrownError = err instanceof Error ? err : new Error(String(err)); + } - const span = tracer.startSpan( - `${prefix}.step.run`, - { - startTime, - attributes: { 'step.id': stepId, 'step.type': 'run' }, - }, - capturedCtx, - ); + if (result === undefined && !thrownError) { + return undefined as T; + } - if (thrownError) { - span.recordException(thrownError); - span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); - span.end(); - throw originalErr; - } + const span = tracer.startSpan( + `${prefix}.step.run`, + { + startTime, + attributes: { 'step.id': stepId, 'step.type': 'run' }, + }, + capturedCtx, + ); - span.setStatus({ code: SpanStatusCode.OK }); - span.end(); - return result as T; - }, - }), + if (thrownError) { + span.recordException(thrownError); + span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); + span.end(); + throw originalErr; + } + + span.setStatus({ code: SpanStatusCode.OK }); + span.end(); + return result as T; + }, + waitFor: wrapVoidish('waitFor', step.waitFor as never) as StepBaseContext['waitFor'], + delay: wrapVoidish('delay', step.delay as never) as StepBaseContext['delay'], + waitUntil: wrapVoidish( + 'waitUntil', + step.waitUntil as never, + ) as StepBaseContext['waitUntil'], + pause: wrapVoidish('pause', step.pause as never) as StepBaseContext['pause'], + }; + }, wrap: (context, next) => tracer.startActiveSpan( From 21a266932a775a896cb05b735d77b588e4387641 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 17:46:49 +0300 Subject: [PATCH 14/21] feat(otel): wrap step.poll with span Co-Authored-By: Claude Sonnet 4.6 --- 
src/plugins/otel.test.ts | 36 ++++++++++++++++++++++++++++++++++++ src/plugins/otel.ts | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 63c7282..8b467d0 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -241,4 +241,40 @@ describe('otelPlugin', () => { const waitForSpan = otel.getSpansByName('pg_workflows.step.waitFor')[0]; expect(waitForSpan.attributes).toMatchObject({ 'step.id': 'wf', 'step.type': 'waitFor' }); }); + + it('emits step.poll span on each poll attempt', async () => { + let attempt = 0; + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))('otel-poll', async ({ step }) => { + const result = await step.poll( + 'poller', + async () => { + attempt += 1; + return attempt >= 2 ? { value: attempt } : false; + }, + { interval: '30s', timeout: '60s' }, + ); + return result; + }); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-poll', input: {} }); + + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.PAUSED }); + + // First execution emitted exactly one step.poll span + const firstPolls = otel.getSpansByName('pg_workflows.step.poll'); + expect(firstPolls).toHaveLength(1); + expect(firstPolls[0].attributes).toMatchObject({ 'step.id': 'poller', 'step.type': 'poll' }); + + // Simulate the poll-interval re-fire via fastForwardWorkflow + await engine.fastForwardWorkflow({ runId: run.id }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + // Second execution emits a new poll span (the previous one is not a cache hit + // because the step's *output* is not yet in timeline, only a poll-state entry) + expect(otel.getSpansByName('pg_workflows.step.poll').length).toBeGreaterThanOrEqual(2); + }); }); diff --git a/src/plugins/otel.ts 
b/src/plugins/otel.ts index bed9f6d..e3e29fb 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -137,6 +137,41 @@ export function otelPlugin( step.waitUntil as never, ) as StepBaseContext['waitUntil'], pause: wrapVoidish('pause', step.pause as never) as StepBaseContext['pause'], + poll: (async ( + stepId: string, + conditionFn: () => Promise, + pollOptions?: Parameters[2], + ) => { + const capturedCtx = otelContext.active(); + const startTime = new Date(); + let result: Awaited> | undefined; + let originalErr: unknown; + let thrownError: Error | undefined; + try { + result = await step.poll(stepId, conditionFn, pollOptions); + } catch (err) { + originalErr = err; + thrownError = err instanceof Error ? err : new Error(String(err)); + } + const span = tracer.startSpan( + `${prefix}.step.poll`, + { + startTime, + attributes: { 'step.id': stepId, 'step.type': 'poll' }, + }, + capturedCtx, + ); + if (thrownError) { + span.recordException(thrownError); + span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); + span.end(); + throw originalErr; + } + span.setStatus({ code: SpanStatusCode.OK }); + span.end(); + // biome-ignore lint/style/noNonNullAssertion: result is assigned in try when not thrown + return result!; + }) as StepBaseContext['poll'], }; }, From 1d3e41dedfaf37c8c02a4cd7adceb91c607cc21b Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Thu, 21 May 2026 18:46:17 +0300 Subject: [PATCH 15/21] feat(otel): wrap step.invokeChildWorkflow with binding-aware cache check Adds a span for step.invokeChildWorkflow in the OTel plugin, emitting exactly one span per invocation (first execution only) by detecting both the cached-output case and the binding-key-only case (parent paused but child not yet complete) as cache hits on resume. 
Co-Authored-By: Claude Sonnet 4.6 --- src/plugins/otel.test.ts | 31 +++++++++++++++++++ src/plugins/otel.ts | 67 ++++++++++++++++++++++++++++++++++++++-- 2 files changed, 95 insertions(+), 3 deletions(-) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 8b467d0..d522119 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -242,6 +242,37 @@ describe('otelPlugin', () => { expect(waitForSpan.attributes).toMatchObject({ 'step.id': 'wf', 'step.type': 'waitFor' }); }); + it('emits invokeChildWorkflow span on creation and skips on cache-hit resume', async () => { + const child = workflow('otel-child', async ({ step }) => + step.run('done', async () => 'child-done'), + ); + await engine.registerWorkflow(child); + + const parent = workflow.use(otelPlugin({ tracer: otel.tracer }))( + 'otel-parent', + async ({ step }) => { + const r = await step.invokeChildWorkflow('call-child', { + workflowId: child.id, + input: {}, + }); + return r; + }, + ); + await engine.registerWorkflow(parent); + const run = await engine.startWorkflow({ workflowId: 'otel-parent', input: {} }); + + await expect + .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + const invokeSpans = otel.getSpansByName('pg_workflows.step.invokeChildWorkflow'); + expect(invokeSpans).toHaveLength(1); + expect(invokeSpans[0].attributes).toMatchObject({ + 'step.id': 'call-child', + 'step.type': 'invokeChildWorkflow', + }); + }); + it('emits step.poll span on each poll attempt', async () => { let attempt = 0; const w = workflow.use(otelPlugin({ tracer: otel.tracer }))('otel-poll', async ({ step }) => { diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index e3e29fb..752a1b6 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -5,6 +5,7 @@ import { type Tracer, trace, } from '@opentelemetry/api'; +import { invokeChildWorkflowTimelineKey } from '../constants'; import type { 
StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types'; export type OtelPluginOptions = { @@ -18,7 +19,16 @@ export type OtelPluginOptions = { const DEFAULT_PREFIX = 'pg_workflows'; -function isCachedHit(timeline: Record, stepId: string): boolean { +type StepKind = + | 'run' + | 'waitFor' + | 'delay' + | 'waitUntil' + | 'pause' + | 'poll' + | 'invokeChildWorkflow'; + +function isCachedHit(timeline: Record, stepId: string, kind: StepKind): boolean { const entry = timeline[stepId]; if ( entry && @@ -28,6 +38,9 @@ function isCachedHit(timeline: Record, stepId: string): boolean ) { return true; } + if (kind === 'invokeChildWorkflow' && timeline[invokeChildWorkflowTimelineKey(stepId)]) { + return true; + } return false; } @@ -47,7 +60,7 @@ export function otelPlugin( base: (stepId: string, ...args: Args) => Promise, ) => { return async (stepId: string, ...args: Args): Promise => { - if (isCachedHit(context.timeline, stepId)) { + if (isCachedHit(context.timeline, stepId, kind)) { return base(stepId, ...args); } const capturedCtx = otelContext.active(); @@ -84,7 +97,7 @@ export function otelPlugin( return { run: async (stepId: string, handler: () => Promise) => { - if (isCachedHit(context.timeline, stepId)) { + if (isCachedHit(context.timeline, stepId, 'run')) { return step.run(stepId, handler); } @@ -172,6 +185,54 @@ export function otelPlugin( // biome-ignore lint/style/noNonNullAssertion: result is assigned in try when not thrown return result!; }) as StepBaseContext['poll'], + invokeChildWorkflow: (async ( + stepId: string, + refOrParams: Parameters[1], + inputArg?: unknown, + optionsArg?: unknown, + ) => { + if (isCachedHit(context.timeline, stepId, 'invokeChildWorkflow')) { + return (step.invokeChildWorkflow as (...args: unknown[]) => Promise)( + stepId, + refOrParams, + inputArg, + optionsArg, + ); + } + const capturedCtx = otelContext.active(); + const startTime = new Date(); + let result: unknown; + let originalErr: unknown; + let thrownError: Error | 
undefined; + try { + result = await (step.invokeChildWorkflow as (...args: unknown[]) => Promise)( + stepId, + refOrParams, + inputArg, + optionsArg, + ); + } catch (err) { + originalErr = err; + thrownError = err instanceof Error ? err : new Error(String(err)); + } + const span = tracer.startSpan( + `${prefix}.step.invokeChildWorkflow`, + { + startTime, + attributes: { 'step.id': stepId, 'step.type': 'invokeChildWorkflow' }, + }, + capturedCtx, + ); + if (thrownError) { + span.recordException(thrownError); + span.setStatus({ code: SpanStatusCode.ERROR, message: thrownError.message }); + span.end(); + throw originalErr; + } + span.setStatus({ code: SpanStatusCode.OK }); + span.end(); + return result; + }) as StepBaseContext['invokeChildWorkflow'], }; }, From 431254959cb572b2054e1e93886228d4b23242fb Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Fri, 22 May 2026 08:35:00 +0300 Subject: [PATCH 16/21] test(otel): direct coverage for isCachedHit predicate --- src/plugins/otel.test.ts | 29 +++++++++++++++++++++++++++++ src/plugins/otel.ts | 6 +++++- 2 files changed, 34 insertions(+), 1 deletion(-) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index d522119..6e3b383 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -309,3 +309,32 @@ describe('otelPlugin', () => { expect(otel.getSpansByName('pg_workflows.step.poll').length).toBeGreaterThanOrEqual(2); }); }); + +import { invokeChildWorkflowTimelineKey } from '../constants'; +import { isCachedHit } from './otel'; + +describe('isCachedHit', () => { + it('returns true when output is recorded for stepId', () => { + expect(isCachedHit({ s: { output: 'x', timestamp: new Date() } }, 's', 'run')).toBe(true); + }); + + it('returns false when output is undefined', () => { + expect(isCachedHit({ s: { output: undefined, timestamp: new Date() } }, 's', 'run')).toBe( + false, + ); + }); + + it('returns false when timeline has no entry for stepId', () => { + expect(isCachedHit({}, 's', 
'run')).toBe(false); + }); + + it('returns false for non-object entry', () => { + expect(isCachedHit({ s: 'not-an-object' }, 's', 'run')).toBe(false); + }); + + it('returns true for invokeChildWorkflow when only the binding key is present', () => { + const timeline = { [invokeChildWorkflowTimelineKey('s')]: { invokeChildWorkflow: {} } }; + expect(isCachedHit(timeline, 's', 'invokeChildWorkflow')).toBe(true); + expect(isCachedHit(timeline, 's', 'run')).toBe(false); + }); +}); diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index 752a1b6..6844ca2 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -28,7 +28,11 @@ type StepKind = | 'poll' | 'invokeChildWorkflow'; -function isCachedHit(timeline: Record, stepId: string, kind: StepKind): boolean { +export function isCachedHit( + timeline: Record, + stepId: string, + kind: StepKind, +): boolean { const entry = timeline[stepId]; if ( entry && From d2352bbf67264d976f2e0cb0530202504002dac2 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Fri, 22 May 2026 08:57:59 +0300 Subject: [PATCH 17/21] test(otel): verify plugin composition order with another wrap --- src/plugins/otel.test.ts | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 6e3b383..179bbc5 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -5,6 +5,7 @@ import { workflow } from '../definition'; import { WorkflowEngine } from '../engine'; import { getBoss } from '../tests/pgboss'; import { closeTestDatabase, createTestDatabase } from '../tests/test-db'; +import type { StepBaseContext, WorkflowPlugin } from '../types'; import { WorkflowStatus } from '../types'; import { otelPlugin } from './otel'; import { setupOtel } from './otel-test-helpers'; @@ -308,6 +309,36 @@ describe('otelPlugin', () => { // because the step's *output* is not yet in timeline, only a poll-state entry) 
expect(otel.getSpansByName('pg_workflows.step.poll').length).toBeGreaterThanOrEqual(2); }); + + it('composes wrap with another plugin in registration order', async () => { + const calls: string[] = []; + const trackerPlugin: WorkflowPlugin = { + name: 'tracker', + methods: () => ({}), + wrap: async (_ctx, next) => { + calls.push('tracker:before'); + const r = await next(); + calls.push('tracker:after'); + return r; + }, + }; + + const w = workflow.use(trackerPlugin).use(otelPlugin({ tracer: otel.tracer }))( + 'otel-compose', + async () => 'ok', + ); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-compose', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id })) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + // tracker registered first, so its wrap is outermost — its before runs + // before the workflow.run span opens, and its after runs after the span ends. + const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0]; + expect(wfSpan).toBeDefined(); + expect(calls).toEqual(['tracker:before', 'tracker:after']); + }); }); import { invokeChildWorkflowTimelineKey } from '../constants'; From 399d19c4114264adeaab6168b32e691df87d91ec Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Fri, 22 May 2026 09:01:20 +0300 Subject: [PATCH 18/21] feat(otel): export otelPlugin and document usage --- AGENTS.md | 15 +++++++++++++++ README.md | 42 ++++++++++++++++++++++++++++++++++++++++++ src/index.ts | 1 + 3 files changed, 58 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index c3a66d3..9d20167 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -118,6 +118,21 @@ await engine.stop(); // graceful shutdown (also closes pool if engine **Dependencies**: `pg` is a peer dependency (you install it); `pg-boss` is a regular dependency (bundled, no install needed). 
+### `otelPlugin(options?)` - OpenTelemetry tracing
+
+```typescript
+import { workflow, otelPlugin } from 'pg-workflows';
+
+// Optional peer dep: install `@opentelemetry/api` and an OTel SDK (e.g. NodeSDK).
+// One `pg_workflows.workflow.run` span per worker execution, with child spans
+// per step kind. Spans replayed from cache after a pause are suppressed.
+const tracedWorkflow = workflow.use(otelPlugin({
+  // tracer?: Tracer // default: trace.getTracer('pg-workflows')
+  // spanNamePrefix?: string // default: 'pg_workflows'
+  // attributes?: (ctx) => Record
+}));
+```
+
 ### Step Types (available on `context.step`)
 
 #### `step.run(stepId, handler)` - Execute a durable step
diff --git a/README.md b/README.md
index a5a8993..8f69123 100644
--- a/README.md
+++ b/README.md
@@ -157,6 +157,48 @@ See [runnable examples](https://github.com/SokratisVidros/pg-workflows/tree/main
 
 ---
 
+## Observability with OpenTelemetry
+
+pg-workflows ships a first-party plugin that emits OTel spans for workflow and step execution. `@opentelemetry/api` is an optional peer dependency — install it only if you want tracing.
+
+```bash
+npm install @opentelemetry/api @opentelemetry/sdk-node
+```
+
+```ts
+import { NodeSDK } from '@opentelemetry/sdk-node'
+import { trace } from '@opentelemetry/api'
+import { workflow, otelPlugin } from 'pg-workflows'
+
+// Initialize your OTel SDK however you normally do — for Node apps the
+// NodeSDK registers an AsyncHooks context manager, which is required for
+// hierarchical (parent/child) spans across async boundaries.
+new NodeSDK({ /* exporters, resource, ... */ }).start()
+
+const tracedWorkflow = workflow.use(otelPlugin())
+
+const myWorkflow = tracedWorkflow('checkout', async ({ step }) => {
+  await step.run('charge', async () => { /* ... */ })
+  await step.waitFor('await-shipment', { eventName: 'shipped' })
+})
+```
+
+The plugin emits a `pg_workflows.workflow.run` span per worker execution (one per resume cycle), with child spans per step kind (`pg_workflows.step.run`, `pg_workflows.step.waitFor`, etc.). Spans carry `workflow.id`, `workflow.run_id`, `workflow.attempt` and, where set, `workflow.resource_id`. Steps replayed from cache after a pause emit no spans.
+
+**Options:**
+
+```ts
+otelPlugin({
+  tracer: trace.getTracer('my-app'), // default: trace.getTracer('pg-workflows')
+  spanNamePrefix: 'pg_workflows', // default shown
+  attributes: (ctx) => ({ tenant: ctx.resourceId }), // extra static attrs on workflow.run
+})
+```
+
+Metrics, distributed trace context propagation across child workflows, and HTTP-caller context propagation are not in v1 — see [the design doc](docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md) for the deferral rationale.
+
+---
+
 ## Requirements
 
 - Node.js >= 18
diff --git a/src/index.ts b/src/index.ts
index 4090c25..f3e6aaa 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -7,6 +7,7 @@ export { createWorkflowRef, workflow } from './definition';
 export type { Duration } from './duration';
 export { WorkflowEngine, type WorkflowEngineOptions } from './engine';
 export { WorkflowEngineError, WorkflowRunNotFoundError } from './error';
+export { type OtelPluginOptions, otelPlugin } from './plugins/otel';
 export type {
   InferInputParameters,
   InputParameters,
From 0573c9f4ed8af63a2f519a04e12e7acdb38ce59f Mon Sep 17 00:00:00 2001
From: Sokratis Vidros
Date: Fri, 22 May 2026 09:08:42 +0300
Subject: [PATCH 19/21] fix(otel): wrap step.sleep alias and align span names with implementation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

step.sleep was not wrapped by the OTel plugin because spreading baseStep copies the getter's value, not the getter itself — so sleep pointed to the unwrapped delay.
Added sleep to the methods return object, reusing the 'delay' kind for semantic consistency. Added a unit test that verifies step.sleep emits a pg_workflows.step.delay span. Also corrected all snake_case span names in the OTel design spec (wait_for, wait_until, invoke_child_workflow) to camelCase (waitFor, waitUntil, invokeChildWorkflow) to match the implementation. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-21-otel-instrumentation-design.md | 16 ++++++++-------- src/plugins/otel.test.ts | 17 +++++++++++++++++ src/plugins/otel.ts | 1 + 3 files changed, 26 insertions(+), 8 deletions(-) diff --git a/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md b/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md index 4fc6d31..2745165 100644 --- a/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md +++ b/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md @@ -13,7 +13,7 @@ Allow pg-workflows users to emit OpenTelemetry traces for workflow and step exec **In scope:** - A first-party plugin, `otelPlugin`, shipped from the `pg-workflows` package. -- A `workflow.run` span per worker execution of a workflow run, with child spans for each step kind (`step.run`, `step.wait_for`, `step.delay`, `step.wait_until`, `step.pause`, `step.poll`, `step.invoke_child_workflow`). +- A `workflow.run` span per worker execution of a workflow run, with child spans for each step kind (`step.run`, `step.waitFor`, `step.delay`, `step.waitUntil`, `step.pause`, `step.poll`, `step.invokeChildWorkflow`). - Hierarchical traces via OpenTelemetry's AsyncLocalStorage active context (no manual context plumbing in user workflows). - Suppression of spans for cache-hit step replays. - Optional peer dependency on `@opentelemetry/api`. Non-users pay zero cost. 
@@ -97,19 +97,19 @@ const tracedWorkflow = workflow.use(otelPlugin({ ``` pg_workflows.workflow.run ├── pg_workflows.step.run -├── pg_workflows.step.wait_for +├── pg_workflows.step.waitFor ├── pg_workflows.step.delay -├── pg_workflows.step.wait_until +├── pg_workflows.step.waitUntil ├── pg_workflows.step.pause ├── pg_workflows.step.poll -└── pg_workflows.step.invoke_child_workflow +└── pg_workflows.step.invokeChildWorkflow ``` | Span | Attributes | | ------------------------------------ | --------------------------------------------------------------------------------------------------- | | `workflow.run` | `workflow.id`, `workflow.run_id`, `workflow.resource_id` (if present), `workflow.attempt` (= `run.retryCount`), plus anything from the user's `attributes(ctx)` callback | | `step.` (all kinds) | `step.id`, `step.type` (matches the `StepType` enum value) | -| `step.invoke_child_workflow` | Plus `child.workflow_id`, `child.run_id` once the child run has been created | +| `step.invokeChildWorkflow` | Plus `child.workflow_id`, `child.run_id` once the child run has been created | | Any span on error | `recordException(err)`, `setStatus({ code: ERROR, message })` | ### Cache-hit suppression @@ -159,9 +159,9 @@ Test setup registers a `BasicTracerProvider` with an `InMemorySpanExporter` once Cases: 1. **Single-step happy path** — one `step.run` produces exactly 2 spans: `workflow.run` parent + `step.run` child. Attributes match. Both `OK`. -2. **Multi-step with pause** — workflow runs `step1.run` → `step2.waitFor`. First execution emits `workflow.run` + `step1.run` + `step2.wait_for`. `triggerEvent` resumes; second execution emits a new `workflow.run` trace containing only the post-pause work (cached `step1` and the resumed `step2` emit no spans). +2. **Multi-step with pause** — workflow runs `step1.run` → `step2.waitFor`. First execution emits `workflow.run` + `step1.run` + `step2.waitFor`. 
`triggerEvent` resumes; second execution emits a new `workflow.run` trace containing only the post-pause work (cached `step1` and the resumed `step2` emit no spans). 3. **Step throws** — `step.run`'s handler throws. The `step.run` span has `ERROR` status with a recorded exception. The error propagates so `run.error` is persisted and pg-boss retry semantics are unchanged. -4. **`invokeChildWorkflow` cache replay** — parent's `step.invoke_child_workflow` span is emitted on the pause execution. On the resume execution, the binding key is present and the cached output completes, so no span is emitted. +4. **`invokeChildWorkflow` cache replay** — parent's `step.invokeChildWorkflow` span is emitted on the pause execution. On the resume execution, the binding key is present and the cached output completes, so no span is emitted. 5. **Plugin composition order** — register a trivial second wrap plugin alongside `otelPlugin` (in both orders) and assert wraps compose in `.use()` registration order. 6. **Cache-hit predicate unit test** — direct test of the `isCachedHit` predicate against the timeline shapes produced by each step kind. @@ -185,7 +185,7 @@ The issue proposes `pg_workflows.workflow.started`, `pg_workflows.workflow.compl When a workflow pauses and resumes, the resume execution gets a fresh root span — there is no link to the previous execution's trace beyond shared `workflow.run_id` attributes. Linking them would require persisting the trace context (`traceparent` header value) somewhere durable, e.g. in `workflow_runs.timeline` or a dedicated column. -Same for `step.invoke_child_workflow`: child runs currently start a fresh root span rather than continuing the parent's trace. +Same for `step.invokeChildWorkflow`: child runs currently start a fresh root span rather than continuing the parent's trace. Both deferred together because they share the persistence design question. 
diff --git a/src/plugins/otel.test.ts b/src/plugins/otel.test.ts index 179bbc5..9615631 100644 --- a/src/plugins/otel.test.ts +++ b/src/plugins/otel.test.ts @@ -310,6 +310,23 @@ describe('otelPlugin', () => { expect(otel.getSpansByName('pg_workflows.step.poll').length).toBeGreaterThanOrEqual(2); }); + it('wraps step.sleep (alias for step.delay) with a span', async () => { + const w = workflow.use(otelPlugin({ tracer: otel.tracer }))('otel-sleep', async ({ step }) => { + await step.sleep('napping', '1ms'); + return 'ok'; + }); + await engine.registerWorkflow(w); + const run = await engine.startWorkflow({ workflowId: 'otel-sleep', input: {} }); + await expect + .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) + .toMatchObject({ status: WorkflowStatus.COMPLETED }); + + // sleep is an alias for delay — the span name should be pg_workflows.step.delay + // so users can search uniformly for "delay" spans regardless of the alias used. + const delaySpans = otel.getSpansByName('pg_workflows.step.delay'); + expect(delaySpans.some((s) => s.attributes['step.id'] === 'napping')).toBe(true); + }); + it('composes wrap with another plugin in registration order', async () => { const calls: string[] = []; const trackerPlugin: WorkflowPlugin = { diff --git a/src/plugins/otel.ts b/src/plugins/otel.ts index 6844ca2..334c189 100644 --- a/src/plugins/otel.ts +++ b/src/plugins/otel.ts @@ -149,6 +149,7 @@ export function otelPlugin( }, waitFor: wrapVoidish('waitFor', step.waitFor as never) as StepBaseContext['waitFor'], delay: wrapVoidish('delay', step.delay as never) as StepBaseContext['delay'], + sleep: wrapVoidish('delay', step.delay as never) as StepBaseContext['sleep'], waitUntil: wrapVoidish( 'waitUntil', step.waitUntil as never, From a93ace7092e10f36ab1b3c2364dd1a0f51b04a2a Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Fri, 22 May 2026 13:18:12 +0300 Subject: [PATCH 20/21] docs: replace internal spec/plan with public observability page MIME-Version: 
1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Move the OTel design and plan files out of the repo — they were development-process metadata, not user-facing docs. Their concrete output lives in src/plugins/otel.ts and is exercised by the test suite. Add a public docs/observability.md page covering span hierarchy, attributes, cache-hit semantics, plugin composition, options, error semantics, and explicit v1 deferrals. Wire the page into the README documentation index and fix the design-doc link in the Observability section. Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 3 +- docs/observability.md | 117 ++ .../plans/2026-05-21-otel-instrumentation.md | 1577 ----------------- .../2026-05-21-otel-instrumentation-design.md | 202 --- 4 files changed, 119 insertions(+), 1780 deletions(-) create mode 100644 docs/observability.md delete mode 100644 docs/superpowers/plans/2026-05-21-otel-instrumentation.md delete mode 100644 docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md diff --git a/README.md b/README.md index 8f69123..f461c69 100644 --- a/README.md +++ b/README.md @@ -154,6 +154,7 @@ See [runnable examples](https://github.com/SokratisVidros/pg-workflows/tree/main - **[Examples](docs/examples.md)** - conditional steps, batch loops, scheduled reminders, retries, monitoring - **[API Reference](docs/api-reference.md)** - `WorkflowEngine`, `WorkflowClient`, `WorkflowRef`, types - **[Configuration](docs/configuration.md)** - env vars, database setup, requirements +- **[Observability](docs/observability.md)** - OpenTelemetry tracing via `otelPlugin` --- @@ -195,7 +196,7 @@ otelPlugin({ }) ``` -Metrics, distributed trace context propagation across child workflows, and HTTP-caller context propagation are not in v1 — see [the design doc](docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md) for the deferral rationale. 
+Metrics, distributed trace context propagation across child workflows, and HTTP-caller context propagation are not in v1 — see [the observability docs](docs/observability.md#not-in-v1) for the deferral rationale.
 
 ---
 
diff --git a/docs/observability.md b/docs/observability.md
new file mode 100644
index 0000000..ed631ae
--- /dev/null
+++ b/docs/observability.md
@@ -0,0 +1,117 @@
+# Observability with OpenTelemetry
+
+pg-workflows ships a first-party `otelPlugin` that emits OpenTelemetry spans for workflow and step execution. `@opentelemetry/api` is an **optional peer dependency** — users who don't import the plugin pay zero runtime cost.
+
+## Quick start
+
+```bash
+npm install @opentelemetry/api @opentelemetry/sdk-node
+```
+
+```ts
+import { NodeSDK } from '@opentelemetry/sdk-node';
+import { workflow, otelPlugin } from 'pg-workflows';
+
+// Initialize your OTel SDK however you normally do. NodeSDK registers an
+// AsyncHooks context manager, which is required for hierarchical (parent/child)
+// spans across `await` boundaries inside workflow handlers.
+new NodeSDK({ /* exporters, resource, ... */ }).start();
+
+const tracedWorkflow = workflow.use(otelPlugin());
+
+const checkout = tracedWorkflow('checkout', async ({ step }) => {
+  await step.run('charge', async () => { /* ... */ });
+  await step.waitFor('await-shipment', { eventName: 'shipped' });
+});
+```
+
+## Span hierarchy
+
+Each worker execution of a workflow run produces one trace. A workflow that pauses (`step.waitFor`, `step.delay`, etc.) and resumes later produces a **new trace per resume cycle**. Traces are stitched together via the shared `workflow.id` and `workflow.run_id` attributes.
+
+```
+pg_workflows.workflow.run
+├── pg_workflows.step.run
+├── pg_workflows.step.waitFor
+├── pg_workflows.step.delay
+├── pg_workflows.step.waitUntil
+├── pg_workflows.step.pause
+├── pg_workflows.step.poll
+└── pg_workflows.step.invokeChildWorkflow
+```
+
+`step.sleep` is an alias for `step.delay`; calls to it emit a `pg_workflows.step.delay` span (semantic consistency — both represent a sleep).
+
+## Attributes
+
+| Span | Attributes |
+| ------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
+| `pg_workflows.workflow.run` | `workflow.id`, `workflow.run_id`, `workflow.attempt` (= `run.retryCount`), `workflow.resource_id` (when set), plus any user-supplied attrs |
+| `pg_workflows.step.` | `step.id`, `step.type` (matches the `StepType` enum value) |
+| Any span on error | `recordException(err)`, `status.code = ERROR`, `status.message = err.message` |
+| Any span on success | `status.code = OK` |
+
+## Cache-hit suppression
+
+When a workflow resumes after a pause, the handler re-runs from the top. Steps that completed in a prior execution return their cached output instantly. The plugin detects these cache-hit replays and **does not emit a span** for them.
+
+Detection is based on `context.timeline`:
+
+- A step has an output cached in the timeline (`timeline[stepId].output !== undefined`) → cache hit.
+- `step.invokeChildWorkflow` additionally checks for the in-flight binding key (`__invokeChildWorkflow:`) — a parent run that re-enters this step during a resume-while-child-still-running cycle is also treated as a cache hit.
+
+Exception: `step.poll` does not use the cache-hit guard. Each handler invocation that reaches `step.poll` represents a meaningful poll attempt worth tracing.
+
+## Plugin composition
+
+The OTel plugin uses the same `wrap(context, next)` middleware hook that any plugin can implement. If you register multiple plugins via `workflow.use(...)`, their wraps compose in registration order — the first plugin's wrap is outermost.
+
+```ts
+const w = workflow
+  .use(loggingPlugin) // outermost wrap
+  .use(otelPlugin()) // inner wrap (workflow.run span opens inside loggingPlugin)
+  ('checkout', async ({ step }) => { /* ... */ });
+```
+
+## Options
+
+```ts
+otelPlugin({
+  // Tracer to use. Defaults to `trace.getTracer('pg-workflows')`.
+  tracer: trace.getTracer('my-app'),
+
+  // Span name prefix. Defaults to 'pg_workflows'.
+  spanNamePrefix: 'pg_workflows',
+
+  // Optional callback returning extra attributes for the workflow.run span.
+  // Receives the WorkflowContext so you can extract values from the input
+  // or the run's resourceId.
+  attributes: (ctx) => ({ tenant: ctx.resourceId }),
+});
+```
+
+## Error semantics
+
+When a step or workflow handler throws:
+
+1. The span's exception is recorded via `span.recordException(error)`.
+2. The span status is set to `ERROR` with the error's message.
+3. The **original error** is re-thrown — engine retry/DLQ behaviour is unaffected.
+
+Non-`Error` throws (e.g., `throw 'msg'`) are coerced to an `Error` for the OTel API only; the original value is preserved on the re-throw path.
+
+## Not in v1
+
+These are deliberately out of scope for the initial release. The trace-propagation items share a common requirement (durable storage of trace context) and will likely land together when the underlying schema work is done.
+
+- **Metrics** (counters, histograms, gauges) — different OTel API surface; layers onto the same plugin hooks.
+- **Cross-execution trace context propagation** — paused workflows resume as a fresh root trace today. Linking the resume to the prior execution requires persisting the `traceparent` header.
+- **`step.invokeChildWorkflow` parent-trace continuation** — child runs start a fresh root trace. Same persistence question.
+- **Caller context propagation into `engine.startWorkflow`** — an incoming HTTP trace does not currently propagate into the workflow run.
+- **DLQ span emission** — `handleWorkflowRunDlq` runs outside the workflow's plugin chain. DLQ-induced FAILED states therefore don't produce a `workflow.run` span. The precipitating error is already recorded on the last per-execution span via the catch path.
+- **Sampling controls** — the plugin defers to your configured `TracerProvider` for sampling.
+
+## Requirements
+
+- `@opentelemetry/api` ^1.9.0 (optional peer)
+- An OTel SDK that registers an AsyncHooks context manager. `@opentelemetry/sdk-node`'s `NodeSDK` does this automatically. If you're wiring OTel manually, install `@opentelemetry/context-async-hooks` and call `context.setGlobalContextManager(new AsyncHooksContextManager().enable())`.
diff --git a/docs/superpowers/plans/2026-05-21-otel-instrumentation.md b/docs/superpowers/plans/2026-05-21-otel-instrumentation.md
deleted file mode 100644
index 8ffdbcd..0000000
--- a/docs/superpowers/plans/2026-05-21-otel-instrumentation.md
+++ /dev/null
@@ -1,1577 +0,0 @@
-# OpenTelemetry Instrumentation Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Ship a first-party `otelPlugin` that emits OpenTelemetry spans for workflow and step execution, with zero cost when not used.
-
-**Architecture:** Add an optional `wrap(context, next)` hook to `WorkflowPlugin` and pass `context` into `methods(step, context)`. The engine composes plugin wraps as middleware around the workflow handler. The OTel plugin opens one `workflow.run` span per execution via `wrap` and wraps every step method to open a child span, suppressing spans for cache-hit replays by inspecting `context.timeline`.
- -**Tech Stack:** TypeScript ESM/CJS, Vitest (unit suite uses PGlite), Biome (no semicolons, single quotes), `@opentelemetry/api` (optional peer), `@opentelemetry/sdk-trace-base` + `@opentelemetry/context-async-hooks` (devDeps for tests). - -**Spec:** `docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md` - ---- - -## File Map - -- **Create:** `src/plugins/otel.ts` — the plugin (~120 LOC). -- **Create:** `src/plugins/otel.test.ts` — full test coverage (~300 LOC). -- **Create:** `src/plugins/otel-test-helpers.ts` — tracer/exporter bootstrap shared by tests. -- **Modify:** `src/types.ts` — extend `WorkflowPlugin` with `wrap?` and add `context` param to `methods`. -- **Modify:** `src/engine.ts` — pass `context` to `plugin.methods`, compose `plugin.wrap` chain around handler call. -- **Modify:** `src/index.ts` — export `otelPlugin`. -- **Modify:** `package.json` — `@opentelemetry/api` optional peer dep, plus devDeps for testing. -- **Modify:** `README.md` — add Observability section. -- **Modify:** `AGENTS.md` — bullet under Core API. - ---- - -## Task 1: Add OpenTelemetry dependencies - -**Files:** -- Modify: `package.json` - -- [ ] **Step 1: Add `@opentelemetry/api` as optional peer dep and add devDeps** - -Edit `package.json`. Add to `peerDependencies`: - -```json -"peerDependencies": { - "pg": "^8.0.0", - "@opentelemetry/api": "^1.9.0" -} -``` - -Add new top-level `peerDependenciesMeta`: - -```json -"peerDependenciesMeta": { - "@opentelemetry/api": { "optional": true } -} -``` - -Add to `devDependencies` (keep alphabetical order): - -```json -"@opentelemetry/api": "^1.9.0", -"@opentelemetry/context-async-hooks": "^1.27.0", -"@opentelemetry/sdk-trace-base": "^1.27.0" -``` - -- [ ] **Step 2: Install** - -Run: `npm install` -Expected: lockfile updates; no errors. - -- [ ] **Step 3: Verify the rest of the build still works** - -Run: `npm run build` -Expected: exits 0. 
- -- [ ] **Step 4: Commit** - -```bash -git add package.json package-lock.json -git commit -m "build: add OpenTelemetry deps for otelPlugin" -``` - ---- - -## Task 2: Extend `WorkflowPlugin` interface in types.ts - -**Files:** -- Modify: `src/types.ts:90-96` - -- [ ] **Step 1: Update `WorkflowPlugin` interface** - -In `src/types.ts`, replace the `WorkflowPlugin` interface: - -```ts -/** - * Plugin that extends the workflow step API with extra methods. - * @template TStepBase - The step type this plugin receives (base + previous plugins). - * @template TStepExt - The extra methods this plugin adds to step. - */ -export interface WorkflowPlugin { - name: string - methods: (step: TStepBase, context: WorkflowContext) => TStepExt - /** - * Optional middleware around the workflow handler call. Composes in - * registration order — the first plugin passed to `.use()` wraps everything - * inside. Implementations MUST call `next()` exactly once. - */ - wrap?: (context: WorkflowContext, next: () => Promise) => Promise -} -``` - -- [ ] **Step 2: Run typecheck** - -Run: `npx tsc --noEmit` -Expected: existing plugin tests in `src/engine.test.ts` still compile (their `methods: (step) => ({...})` is assignable to `(step, context) => ({...})` because TS allows passing fewer params). - -- [ ] **Step 3: Run unit suite** - -Run: `npm run test:unit` -Expected: all existing tests pass; no behavioural change yet. - -- [ ] **Step 4: Commit** - -```bash -git add src/types.ts -git commit -m "feat(types): add wrap hook and context arg to WorkflowPlugin" -``` - ---- - -## Task 3: Wire engine to pass context and compose wrap chain - -**Files:** -- Modify: `src/engine.ts:1124-1140` -- Modify: `src/engine.test.ts` (add wrap composition test) - -- [ ] **Step 1: Write the failing test for wrap composition** - -Append to `src/engine.test.ts` inside the `describe('workflow.use(plugin)', () => { ... 
})` block: - -```ts -it('should call plugin.wrap around the handler and compose multiple wraps in registration order', async () => { - const calls: string[] = [] - - const outerPlugin: WorkflowPlugin = { - name: 'outer', - methods: () => ({}), - wrap: async (_ctx, next) => { - calls.push('outer:before') - const result = await next() - calls.push('outer:after') - return result - }, - } - - const innerPlugin: WorkflowPlugin = { - name: 'inner', - methods: () => ({}), - wrap: async (_ctx, next) => { - calls.push('inner:before') - const result = await next() - calls.push('inner:after') - return result - }, - } - - const engine = new WorkflowEngine({ workflows: [], pool: testPool, boss: testBoss }) - await engine.start() - - const wrapped = workflow - .use(outerPlugin) - .use(innerPlugin)('wrap-order-workflow', async ({ step }) => { - calls.push('handler') - await step.run('only-step', async () => 'ok') - return 'done' - }) - - await engine.registerWorkflow(wrapped) - const run = await engine.startWorkflow({ workflowId: 'wrap-order-workflow', input: {} }) - - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - expect(calls).toEqual([ - 'outer:before', - 'inner:before', - 'handler', - 'inner:after', - 'outer:after', - ]) - - await engine.stop() -}) -``` - -- [ ] **Step 2: Run test to verify it fails** - -Run: `npm run test:unit -- engine.test.ts -t "compose multiple wraps"` -Expected: FAIL (wrap is not invoked yet). - -- [ ] **Step 3: Modify `handleWorkflowRun` to pass context and compose wraps** - -In `src/engine.ts`, locate the block that currently reads (around lines 1124–1140): - -```ts - let step = { ...baseStep }; - const plugins = workflow.plugins ?? 
[]; - for (const plugin of plugins) { - const extra = plugin.methods(step); - step = { ...step, ...extra }; - } - - const context: WorkflowContext = { - input: run.input as InferInputParameters, - workflowId: run.workflowId, - runId: run.id, - timeline: run.timeline, - logger: this.logger, - step, - }; - - const result = await workflow.handler(context); -``` - -Replace it with: - -```ts - const plugins = workflow.plugins ?? []; - - const context: WorkflowContext = { - input: run.input as InferInputParameters, - workflowId: run.workflowId, - runId: run.id, - timeline: run.timeline, - logger: this.logger, - // step is populated below once plugins.methods has run - step: baseStep as WorkflowContext['step'], - }; - - let step = { ...baseStep }; - for (const plugin of plugins) { - const extra = plugin.methods(step, context); - step = { ...step, ...extra }; - } - context.step = step as WorkflowContext['step']; - - let next: () => Promise = () => workflow.handler(context); - for (const plugin of [...plugins].reverse()) { - if (plugin.wrap) { - const inner = next; - const wrap = plugin.wrap; - next = () => wrap(context, inner); - } - } - - const result = await next(); -``` - -Rationale: -- `context` is constructed before `plugin.methods` runs so methods can read `context.timeline` for cache-hit detection. -- `context.step` is assigned the composed step API afterward (the same object the handler sees). -- The wrap chain is built bottom-up: the last plugin's wrap is innermost, the first plugin's wrap is outermost. Plugins without `wrap` are skipped. - -- [ ] **Step 4: Run the failing test to verify it now passes** - -Run: `npm run test:unit -- engine.test.ts -t "compose multiple wraps"` -Expected: PASS, with `calls` in the exact order asserted. - -- [ ] **Step 5: Run the full unit suite to confirm no regressions** - -Run: `npm run test:unit` -Expected: all pass. 
- -- [ ] **Step 6: Commit** - -```bash -git add src/engine.ts src/engine.test.ts -git commit -m "feat(engine): compose plugin.wrap middleware and pass context to methods" -``` - ---- - -## Task 4: Create OTel test helpers - -**Files:** -- Create: `src/plugins/otel-test-helpers.ts` - -- [ ] **Step 1: Create the helper module** - -Create `src/plugins/otel-test-helpers.ts`: - -```ts -import { AsyncHooksContextManager } from '@opentelemetry/context-async-hooks' -import { context, type Tracer, trace } from '@opentelemetry/api' -import { - BasicTracerProvider, - InMemorySpanExporter, - type ReadableSpan, - SimpleSpanProcessor, -} from '@opentelemetry/sdk-trace-base' - -/** - * Build a fresh tracer + in-memory exporter for a single test. - * Callers MUST invoke `teardown()` in `afterEach`. - */ -export function setupOtel(): { - tracer: Tracer - getSpans: () => ReadableSpan[] - getSpansByName: (name: string) => ReadableSpan[] - teardown: () => Promise<void> -} { - const exporter = new InMemorySpanExporter() - const provider = new BasicTracerProvider({ - spanProcessors: [new SimpleSpanProcessor(exporter)], - }) - - // AsyncHooks context manager is required for nested step spans to attach - // to the workflow.run span across `await` boundaries. We register it - // globally because OTel's context API reads from the global manager. - const contextManager = new AsyncHooksContextManager().enable() - context.setGlobalContextManager(contextManager) - - const tracer = provider.getTracer('pg-workflows-test') - - return { - tracer, - getSpans: () => exporter.getFinishedSpans(), - getSpansByName: (name: string) => - exporter.getFinishedSpans().filter((s) => s.name === name), - teardown: async () => { - await provider.shutdown() - contextManager.disable() - context.disable() - trace.disable() - }, - } -} -``` - -- [ ] **Step 2: Run typecheck** - -Run: `npx tsc --noEmit` -Expected: clean. 
- -- [ ] **Step 3: Commit** - -```bash -git add src/plugins/otel-test-helpers.ts -git commit -m "test: add OTel test bootstrap helper" -``` - ---- - -## Task 5: Create plugin skeleton - -**Files:** -- Create: `src/plugins/otel.ts` -- Create: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Write the failing test — plugin registers and a workflow completes** - -Create `src/plugins/otel.test.ts`: - -```ts -import type pg from 'pg' -import type { PgBoss } from 'pg-boss' -import { afterAll, afterEach, beforeAll, beforeEach, describe, expect, it } from 'vitest' -import { workflow } from '../definition' -import { WorkflowEngine } from '../engine' -import { getBoss } from '../tests/pgboss' -import { closeTestDatabase, createTestDatabase } from '../tests/test-db' -import { WorkflowStatus } from '../types' -import { otelPlugin } from './otel' -import { setupOtel } from './otel-test-helpers' - -let testBoss: PgBoss -let testPool: pg.Pool - -beforeAll(async () => { - testPool = await createTestDatabase() - testBoss = await getBoss(testPool) -}) - -afterAll(async () => { - await closeTestDatabase() -}) - -describe('otelPlugin', () => { - let otel: ReturnType<typeof setupOtel> - let engine: WorkflowEngine - - beforeEach(async () => { - otel = setupOtel() - engine = new WorkflowEngine({ workflows: [], pool: testPool, boss: testBoss }) - await engine.start() - }) - - afterEach(async () => { - await engine.stop() - await otel.teardown() - }) - - it('registers and lets a workflow complete', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-smoke', - async ({ step }) => { - return await step.run('only', async () => 'ok') - }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-smoke', input: {} }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.COMPLETED, output: 'ok' }) - }) -}) -``` - -- [ ] **Step 2: Run test — should fail because `./otel` does not 
exist** - -Run: `npm run test:unit -- otel.test.ts` -Expected: FAIL with module-not-found. - -- [ ] **Step 3: Create skeleton plugin** - -Create `src/plugins/otel.ts`: - -```ts -import type { Tracer } from '@opentelemetry/api' -import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types' - -export type OtelPluginOptions = { - /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */ - tracer?: Tracer - /** Prefix for all span names. Defaults to `pg_workflows`. */ - spanNamePrefix?: string - /** Extra attributes merged onto the workflow.run span. */ - attributes?: (context: WorkflowContext) => Record -} - -const DEFAULT_PREFIX = 'pg_workflows' - -export function otelPlugin( - _options: OtelPluginOptions = {}, -): WorkflowPlugin { - return { - name: 'opentelemetry', - methods: () => ({}), - } -} -``` - -- [ ] **Step 4: Run test to verify it passes** - -Run: `npm run test:unit -- otel.test.ts` -Expected: PASS. - -- [ ] **Step 5: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): plugin skeleton" -``` - ---- - -## Task 6: Expose `resourceId` and `attempt` on `WorkflowContext` - -**Files:** -- Modify: `src/types.ts` -- Modify: `src/engine.ts` - -The `workflow.run` span (next task) needs `workflow.resource_id` and `workflow.attempt`. The current `WorkflowContext` doesn't expose them. Add the fields and populate from `run` in the engine. This is a pure-additive refactor — no new behaviour yet. - -- [ ] **Step 1: Extend `WorkflowContext` type** - -In `src/types.ts`, find the `WorkflowContext` type (around line 102) and add two new fields: - -```ts -export type WorkflowContext< - TInput extends InputParameters = InputParameters, - TStep extends StepBaseContext = StepBaseContext, -> = { - input: InferInputParameters - step: TStep - workflowId: string - runId: string - /** Tenant/scope identifier set when the run was started, if any. 
*/ - resourceId?: string - /** Zero-based retry attempt number (= `run.retryCount`). */ - attempt: number - timeline: Record - logger: WorkflowLogger -} -``` - -- [ ] **Step 2: Populate the new fields in the engine** - -In `src/engine.ts`, in the `context` construction inside `handleWorkflowRun` (added in Task 3), change to: - -```ts - const context: WorkflowContext = { - input: run.input as InferInputParameters, - workflowId: run.workflowId, - runId: run.id, - resourceId: run.resourceId ?? undefined, - attempt: run.retryCount, - timeline: run.timeline, - logger: this.logger, - step: baseStep as WorkflowContext['step'], - }; -``` - -- [ ] **Step 3: Run typecheck and unit suite** - -Run: `npx tsc --noEmit && npm run test:unit` -Expected: all pass; no behaviour change yet. - -- [ ] **Step 4: Commit** - -```bash -git add src/types.ts src/engine.ts -git commit -m "feat(types): expose resourceId and attempt on WorkflowContext" -``` - ---- - -## Task 7: Implement `workflow.run` span (happy path) - -**Files:** -- Modify: `src/plugins/otel.ts` -- Modify: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Add failing test for workflow.run span** - -Append to the `describe('otelPlugin', ...)` block in `src/plugins/otel.test.ts`: - -```ts -it('emits a workflow.run span on successful completion', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-wf-span', - async () => 'done', - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ - resourceId: 'tenant-1', - workflowId: 'otel-wf-span', - input: {}, - }) - await expect - .poll(async () => await engine.getRun({ runId: run.id, resourceId: 'tenant-1' })) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - const spans = otel.getSpansByName('pg_workflows.workflow.run') - expect(spans).toHaveLength(1) - expect(spans[0].attributes).toMatchObject({ - 'workflow.id': 'otel-wf-span', - 'workflow.run_id': run.id, - 'workflow.resource_id': 'tenant-1', - 'workflow.attempt': 0, - 
}) - expect(spans[0].status.code).toBe(1) // SpanStatusCode.OK -}) -``` - -- [ ] **Step 2: Run test to verify it fails** - -Run: `npm run test:unit -- otel.test.ts -t "workflow.run span on successful"` -Expected: FAIL — span list is empty (plugin still has no `wrap`). - -- [ ] **Step 3: Implement `wrap` in the plugin** - -Replace the contents of `src/plugins/otel.ts` with: - -```ts -import { - type AttributeValue, - SpanStatusCode, - type Tracer, - trace, -} from '@opentelemetry/api' -import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types' - -export type OtelPluginOptions = { - /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */ - tracer?: Tracer - /** Prefix for all span names. Defaults to `pg_workflows`. */ - spanNamePrefix?: string - /** Extra attributes merged onto the workflow.run span. */ - attributes?: (context: WorkflowContext) => Record -} - -const DEFAULT_PREFIX = 'pg_workflows' - -export function otelPlugin( - options: OtelPluginOptions = {}, -): WorkflowPlugin { - const tracer = options.tracer ?? trace.getTracer('pg-workflows') - const prefix = options.spanNamePrefix ?? DEFAULT_PREFIX - const extraAttrs = options.attributes - - return { - name: 'opentelemetry', - - methods: () => ({}), - - wrap: (context, next) => - tracer.startActiveSpan( - `${prefix}.workflow.run`, - { - attributes: { - 'workflow.id': context.workflowId, - 'workflow.run_id': context.runId, - 'workflow.attempt': context.attempt, - ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}), - ...(extraAttrs ? extraAttrs(context) : {}), - }, - }, - async (span) => { - try { - const result = await next() - span.setStatus({ code: SpanStatusCode.OK }) - return result - } finally { - span.end() - } - }, - ), - } -} -``` - -- [ ] **Step 4: Run the test** - -Run: `npm run test:unit -- otel.test.ts -t "workflow.run span on successful"` -Expected: PASS. 
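
How the `workflow.run` attributes are assembled in the `wrap` above can be illustrated without an OTel dependency. The following is a hedged sketch: `AttrValue`, `RunInfo`, and `buildRunAttributes` are hypothetical names introduced here, with `AttrValue` standing in for `AttributeValue` from `@opentelemetry/api`.

```typescript
// Simplified attribute value type; the plugin itself uses `AttributeValue`.
type AttrValue = string | number | boolean

// Hypothetical reduced context holding only the fields the span attributes read.
type RunInfo = { workflowId: string; runId: string; attempt: number; resourceId?: string }

function buildRunAttributes(
  run: RunInfo,
  extra?: (run: RunInfo) => Record<string, AttrValue>,
): Record<string, AttrValue> {
  return {
    'workflow.id': run.workflowId,
    'workflow.run_id': run.runId,
    'workflow.attempt': run.attempt,
    // Omit the key entirely when the run has no resourceId.
    ...(run.resourceId ? { 'workflow.resource_id': run.resourceId } : {}),
    // Spread last, so the callback's keys win on a name collision.
    ...(extra ? extra(run) : {}),
  }
}
```

Because the callback result is spread last, an `attributes` callback that returns a key such as `'workflow.attempt'` would replace the built-in value. Whether that is desirable is a design choice this sketch does not settle.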
- -- [ ] **Step 5: Run full unit suite — confirm no regressions** - -Run: `npm run test:unit` -Expected: all pass. - -- [ ] **Step 6: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): emit workflow.run span via wrap hook" -``` - ---- - -## Task 8: `workflow.run` span error path - -**Files:** -- Modify: `src/plugins/otel.ts` -- Modify: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Write failing test** - -Append to `describe('otelPlugin', ...)` in `src/plugins/otel.test.ts`: - -```ts -it('records exception and ERROR status on workflow.run when handler throws', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-wf-throw', - async ({ step }) => { - await step.run('boom', async () => { - throw new Error('kaboom') - }) - }, - { retries: 0 }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-wf-throw', input: {} }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.FAILED }) - - const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0] - expect(wfSpan.status.code).toBe(2) // SpanStatusCode.ERROR - expect(wfSpan.status.message).toBe('kaboom') - expect(wfSpan.events.some((e) => e.name === 'exception')).toBe(true) -}) -``` - -- [ ] **Step 2: Run to confirm failure** - -Run: `npm run test:unit -- otel.test.ts -t "ERROR status on workflow.run"` -Expected: FAIL — current `wrap` does not catch. - -- [ ] **Step 3: Update `wrap` to record exceptions** - -In `src/plugins/otel.ts`, replace the `wrap` arrow body: - -```ts - wrap: (context, next) => - tracer.startActiveSpan( - `${prefix}.workflow.run`, - { - attributes: { - 'workflow.id': context.workflowId, - 'workflow.run_id': context.runId, - 'workflow.attempt': context.attempt, - ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}), - ...(extraAttrs ? 
extraAttrs(context) : {}), - }, - }, - async (span) => { - try { - const result = await next() - span.setStatus({ code: SpanStatusCode.OK }) - return result - } catch (err) { - const error = err instanceof Error ? err : new Error(String(err)) - span.recordException(error) - span.setStatus({ code: SpanStatusCode.ERROR, message: error.message }) - throw err - } finally { - span.end() - } - }, - ), -``` - -- [ ] **Step 4: Run test — should pass** - -Run: `npm run test:unit -- otel.test.ts -t "ERROR status on workflow.run"` -Expected: PASS. - -- [ ] **Step 5: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): record exception on workflow.run span on failure" -``` - ---- - -## Task 9: `step.run` span with cache-hit suppression and error handling - -**Files:** -- Modify: `src/plugins/otel.ts` -- Modify: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Write three failing tests** - -Append to `src/plugins/otel.test.ts`: - -```ts -it('emits step.run span as a child of workflow.run', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-step-run-child', - async ({ step }) => { - return await step.run('foo', async () => 'bar') - }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-step-run-child', input: {} }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0] - const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0] - expect(stepSpan).toBeDefined() - expect(stepSpan.attributes).toMatchObject({ 'step.id': 'foo', 'step.type': 'run' }) - expect(stepSpan.parentSpanContext?.spanId).toBe(wfSpan.spanContext().spanId) -}) - -it('skips step.run span on cache-hit replay', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-cache-skip', - async ({ step }) => { - const a 
= await step.run('first', async () => 'A') - await step.waitFor('gate', { eventName: 'go' }) - const b = await step.run('second', async () => 'B') - return { a, b } - }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-cache-skip', input: {} }) - - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.PAUSED }) - - // First execution: workflow.run + step.run('first') + step.waitFor('gate') - expect(otel.getSpansByName('pg_workflows.step.run').map((s) => s.attributes['step.id'])).toEqual([ - 'first', - ]) - - await engine.triggerEvent({ runId: run.id, eventName: 'go' }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - // Second execution: NEW workflow.run + step.run('second') only. - // 'first' is a cache hit and emits no span. - const stepRunSpans = otel.getSpansByName('pg_workflows.step.run') - const ids = stepRunSpans.map((s) => s.attributes['step.id']) - expect(ids).toEqual(['first', 'second']) - expect(otel.getSpansByName('pg_workflows.workflow.run')).toHaveLength(2) -}) - -it('records exception and ERROR status on step.run when handler throws', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-step-throw', - async ({ step }) => { - await step.run('explode', async () => { - throw new Error('nope') - }) - }, - { retries: 0 }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-step-throw', input: {} }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.FAILED }) - - const stepSpan = otel.getSpansByName('pg_workflows.step.run')[0] - expect(stepSpan.status.code).toBe(2) - expect(stepSpan.status.message).toBe('nope') - expect(stepSpan.events.some((e) => e.name === 'exception')).toBe(true) -}) -``` - -- [ ] **Step 2: Run tests 
to confirm they fail** - -Run: `npm run test:unit -- otel.test.ts -t "step.run"` -Expected: all three FAIL — `methods` is still `() => ({})`. - -- [ ] **Step 3: Add a cache-hit predicate and step.run wrapper** - -In `src/plugins/otel.ts`, replace the file with this complete version: - -```ts -import { - type AttributeValue, - SpanStatusCode, - type Tracer, - trace, -} from '@opentelemetry/api' -import type { StepBaseContext, WorkflowContext, WorkflowPlugin } from '../types' - -export type OtelPluginOptions = { - /** Tracer to use. Defaults to `trace.getTracer('pg-workflows')`. */ - tracer?: Tracer - /** Prefix for all span names. Defaults to `pg_workflows`. */ - spanNamePrefix?: string - /** Extra attributes merged onto the workflow.run span. */ - attributes?: (context: WorkflowContext) => Record -} - -const DEFAULT_PREFIX = 'pg_workflows' - -function isCachedHit(timeline: Record, stepId: string): boolean { - const entry = timeline[stepId] - if ( - entry && - typeof entry === 'object' && - 'output' in entry && - (entry as { output: unknown }).output !== undefined - ) { - return true - } - return false -} - -async function traceStep( - tracer: Tracer, - name: string, - attrs: Record, - fn: () => Promise, -): Promise { - return tracer.startActiveSpan(name, { attributes: attrs }, async (span) => { - try { - const result = await fn() - span.setStatus({ code: SpanStatusCode.OK }) - return result - } catch (err) { - const error = err instanceof Error ? err : new Error(String(err)) - span.recordException(error) - span.setStatus({ code: SpanStatusCode.ERROR, message: error.message }) - throw err - } finally { - span.end() - } - }) -} - -export function otelPlugin( - options: OtelPluginOptions = {}, -): WorkflowPlugin { - const tracer = options.tracer ?? trace.getTracer('pg-workflows') - const prefix = options.spanNamePrefix ?? 
DEFAULT_PREFIX - const extraAttrs = options.attributes - - return { - name: 'opentelemetry', - - methods: (step, context) => ({ - run: async (stepId: string, handler: () => Promise) => { - if (isCachedHit(context.timeline, stepId)) { - return step.run(stepId, handler) - } - return traceStep( - tracer, - `${prefix}.step.run`, - { 'step.id': stepId, 'step.type': 'run' }, - () => step.run(stepId, handler), - ) - }, - }), - - wrap: (context, next) => - tracer.startActiveSpan( - `${prefix}.workflow.run`, - { - attributes: { - 'workflow.id': context.workflowId, - 'workflow.run_id': context.runId, - 'workflow.attempt': context.attempt, - ...(context.resourceId ? { 'workflow.resource_id': context.resourceId } : {}), - ...(extraAttrs ? extraAttrs(context) : {}), - }, - }, - async (span) => { - try { - const result = await next() - span.setStatus({ code: SpanStatusCode.OK }) - return result - } catch (err) { - const error = err instanceof Error ? err : new Error(String(err)) - span.recordException(error) - span.setStatus({ code: SpanStatusCode.ERROR, message: error.message }) - throw err - } finally { - span.end() - } - }, - ), - } -} -``` - -Note: `methods` overrides `run` only — `step.run` returns the existing base method otherwise (which lives on the `step` object passed in). The other base methods (`waitFor`, `pause`, etc.) are still accessible because the engine merges `extra` over `step` (see `src/engine.ts:1128-1129`); overriding `run` shadows only that one method. - -- [ ] **Step 4: Run all three step.run tests — they should pass** - -Run: `npm run test:unit -- otel.test.ts -t "step.run"` -Expected: PASS. - -- [ ] **Step 5: Run full unit suite — confirm no regressions** - -Run: `npm run test:unit` -Expected: all pass. 
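
The error path of `traceStep` can be exercised against a stub instead of a real tracer. This is a minimal sketch, assuming a hypothetical `MiniSpan` interface and `traced` helper that mirror only the three span calls the plugin makes; the real code uses `Span` and `SpanStatusCode` from `@opentelemetry/api`.

```typescript
// Hypothetical minimal span surface used only for this illustration.
interface MiniSpan {
  recordException(error: Error): void
  setStatus(status: { code: 'OK' | 'ERROR'; message?: string }): void
  end(): void
}

async function traced<T>(span: MiniSpan, fn: () => Promise<T>): Promise<T> {
  try {
    const result = await fn()
    span.setStatus({ code: 'OK' })
    return result
  } catch (err) {
    // Coerce only for the span API; the caller still receives the original value.
    const error = err instanceof Error ? err : new Error(String(err))
    span.recordException(error)
    span.setStatus({ code: 'ERROR', message: error.message })
    throw err
  } finally {
    // Runs on both paths, so the span is always closed.
    span.end()
  }
}
```

Throwing a bare string through `traced` records an `Error` on the span while the caller still catches the string itself, which matches the intent that engine retry handling sees the original thrown value.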
- -- [ ] **Step 6: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): wrap step.run with span, cache-hit suppression, error path" -``` - ---- - -## Task 10: Spans for `waitFor`, `delay`, `waitUntil`, `pause` - -**Files:** -- Modify: `src/plugins/otel.ts` -- Modify: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Write failing test** - -Append to `src/plugins/otel.test.ts`: - -```ts -it('emits spans for waitFor, delay, waitUntil, pause', async () => { - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-other-steps', - async ({ step }) => { - await step.waitFor('wf', { eventName: 'evt' }) - await step.delay('d', '1ms') - await step.waitUntil('wu', new Date(Date.now() + 1)) - await step.pause('p') - return 'ok' - }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-other-steps', input: {} }) - - // Workflow pauses immediately on first waitFor; resume it through completion. - const drive = async () => { - for (let i = 0; i < 20; i++) { - const r = await engine.getRun({ runId: run.id }) - if (r.status === WorkflowStatus.PAUSED) break - await new Promise((res) => setTimeout(res, 25)) - } - } - await drive() - await engine.triggerEvent({ runId: run.id, eventName: 'evt' }) - await drive() - // delay + waitUntil resolve themselves; pause needs an explicit resume - await engine.resumeWorkflow({ runId: run.id }) - await expect - .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - const stepNames = otel - .getSpans() - .map((s) => s.name) - .filter((n) => n.startsWith('pg_workflows.step.')) - expect(stepNames).toEqual( - expect.arrayContaining([ - 'pg_workflows.step.waitFor', - 'pg_workflows.step.delay', - 'pg_workflows.step.waitUntil', - 'pg_workflows.step.pause', - ]), - ) - const waitForSpan = otel.getSpansByName('pg_workflows.step.waitFor')[0] - 
expect(waitForSpan.attributes).toMatchObject({ 'step.id': 'wf', 'step.type': 'waitFor' }) -}) -``` - -- [ ] **Step 2: Run test — should fail** - -Run: `npm run test:unit -- otel.test.ts -t "spans for waitFor"` -Expected: FAIL. - -- [ ] **Step 3: Extend `methods` with the four new wrappers** - -In `src/plugins/otel.ts`, replace the `methods` field of the returned plugin with: - -```ts - methods: (step, context) => ({ - run: async (stepId: string, handler: () => Promise) => { - if (isCachedHit(context.timeline, stepId)) { - return step.run(stepId, handler) - } - return traceStep( - tracer, - `${prefix}.step.run`, - { 'step.id': stepId, 'step.type': 'run' }, - () => step.run(stepId, handler), - ) - }, - waitFor: ((stepId: string, opts: Parameters[1]) => { - if (isCachedHit(context.timeline, stepId)) { - return step.waitFor(stepId, opts) - } - return traceStep( - tracer, - `${prefix}.step.waitFor`, - { 'step.id': stepId, 'step.type': 'waitFor' }, - () => step.waitFor(stepId, opts) as Promise, - ) - }) as StepBaseContext['waitFor'], - delay: async (stepId: string, duration: Parameters[1]) => { - if (isCachedHit(context.timeline, stepId)) { - return step.delay(stepId, duration) - } - await traceStep( - tracer, - `${prefix}.step.delay`, - { 'step.id': stepId, 'step.type': 'delay' }, - () => step.delay(stepId, duration), - ) - }, - waitUntil: ((stepId: string, dateOrOptions: Parameters[1]) => { - if (isCachedHit(context.timeline, stepId)) { - return step.waitUntil(stepId, dateOrOptions) - } - return traceStep( - tracer, - `${prefix}.step.waitUntil`, - { 'step.id': stepId, 'step.type': 'waitUntil' }, - () => step.waitUntil(stepId, dateOrOptions), - ) - }) as StepBaseContext['waitUntil'], - pause: async (stepId: string) => { - if (isCachedHit(context.timeline, stepId)) { - return step.pause(stepId) - } - await traceStep( - tracer, - `${prefix}.step.pause`, - { 'step.id': stepId, 'step.type': 'pause' }, - () => step.pause(stepId), - ) - }, - }), -``` - -The `as 
StepBaseContext['waitFor']` / `as StepBaseContext['waitUntil']` casts are required because both methods are overloaded — TypeScript can't infer the overload union from the implementation alone. - -- [ ] **Step 4: Run test** - -Run: `npm run test:unit -- otel.test.ts -t "spans for waitFor"` -Expected: PASS. - -- [ ] **Step 5: Run full unit suite** - -Run: `npm run test:unit` -Expected: all pass. - -- [ ] **Step 6: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): wrap waitFor, delay, waitUntil, pause with spans" -``` - ---- - -## Task 11: `step.poll` span - -**Files:** -- Modify: `src/plugins/otel.ts` -- Modify: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Write failing test** - -Append to `src/plugins/otel.test.ts`: - -```ts -it('emits step.poll span on each poll attempt', async () => { - let attempt = 0 - const w = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-poll', - async ({ step }) => { - const result = await step.poll( - 'poller', - async () => { - attempt += 1 - return attempt >= 2 ? 
{ value: attempt } : false - }, - { interval: '30s', timeout: '60s' }, - ) - return result - }, - ) - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-poll', input: {} }) - - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.PAUSED }) - - // First execution emitted exactly one step.poll span - const firstPolls = otel.getSpansByName('pg_workflows.step.poll') - expect(firstPolls).toHaveLength(1) - expect(firstPolls[0].attributes).toMatchObject({ 'step.id': 'poller', 'step.type': 'poll' }) - - // Simulate the poll-interval re-fire via fastForwardWorkflow - await engine.fastForwardWorkflow({ runId: run.id }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - // Second execution emits a new poll span (the previous one is not a cache hit - // because the step's *output* is not yet in timeline, only a poll-state entry) - expect(otel.getSpansByName('pg_workflows.step.poll').length).toBeGreaterThanOrEqual(2) -}) -``` - -- [ ] **Step 2: Run — should fail** - -Run: `npm run test:unit -- otel.test.ts -t "step.poll"` -Expected: FAIL. - -- [ ] **Step 3: Add `poll` wrapper to `methods`** - -In `src/plugins/otel.ts`, inside the `methods` returned object (Task 10), add after `pause`: - -```ts - poll: (async ( - stepId: string, - conditionFn: () => Promise, - pollOptions?: Parameters[2], - ) => { - if (isCachedHit(context.timeline, stepId)) { - return step.poll(stepId, conditionFn, pollOptions) - } - return traceStep( - tracer, - `${prefix}.step.poll`, - { 'step.id': stepId, 'step.type': 'poll' }, - () => step.poll(stepId, conditionFn, pollOptions), - ) - }) as StepBaseContext['poll'], -``` - -- [ ] **Step 4: Run tests** - -Run: `npm run test:unit -- otel.test.ts -t "step.poll"` -Expected: PASS. 
- -- [ ] **Step 5: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): wrap step.poll with span" -``` - ---- - -## Task 12: `step.invokeChildWorkflow` span with binding-key cache check - -**Files:** -- Modify: `src/plugins/otel.ts` -- Modify: `src/plugins/otel.test.ts` - -The cache-hit detection for `invokeChildWorkflow` is different: an in-flight child resume has a binding entry (`__invokeChildWorkflow:`) but no `[stepId].output` yet. We must skip the span in that case too. - -- [ ] **Step 1: Write failing test** - -Append to `src/plugins/otel.test.ts`: - -```ts -it('emits invokeChildWorkflow span on creation and skips on cache-hit resume', async () => { - const child = workflow('otel-child', async () => 'child-done') - await engine.registerWorkflow(child) - - const parent = workflow.use(otelPlugin({ tracer: otel.tracer }))( - 'otel-parent', - async ({ step }) => { - const r = await step.invokeChildWorkflow('call-child', child) - return r - }, - ) - await engine.registerWorkflow(parent) - const run = await engine.startWorkflow({ workflowId: 'otel-parent', input: {} }) - - await expect - .poll(async () => await engine.getRun({ runId: run.id }), { timeout: 5000 }) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - const invokeSpans = otel.getSpansByName('pg_workflows.step.invokeChildWorkflow') - expect(invokeSpans).toHaveLength(1) - expect(invokeSpans[0].attributes).toMatchObject({ - 'step.id': 'call-child', - 'step.type': 'invokeChildWorkflow', - }) -}) -``` - -The single-span assertion proves both behaviors: a span is emitted on the create-and-pause execution, and on the resume execution the cached binding (plus eventual cached output) prevents a duplicate span. - -- [ ] **Step 2: Run — should fail** - -Run: `npm run test:unit -- otel.test.ts -t "invokeChildWorkflow"` -Expected: FAIL. 
- -- [ ] **Step 3: Import the binding-key helper and extend cache predicate** - -In `src/plugins/otel.ts`, add an import at the top: - -```ts -import { invokeChildWorkflowTimelineKey } from '../constants' -``` - -Replace `isCachedHit` with a kind-aware version: - -```ts -function isCachedHit( - timeline: Record, - stepId: string, - kind: 'run' | 'waitFor' | 'delay' | 'waitUntil' | 'pause' | 'poll' | 'invokeChildWorkflow', -): boolean { - const entry = timeline[stepId] - if ( - entry && - typeof entry === 'object' && - 'output' in entry && - (entry as { output: unknown }).output !== undefined - ) { - return true - } - if (kind === 'invokeChildWorkflow' && timeline[invokeChildWorkflowTimelineKey(stepId)]) { - return true - } - return false -} -``` - -Update every existing caller in `methods` to pass the new `kind` arg. Example for `run`: - -```ts - if (isCachedHit(context.timeline, stepId, 'run')) { - return step.run(stepId, handler) - } -``` - -Apply the same pattern to `waitFor` ('waitFor'), `delay` ('delay'), `waitUntil` ('waitUntil'), `pause` ('pause'), `poll` ('poll'). - -- [ ] **Step 4: Add the `invokeChildWorkflow` wrapper to `methods`** - -Inside the `methods` returned object, after `poll`, add: - -```ts - invokeChildWorkflow: (async ( - stepId: string, - refOrParams: Parameters[1], - inputArg?: unknown, - optionsArg?: unknown, - ) => { - if (isCachedHit(context.timeline, stepId, 'invokeChildWorkflow')) { - return (step.invokeChildWorkflow as ( - ...args: unknown[] - ) => Promise)(stepId, refOrParams, inputArg, optionsArg) - } - return traceStep( - tracer, - `${prefix}.step.invokeChildWorkflow`, - { 'step.id': stepId, 'step.type': 'invokeChildWorkflow' }, - () => - (step.invokeChildWorkflow as ( - ...args: unknown[] - ) => Promise)(stepId, refOrParams, inputArg, optionsArg), - ) - }) as StepBaseContext['invokeChildWorkflow'], -``` - -- [ ] **Step 5: Run test** - -Run: `npm run test:unit -- otel.test.ts -t "invokeChildWorkflow"` -Expected: PASS. 
- -- [ ] **Step 6: Run full unit suite** - -Run: `npm run test:unit` -Expected: all pass. - -- [ ] **Step 7: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "feat(otel): wrap step.invokeChildWorkflow with binding-aware cache check" -``` - ---- - -## Task 13: Cache-hit predicate unit test - -**Files:** -- Modify: `src/plugins/otel.test.ts` -- Modify: `src/plugins/otel.ts` (export `isCachedHit`) - -- [ ] **Step 1: Export `isCachedHit` from the plugin module** - -In `src/plugins/otel.ts`, change `function isCachedHit` to `export function isCachedHit`. - -- [ ] **Step 2: Write the unit test** - -Append to `src/plugins/otel.test.ts` *outside* the existing `describe('otelPlugin', ...)` block (top level inside the file): - -```ts -import { invokeChildWorkflowTimelineKey } from '../constants' -import { isCachedHit } from './otel' - -describe('isCachedHit', () => { - it('returns true when output is recorded for stepId', () => { - expect(isCachedHit({ s: { output: 'x', timestamp: new Date() } }, 's', 'run')).toBe(true) - }) - - it('returns false when output is undefined', () => { - expect(isCachedHit({ s: { output: undefined, timestamp: new Date() } }, 's', 'run')).toBe( - false, - ) - }) - - it('returns false when timeline has no entry for stepId', () => { - expect(isCachedHit({}, 's', 'run')).toBe(false) - }) - - it('returns false for non-object entry', () => { - expect(isCachedHit({ s: 'not-an-object' }, 's', 'run')).toBe(false) - }) - - it('returns true for invokeChildWorkflow when only the binding key is present', () => { - const timeline = { [invokeChildWorkflowTimelineKey('s')]: { invokeChildWorkflow: {} } } - expect(isCachedHit(timeline, 's', 'invokeChildWorkflow')).toBe(true) - expect(isCachedHit(timeline, 's', 'run')).toBe(false) - }) -}) -``` - -- [ ] **Step 3: Run tests** - -Run: `npm run test:unit -- otel.test.ts -t "isCachedHit"` -Expected: PASS (all 5 cases). 
- -- [ ] **Step 4: Commit** - -```bash -git add src/plugins/otel.ts src/plugins/otel.test.ts -git commit -m "test(otel): direct coverage for isCachedHit predicate" -``` - ---- - -## Task 14: Plugin composition order with otelPlugin - -**Files:** -- Modify: `src/plugins/otel.test.ts` - -- [ ] **Step 1: Add composition test** - -Append to the `describe('otelPlugin', ...)` block in `src/plugins/otel.test.ts`: - -```ts -it('composes wrap with another plugin in registration order', async () => { - const calls: string[] = [] - const trackerPlugin: WorkflowPlugin = { - name: 'tracker', - methods: () => ({}), - wrap: async (_ctx, next) => { - calls.push('tracker:before') - const r = await next() - calls.push('tracker:after') - return r - }, - } - - const w = workflow - .use(trackerPlugin) - .use(otelPlugin({ tracer: otel.tracer }))('otel-compose', async () => 'ok') - await engine.registerWorkflow(w) - const run = await engine.startWorkflow({ workflowId: 'otel-compose', input: {} }) - await expect - .poll(async () => await engine.getRun({ runId: run.id })) - .toMatchObject({ status: WorkflowStatus.COMPLETED }) - - // tracker registered first, so its wrap is outermost — its before runs - // before the workflow.run span opens, and its after runs after the span ends. - const wfSpan = otel.getSpansByName('pg_workflows.workflow.run')[0] - expect(wfSpan).toBeDefined() - expect(calls).toEqual(['tracker:before', 'tracker:after']) -}) -``` - -Add `import type { StepBaseContext, WorkflowPlugin } from '../types'` to the top of the file if not already present. - -- [ ] **Step 2: Run test** - -Run: `npm run test:unit -- otel.test.ts -t "composes wrap"` -Expected: PASS. 
- -- [ ] **Step 3: Commit** - -```bash -git add src/plugins/otel.test.ts -git commit -m "test(otel): verify plugin composition order with another wrap" -``` - ---- - -## Task 15: Export `otelPlugin` and document - -**Files:** -- Modify: `src/index.ts` -- Modify: `README.md` -- Modify: `AGENTS.md` - -- [ ] **Step 1: Re-export from the main entry** - -In `src/index.ts`, add: - -```ts -export { otelPlugin, type OtelPluginOptions } from './plugins/otel' -``` - -- [ ] **Step 2: Add Observability section to README.md** - -In `README.md`, add a new top-level section near the existing API documentation (preserve the project's tone and heading level — `##`): - -````markdown -## Observability with OpenTelemetry - -pg-workflows ships a first-party plugin that emits OTel spans for workflow and step execution. `@opentelemetry/api` is an optional peer dependency — install it only if you want tracing. - -```bash -npm install @opentelemetry/api @opentelemetry/sdk-node -``` - -```ts -import { NodeSDK } from '@opentelemetry/sdk-node' -import { trace } from '@opentelemetry/api' -import { workflow, otelPlugin } from 'pg-workflows' - -// Initialize your OTel SDK however you normally do — for Node apps the -// NodeSDK registers an AsyncHooks context manager, which is required for -// hierarchical (parent/child) spans across async boundaries. -new NodeSDK({ /* exporters, resource, ... */ }).start() - -const tracedWorkflow = workflow.use(otelPlugin()) - -const myWorkflow = tracedWorkflow('checkout', async ({ step }) => { - await step.run('charge', async () => { /* ... */ }) - await step.waitFor('await-shipment', { eventName: 'shipped' }) -}) -``` - -The plugin emits a `pg_workflows.workflow.run` span per worker execution (one per resume cycle), with child spans per step kind (`pg_workflows.step.run`, `pg_workflows.step.waitFor`, etc.). Spans carry `workflow.id`, `workflow.run_id`, `workflow.attempt` and, where set, `workflow.resource_id`. 
Steps replayed from cache after a pause emit no spans. - -**Options:** - -```ts -otelPlugin({ - tracer: trace.getTracer('my-app'), // default: trace.getTracer('pg-workflows') - spanNamePrefix: 'pg_workflows', // default shown - attributes: (ctx) => ({ tenant: ctx.resourceId }), // extra attrs merged onto workflow.run -}) -``` - -Metrics, distributed trace context propagation across child workflows, and HTTP-caller context propagation are not in v1 — see [the design doc](docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md) for the deferral rationale. -```` - -- [ ] **Step 3: Add a subsection to AGENTS.md under Core API** - -In `AGENTS.md` (which is also `CLAUDE.md`), find the `## Core API` section. Add a new subsection after the existing `WorkflowEngine` block: - -````markdown -### `otelPlugin(options?)` - OpenTelemetry tracing - -```typescript -import { workflow, otelPlugin } from 'pg-workflows'; - -// Optional peer dep: install `@opentelemetry/api` and an OTel SDK (e.g. NodeSDK). -// One `pg_workflows.workflow.run` span per worker execution, with child spans -// per step kind. Steps replayed from cache after a pause emit no spans. -const tracedWorkflow = workflow.use(otelPlugin({ - // tracer?: Tracer // default: trace.getTracer('pg-workflows') - // spanNamePrefix?: string // default: 'pg_workflows' - // attributes?: (ctx) => Record -})); -``` -```` - -- [ ] **Step 4: Run full unit suite and build** - -Run: `npm run test:unit` -Expected: all pass. - -Run: `npm run build` -Expected: exits 0. - -Run: `npm run lint` -Expected: exits 0 (or run `npm run lint:fix` and re-stage if Biome flags formatting). - -- [ ] **Step 5: Commit** - -```bash -git add src/index.ts README.md AGENTS.md -git commit -m "feat(otel): export otelPlugin and document usage" -``` - ---- - -## Verification before declaring done - -- [ ] **Step 1: Full test suite passes** - -Run: `npm test` -Expected: unit + integration both green.
If integration requires a Postgres URL the user hasn't provided, run only `npm run test:unit` and note the gap. - -- [ ] **Step 2: Build cleanly** - -Run: `npm run clean && npm run build` -Expected: exits 0. `dist/` contains the plugin output. - -- [ ] **Step 3: Lint** - -Run: `npm run lint` -Expected: clean. Otherwise `npm run lint:fix` and re-stage anything modified. - -- [ ] **Step 4: Spec coverage walk-through** - -Open `docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md` and confirm every "In scope" bullet has a matching task. Confirm every "Out of scope for v1" bullet is documented in the README's deferral pointer. diff --git a/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md b/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md deleted file mode 100644 index 2745165..0000000 --- a/docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md +++ /dev/null @@ -1,202 +0,0 @@ -# OpenTelemetry Instrumentation — Design - -- **Issue:** [#34](https://github.com/SokratisVidros/pg-workflows/issues/34) -- **Status:** Approved for implementation -- **Date:** 2026-05-21 - -## Goal - -Allow pg-workflows users to emit OpenTelemetry traces for workflow and step execution, with zero runtime cost when unused. - -## Scope (v1) - -**In scope:** - -- A first-party plugin, `otelPlugin`, shipped from the `pg-workflows` package. -- A `workflow.run` span per worker execution of a workflow run, with child spans for each step kind (`step.run`, `step.waitFor`, `step.delay`, `step.waitUntil`, `step.pause`, `step.poll`, `step.invokeChildWorkflow`). -- Hierarchical traces via OpenTelemetry's AsyncLocalStorage active context (no manual context plumbing in user workflows). -- Suppression of spans for cache-hit step replays. -- Optional peer dependency on `@opentelemetry/api`. Non-users pay zero cost. - -**Out of scope for v1** (see [Out of scope](#out-of-scope-for-v1) below for rationale and deferral notes). 
- -## Decisions - -| Decision | Choice | -| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | -| Distribution | First-party plugin in `pg-workflows`. Optional peer dep on `@opentelemetry/api`. | -| Scope | Step spans + a parent `workflow.run` span (hierarchical traces). Metrics deferred. | -| Span lifetime | One span per worker execution of the run. A long-paused workflow produces multiple traces, stitched via `workflow.id` / `workflow.run_id` attributes. | -| Plugin hook shape | A new optional `wrap(context, next)` hook on `WorkflowPlugin`. Composes as middleware. Better fit for `tracer.startActiveSpan` than a before/after pair. | -| Cache-hit replay handling | Skip spans for cache-hit step calls. Detected via `context.timeline[stepId]?.output !== undefined` (plus the invoke-child binding key for that step kind). | - -## Architecture - -### Plugin interface extension (`src/types.ts`) - -```ts -export interface WorkflowPlugin { - name: string; - methods: (step: TStepBase, context: WorkflowContext) => TStepExt; - wrap?: (context: WorkflowContext, next: () => Promise) => Promise; -} -``` - -`methods` gains a `context` argument so plugins can inspect the timeline for cache-hit detection. The change is additive — existing plugins that ignore the new arg compile unchanged. - -`wrap` is optional. When present, the engine inserts it into a middleware chain around the workflow handler invocation. 
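
To make the middleware behaviour of `wrap` concrete, here is a minimal, self-contained sketch of how a chain of optional `wrap` hooks might be folded around a handler. `Ctx`, `PluginLike`, `composeWraps` and `logging` are hypothetical stand-ins for illustration, not the engine's real types or functions.

```ts
// Hypothetical stand-ins; the real types live in src/types.ts.
type Ctx = { runId: string }
type Wrap = (context: Ctx, next: () => Promise<unknown>) => Promise<unknown>
interface PluginLike {
  name: string
  wrap?: Wrap
}

// Fold plugins right-to-left so the first plugin in the list ends up outermost.
function composeWraps(
  plugins: PluginLike[],
  context: Ctx,
  handler: () => Promise<unknown>,
): () => Promise<unknown> {
  let next = handler
  for (const plugin of [...plugins].reverse()) {
    const wrap = plugin.wrap
    if (!wrap) continue // plugins without a wrap hook are skipped
    const inner = next
    next = () => wrap(context, inner)
  }
  return next
}

// A tiny plugin factory that records when its wrap runs.
const calls: string[] = []
const logging = (name: string): PluginLike => ({
  name,
  wrap: async (_ctx, next) => {
    calls.push(`${name}:before`)
    const result = await next()
    calls.push(`${name}:after`)
    return result
  },
})

const run = composeWraps(
  [logging('first'), { name: 'no-wrap' }, logging('second')],
  { runId: 'run_1' },
  async () => {
    calls.push('handler')
    return 'ok'
  },
)
```

Awaiting `run()` in this sketch records `first:before`, `second:before`, `handler`, `second:after`, `first:after`, so the first plugin listed is the outermost wrap.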
- -### Engine wiring (`src/engine.ts`) - -Inside `handleWorkflowRun`, after composing `step` via `plugin.methods(step, context)`, the handler call site changes from: - -```ts -const result = await workflow.handler(context); -``` - -to: - -```ts -let next: () => Promise = () => workflow.handler(context); -for (const plugin of [...plugins].reverse()) { - if (plugin.wrap) { - const inner = next; - next = () => plugin.wrap!(context, inner); - } -} -const result = await next(); -``` - -Order rules: the first plugin passed to `.use()` is the outermost wrap. Multiple plugins compose as standard middleware. - -### OTel plugin (`src/plugins/otel.ts`) - -Exported from the package's main entry as `otelPlugin`. - -**Public API:** - -```ts -import { otelPlugin } from 'pg-workflows'; -import { trace } from '@opentelemetry/api'; - -const tracedWorkflow = workflow.use(otelPlugin({ - tracer: trace.getTracer('my-app'), // optional; default: trace.getTracer('pg-workflows', VERSION) - spanNamePrefix: 'pg_workflows', // optional; default shown - attributes: (ctx) => ({ tenant: ctx.input.tenantId }), // optional; merged onto workflow.run span -})); -``` - -**Behaviour:** - -- `wrap` opens a `${spanNamePrefix}.workflow.run` active span around `next()`. On thrown error: `span.recordException(err)`, `setStatus({ code: ERROR })`, re-throw. On clean return: `setStatus({ code: OK })`. Span ends in `finally`. -- `methods` returns a step API where every method is wrapped to open `${spanNamePrefix}.step.` spans, but only when the corresponding timeline slot is empty. -- All spans share parent context via `tracer.startActiveSpan`. AsyncLocalStorage handles propagation through `await` boundaries automatically. 
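
The span lifecycle in the first Behaviour bullet can be sketched without any OTel dependency. `SpanLike`, `RecordingSpan` and `runInSpan` below are hypothetical stand-ins; an actual plugin would presumably use `Span` and `SpanStatusCode` from `@opentelemetry/api` inside `tracer.startActiveSpan`. The point here is only the ordering of status, exception recording, re-throw and `end()`.

```ts
// Hypothetical minimal span shape, standing in for the OTel Span interface.
type Status = { code: 'OK' | 'ERROR'; message?: string }
interface SpanLike {
  recordException(err: unknown): void
  setStatus(status: Status): void
  end(): void
}

// In-memory span used only to observe what the helper did.
class RecordingSpan implements SpanLike {
  exceptions: unknown[] = []
  status?: Status
  ended = false
  recordException(err: unknown): void {
    this.exceptions.push(err)
  }
  setStatus(status: Status): void {
    this.status = status
  }
  end(): void {
    this.ended = true
  }
}

// Runs `next` inside the span: OK on clean return, ERROR plus a recorded
// exception on throw, and the span always ends in `finally`.
async function runInSpan<T>(span: SpanLike, next: () => Promise<T>): Promise<T> {
  try {
    const result = await next()
    span.setStatus({ code: 'OK' })
    return result
  } catch (err) {
    span.recordException(err)
    span.setStatus({
      code: 'ERROR',
      message: err instanceof Error ? err.message : String(err),
    })
    throw err // re-throw so the caller's error handling is unchanged
  } finally {
    span.end()
  }
}
```

Because the error is re-thrown rather than swallowed, the engine still sees the failure, which matches the spec's requirement that retry semantics stay unchanged.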
- -### Span hierarchy and attributes - -``` -pg_workflows.workflow.run -├── pg_workflows.step.run -├── pg_workflows.step.waitFor -├── pg_workflows.step.delay -├── pg_workflows.step.waitUntil -├── pg_workflows.step.pause -├── pg_workflows.step.poll -└── pg_workflows.step.invokeChildWorkflow -``` - -| Span | Attributes | -| ------------------------------------ | --------------------------------------------------------------------------------------------------- | -| `workflow.run` | `workflow.id`, `workflow.run_id`, `workflow.resource_id` (if present), `workflow.attempt` (= `run.retryCount`), plus anything from the user's `attributes(ctx)` callback | -| `step.` (all kinds) | `step.id`, `step.type` (matches the `StepType` enum value) | -| `step.invokeChildWorkflow` | Plus `child.workflow_id`, `child.run_id` once the child run has been created | -| Any span on error | `recordException(err)`, `setStatus({ code: ERROR, message })` | - -### Cache-hit suppression - -Before opening a span, each wrapped step method checks: - -```ts -function isCachedHit(ctx: WorkflowContext, stepId: string, kind: StepType): boolean { - const entry = ctx.timeline[stepId]; - if (entry && typeof entry === 'object' && 'output' in entry && (entry as any).output !== undefined) { - return true; - } - if (kind === StepType.INVOKE_CHILD_WORKFLOW) { - const binding = ctx.timeline[`__invokeChildWorkflow:${stepId}`]; - if (binding) return true; // in-flight resume; will produce no new work this execution - } - return false; -} -``` - -When cached, the wrapper passes through to the base step method without opening a span. The timeline snapshot is taken at handler entry, so steps completed during the *current* execution are still spanned correctly. 
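
The sketch below illustrates why checking against an entry-time snapshot keeps steps that complete during the current execution visible to tracing. It assumes the snapshot is a shallow copy of the timeline and uses a two-kind subset of step types. `Timeline`, `StepKind`, `bindingKey`, `live` and `snapshot` are hypothetical names for illustration only.

```ts
// Hypothetical, simplified shapes; the real timeline type lives in the engine.
type Timeline = Record<string, unknown>
type StepKind = 'run' | 'invokeChildWorkflow'

// Binding-key format as given in the spec's predicate.
const bindingKey = (stepId: string): string => `__invokeChildWorkflow:${stepId}`

function isCachedHit(timeline: Timeline, stepId: string, kind: StepKind): boolean {
  const entry = timeline[stepId]
  if (
    entry !== null &&
    typeof entry === 'object' &&
    'output' in entry &&
    (entry as { output: unknown }).output !== undefined
  ) {
    return true // a recorded output means the step already completed
  }
  // An in-flight child run has a binding entry but no output yet.
  return kind === 'invokeChildWorkflow' && timeline[bindingKey(stepId)] !== undefined
}

// Timeline as persisted before this execution started.
const live: Timeline = {
  charge: { output: { ok: true } },
  [bindingKey('call-child')]: { invokeChildWorkflow: {} },
}

// Assumption: the snapshot is a shallow copy taken at handler entry.
const snapshot: Timeline = { ...live }

// A step that finishes during the current execution updates only `live`.
live['send-email'] = { output: 'sent' }
```

In this sketch, `isCachedHit(snapshot, 'send-email', 'run')` stays `false`, so the step that ran in this execution would still get a span, while `charge` and the in-flight `call-child` are treated as replays.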
- -### Packaging - -In `package.json`: - -```json -"peerDependencies": { - "pg": "^8.0.0", - "@opentelemetry/api": "^1.9.0" -}, -"peerDependenciesMeta": { - "@opentelemetry/api": { "optional": true } -} -``` - -The OTel plugin file imports `@opentelemetry/api` directly. Users who never import `otelPlugin` never load this module, so the optional peer never resolves. - -Devs add `@opentelemetry/sdk-trace-base` to `devDependencies` for tests. - -## Testing - -Lives in `src/plugins/otel.test.ts`, runs in the existing unit suite (PGlite-backed). - -Test setup registers a `BasicTracerProvider` with an `InMemorySpanExporter` once per test, asserts against `exporter.getFinishedSpans()`. - -Cases: - -1. **Single-step happy path** — one `step.run` produces exactly 2 spans: `workflow.run` parent + `step.run` child. Attributes match. Both `OK`. -2. **Multi-step with pause** — workflow runs `step1.run` → `step2.waitFor`. First execution emits `workflow.run` + `step1.run` + `step2.waitFor`. `triggerEvent` resumes; second execution emits a new `workflow.run` trace containing only the post-pause work (cached `step1` and the resumed `step2` emit no spans). -3. **Step throws** — `step.run`'s handler throws. The `step.run` span has `ERROR` status with a recorded exception. The error propagates so `run.error` is persisted and pg-boss retry semantics are unchanged. -4. **`invokeChildWorkflow` cache replay** — parent's `step.invokeChildWorkflow` span is emitted on the pause execution. On the resume execution, the binding key is present and the cached output completes, so no span is emitted. -5. **Plugin composition order** — register a trivial second wrap plugin alongside `otelPlugin` (in both orders) and assert wraps compose in `.use()` registration order. -6. **Cache-hit predicate unit test** — direct test of the `isCachedHit` predicate against the timeline shapes produced by each step kind. 
- -## Documentation - -- New "Observability with OpenTelemetry" section in `README.md` with a ~10-line quickstart: register provider → `.use(otelPlugin())` → done. -- JSDoc on `otelPlugin` listing all options and defaults. -- Bullet under "Core API" in `AGENTS.md` pointing to the plugin. - -## Out of scope for v1 - -These items appear in the original issue but are deferred. Documented here so they aren't lost. - -### Metrics - -The issue proposes `pg_workflows.workflow.started`, `pg_workflows.workflow.completed`, `pg_workflows.step.duration`, `pg_workflows.queue.depth`. These use OTel's metrics API (`@opentelemetry/api/metrics`), a separate surface from traces. They can layer onto the same plugin hooks added in v1, so the v1 plugin interface remains forward-compatible. - -`queue.depth` is harder than the rest — pg-boss does not expose a synchronous queue-size primitive; implementing it requires either polling `pgboss.job` or a counter maintained at enqueue/dequeue time. Defer until there is concrete demand. - -### Cross-execution trace context propagation - -When a workflow pauses and resumes, the resume execution gets a fresh root span — there is no link to the previous execution's trace beyond shared `workflow.run_id` attributes. Linking them would require persisting the trace context (`traceparent` header value) somewhere durable, e.g. in `workflow_runs.timeline` or a dedicated column. - -Same for `step.invokeChildWorkflow`: child runs currently start a fresh root span rather than continuing the parent's trace. - -Both deferred together because they share the persistence design question. - -### `engine.startWorkflow` caller context propagation - -When an HTTP request invokes `engine.startWorkflow`, the request's incoming trace context is not propagated into the workflow run. Same persistence question as above; deferred together. 
- -### DLQ span emission - -`handleWorkflowRunDlq` runs outside the workflow's plugin chain (no handler invocation, no `context` object). DLQ-induced FAILED states therefore produce no `workflow.run` span. This is acceptable for v1 because the precipitating error is already recorded on the last per-execution `workflow.run` span via the catch path. Revisit if users report missing visibility on final-failure reconciliation. - -### Sampling, head-based vs tail-based decisions - -The plugin defers to the user's configured `TracerProvider` for sampling. No plugin-level sampling controls in v1. From 3b5eef89073198236ce2bd866f79e8514852bb94 Mon Sep 17 00:00:00 2001 From: Sokratis Vidros Date: Tue, 26 May 2026 10:43:59 +0300 Subject: [PATCH 21/21] build: sync bun.lock with new OpenTelemetry deps CI uses `bun install --frozen-lockfile` so the bun lockfile must match the deps added to package.json. We added @opentelemetry/api, @opentelemetry/context-async-hooks, and @opentelemetry/sdk-trace-base via npm install, which only refreshed package-lock.json. Regenerate bun.lock so CI no longer rejects the install step. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- bun.lock | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/bun.lock b/bun.lock index 4897864..6f93878 100644 --- a/bun.lock +++ b/bun.lock @@ -15,6 +15,9 @@ "devDependencies": { "@biomejs/biome": "^2.3.10", "@electric-sql/pglite": "^0.3.14", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/context-async-hooks": "^1.27.0", + "@opentelemetry/sdk-trace-base": "^1.27.0", "@types/node": "^22.10.2", "@types/pg": "^8.11.10", "bunup": "^0.16.11", @@ -24,8 +27,12 @@ "zod": "^3.24.0", }, "peerDependencies": { + "@opentelemetry/api": "^1.9.0", "pg": "^8.0.0", }, + "optionalPeers": [ + "@opentelemetry/api", + ], }, }, "packages": { @@ -123,6 +130,18 @@ "@napi-rs/wasm-runtime": ["@napi-rs/wasm-runtime@1.1.0", "", { "dependencies": { "@emnapi/core": "^1.7.1", "@emnapi/runtime": "^1.7.1", "@tybys/wasm-util": "^0.10.1" } }, "sha512-Fq6DJW+Bb5jaWE69/qOE0D1TUN9+6uWhCeZpdnSBk14pjLcCWR7Q8n49PTSPHazM37JqrsdpEthXy2xn6jWWiA=="], + "@opentelemetry/api": ["@opentelemetry/api@1.9.1", "", {}, "sha512-gLyJlPHPZYdAk1JENA9LeHejZe1Ti77/pTeFm/nMXmQH/HFZlcS/O2XJB+L8fkbrNSqhdtlvjBVjxwUYanNH5Q=="], + + "@opentelemetry/context-async-hooks": ["@opentelemetry/context-async-hooks@1.30.1", "", { "peerDependencies": { "@opentelemetry/api": ">=1.0.0 <1.10.0" } }, "sha512-s5vvxXPVdjqS3kTLKMeBMvop9hbWkwzBpu+mUO2M7sZtlkyDJGwFe33wRKnbaYDo8ExRVBIIdwIGrqpxHuKttA=="], + + "@opentelemetry/core": ["@opentelemetry/core@1.30.1", "", { "dependencies": { "@opentelemetry/semantic-conventions": "1.28.0" }, "peerDependencies": { "@opentelemetry/api": ">=1.0.0 <1.10.0" } }, "sha512-OOCM2C/QIURhJMuKaekP3TRBxBKxG/TWWA0TL2J6nXUtDnuCtccy49LUJF8xPFXMX+0LMcxFpCo8M9cGY1W6rQ=="], + + "@opentelemetry/resources": ["@opentelemetry/resources@1.30.1", "", { "dependencies": { "@opentelemetry/core": "1.30.1", "@opentelemetry/semantic-conventions": "1.28.0" }, "peerDependencies": { "@opentelemetry/api": ">=1.0.0 <1.10.0" } }, 
"sha512-5UxZqiAgLYGFjS4s9qm5mBVo433u+dSPUFWVWXmLAD4wB65oMCoXaJP1KJa9DIYYMeHu3z4BZcStG3LC593cWA=="], + + "@opentelemetry/sdk-trace-base": ["@opentelemetry/sdk-trace-base@1.30.1", "", { "dependencies": { "@opentelemetry/core": "1.30.1", "@opentelemetry/resources": "1.30.1", "@opentelemetry/semantic-conventions": "1.28.0" }, "peerDependencies": { "@opentelemetry/api": ">=1.0.0 <1.10.0" } }, "sha512-jVPgBbH1gCy2Lb7X0AVQ8XAfgg0pJ4nvl8/IiQA6nxOsPvS+0zMJaFSs2ltXe0J6C8dqjcnpyqINDJmU30+uOg=="], + + "@opentelemetry/semantic-conventions": ["@opentelemetry/semantic-conventions@1.28.0", "", {}, "sha512-lp4qAiMTD4sNWW4DbKLBkfiMZ4jbAboJIGOQr5DvciMRI494OapieI9qiODpOt0XBr1LjIDy1xAGAnVs5supTA=="], + "@oxc-minify/binding-android-arm64": ["@oxc-minify/binding-android-arm64@0.93.0", "", { "os": "android", "cpu": "arm64" }, "sha512-N3j/JoK4hXwQbnyOJoEltM8MEkddWV3XtfYimO6jsMjr5R6QdauKaSVeQHO/lSezB7SFkrMPqr6X7tBfghHiXA=="], "@oxc-minify/binding-darwin-arm64": ["@oxc-minify/binding-darwin-arm64@0.93.0", "", { "os": "darwin", "cpu": "arm64" }, "sha512-kLJJe7uBE+a9ql6eLGAtJ1g1LuEXi4aHbsiu342wGe+wRieSPi/Cx0aeDsQjdetwK5mqJWjWS2FO/n03jiw+IQ=="],