feat: OpenTelemetry instrumentation (closes #34) #36

Open
SokratisVidros wants to merge 20 commits into main from feat/otel-instrumentation

@SokratisVidros
Owner

Summary

Adds a first-party otelPlugin that emits OpenTelemetry spans for workflow and step execution. @opentelemetry/api is an optional peer dependency — users who don't import the plugin pay zero runtime cost.

  • One pg_workflows.workflow.run span per worker execution, with child spans per step kind (step.run, step.waitFor, step.delay, step.waitUntil, step.pause, step.poll, step.invokeChildWorkflow). step.sleep is an alias of step.delay and emits the same pg_workflows.step.delay span.
  • Cache-hit suppression: steps replayed from the timeline after a pause emit no span (with the step.poll exception called out below).
  • Error path: failures get recordException + ERROR status; the original error is re-thrown so engine retry/DLQ behaviour is unchanged.
  • New optional wrap?(context, next) hook on WorkflowPlugin lets any plugin compose middleware around the workflow handler. The engine composes the hooks so that the first-registered plugin is outermost.
  • WorkflowContext gains resourceId (optional) and attempt (required) so plugins can read them without a DB round-trip.
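
A minimal sketch of how a wrap chain like the one described above could be composed. The names `SketchPlugin`, `SketchContext`, `buildWrapChain`, and `tracingPlugin` are hypothetical and simplified for illustration; they are not the engine's actual types or functions. The only behaviour taken from this PR is that `wrap` is optional and that the first-registered plugin ends up outermost.

```typescript
// Hypothetical, simplified shapes. The real WorkflowPlugin and WorkflowContext carry more fields.
type Next = () => Promise<unknown>;

interface SketchContext {
  workflowId: string;
  resourceId?: string; // optional, per the summary
  attempt: number; // required, per the summary
}

interface SketchPlugin {
  name: string;
  wrap?: (context: SketchContext, next: Next) => Promise<unknown>;
}

// Fold from the last plugin to the first so the first-registered plugin is outermost.
function buildWrapChain(plugins: SketchPlugin[], context: SketchContext, handler: Next): Next {
  return plugins.reduceRight<Next>((next, plugin) => {
    const wrap = plugin.wrap;
    if (!wrap) return next; // plugins without a wrap hook are skipped
    return () => wrap(context, next);
  }, handler);
}

// Illustration-only plugin that records when its hook runs.
function tracingPlugin(name: string, log: string[]): SketchPlugin {
  return {
    name,
    wrap: async (_context, next) => {
      log.push(`${name}:before`);
      try {
        return await next();
      } finally {
        log.push(`${name}:after`);
      }
    },
  };
}

async function demoWrapOrder(): Promise<string[]> {
  const log: string[] = [];
  const plugins: SketchPlugin[] = [tracingPlugin("a", log), { name: "no-wrap" }, tracingPlugin("b", log)];
  const run = buildWrapChain(plugins, { workflowId: "wf-1", attempt: 1 }, async () => {
    log.push("handler");
    return "done";
  });
  await run();
  return log;
}
```

With plugins registered as `a`, then `b`, the recorded order is `a:before`, `b:before`, `handler`, `b:after`, `a:after`.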

Design and plan

  • Spec: docs/superpowers/specs/2026-05-21-otel-instrumentation-design.md
  • Plan: docs/superpowers/plans/2026-05-21-otel-instrumentation.md

Out of scope for v1 (explicitly deferred)

  • Metrics (counters / histograms / queue depth)
  • Trace context propagation across child workflows and from external HTTP callers (both need durable storage; deferred together)
  • DLQ-only failure spans
  • step.poll cache-hit suppression — every poll execution emits a span (the test-helper fastForwardWorkflow pre-writes output, so a naive cache-hit guard would suppress legitimate spans). Trade-off documented in the design doc.
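
The cache-hit rule and its step.poll exception can be pictured with a small predicate. This is a sketch under stated assumptions: `TimelineEntry`, `shouldEmitSpan`, and the `Map`-based timeline are hypothetical stand-ins, and the plugin's real isCachedHit predicate handles further cases (for example the binding-key-only case for step.invokeChildWorkflow) that are not modelled here.

```typescript
// Step kinds as listed in the summary.
type StepKind = "run" | "waitFor" | "delay" | "waitUntil" | "pause" | "poll" | "invokeChildWorkflow";

// Hypothetical timeline record: a step that already completed has a stored output.
interface TimelineEntry {
  output?: unknown;
}

// Returns true when a span should be emitted for this step execution.
function shouldEmitSpan(kind: StepKind, stepId: string, timeline: Map<string, TimelineEntry>): boolean {
  // v1 trade-off: every poll execution emits a span, even if output is already present.
  if (kind === "poll") return true;
  const entry = timeline.get(stepId);
  // A stored output means the step is being replayed from the timeline: suppress the span.
  return entry === undefined || entry.output === undefined;
}
```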

Test Plan

  • npm run test:unit — 138 passed, 1 skipped, 1 todo (16 new OTel tests across happy paths, error paths, cache-hit replay, span duration, plugin composition, and the isCachedHit predicate)
  • npm run build — clean
  • npm run lint — clean
  • Manual smoke: import otelPlugin, register a NodeSDK, run a workflow with a step.run + step.waitFor + step.invokeChildWorkflow, inspect traces in your collector of choice
  • Verify zero runtime cost when otelPlugin is not imported (no @opentelemetry/api resolution required)

🤖 Generated with Claude Code

SokratisVidros and others added 20 commits May 22, 2026 13:14
Records the v1 design decisions for adding OTel tracing support: a
first-party plugin shipped from pg-workflows with an optional peer dep on
@opentelemetry/api, a new wrap hook on WorkflowPlugin, per-execution span
lifetime, and cache-hit suppression for replayed steps. Metrics, cross-
execution context propagation, and DLQ spans are explicitly deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step-by-step TDD plan for the design committed in be52240. 15 bite-sized
tasks covering: package wiring, plugin interface extension, engine wrap
chain, OTel plugin (workflow.run + step.* spans, cache-hit suppression,
error path), tests with InMemorySpanExporter, README and AGENTS docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build a wrap chain from each plugin's optional wrap field in reverse
registration order so that the first-registered plugin is outermost.
Add a TDD test asserting the exact before/after call order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Capture startTime before awaiting step.run so spans reflect actual
  step execution time instead of near-zero post-completion duration.
- Save originalErr and re-throw it (not the coerced Error), matching
  the wrap hook pattern and preserving non-Error throw values.
- Add test asserting step.run span duration >= 30ms for a 50ms handler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
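
The timing and re-throw fixes in the commit above can be sketched roughly as follows. `SketchSpan` and `StartSpan` are minimal stand-ins for the parts of an OpenTelemetry span used here, and `traceStep` is a hypothetical helper, not the code in src/plugins/otel.ts. Creating the span after the handler settles is an assumption inferred from the commit message.

```typescript
// Minimal stand-in for an OpenTelemetry span; the real API uses SpanStatusCode rather than strings.
interface SketchSpan {
  recordException(error: Error): void;
  setStatus(status: { code: "OK" | "ERROR"; message?: string }): void;
  end(): void;
}

// Hypothetical factory that opens a span with an explicit start time (epoch milliseconds).
type StartSpan = (name: string, startTime: number) => SketchSpan;

async function traceStep<T>(startSpan: StartSpan, name: string, fn: () => Promise<T>): Promise<T> {
  // Capture the start time before awaiting, so the span covers the handler's real duration.
  const startTime = Date.now();

  let outcome: { ok: true; value: T } | { ok: false; error: unknown };
  try {
    outcome = { ok: true, value: await fn() };
  } catch (originalErr) {
    outcome = { ok: false, error: originalErr };
  }

  const span = startSpan(name, startTime);
  if (outcome.ok) {
    span.setStatus({ code: "OK" });
    span.end();
    return outcome.value;
  }

  // Coerce only for the span; non-Error throw values are preserved for the caller.
  const originalErr = outcome.error;
  const asError = originalErr instanceof Error ? originalErr : new Error(String(originalErr));
  span.recordException(asError);
  span.setStatus({ code: "ERROR", message: asError.message });
  span.end();
  throw originalErr; // re-throw the original value so retry/DLQ logic sees it unchanged
}
```

Re-throwing the original value rather than the coerced Error keeps the instrumentation transparent to the engine: a handler that throws a string still surfaces that same string to the caller.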
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a span for step.invokeChildWorkflow in the OTel plugin, emitting
exactly one span per invocation (first execution only) by detecting both
the cached-output case and the binding-key-only case (parent paused but
child not yet complete) as cache hits on resume.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

step.sleep was not wrapped by the OTel plugin because spreading baseStep
copies the getter's value, not the getter itself — so sleep pointed to
the unwrapped delay. Added sleep to the methods return object, reusing
the 'delay' kind for semantic consistency. Added a unit test that verifies
step.sleep emits a pg_workflows.step.delay span.

Also corrected all snake_case span names in the OTel design spec
(wait_for, wait_until, invoke_child_workflow) to camelCase
(waitFor, waitUntil, invokeChildWorkflow) to match the implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
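
The getter pitfall described in the step.sleep commit above is general JavaScript behaviour and can be reproduced in isolation. The object shapes below are hypothetical and much smaller than the real step API; only the spread-versus-getter behaviour is the point.

```typescript
const baseStep = {
  delay(label: string): string {
    return `base delay: ${label}`;
  },
  // sleep is a getter that resolves to whatever `this.delay` is at access time.
  get sleep(): (label: string) => string {
    return this.delay;
  },
};

const wrappedDelay = (label: string): string => `wrapped delay: ${label}`;

// Spread invokes the getter on baseStep once and copies the resulting value,
// so `sleep` becomes a plain property pointing at the unwrapped delay.
const naive = { ...baseStep, delay: wrappedDelay };

// Listing `sleep` explicitly makes the alias point at the wrapped method.
const fixed = { ...baseStep, delay: wrappedDelay, sleep: wrappedDelay };

const naiveResult = naive.sleep("1s"); // "base delay: 1s"
const fixedResult = fixed.sleep("1s"); // "wrapped delay: 1s"
```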
Move the OTel design and plan files out of the repo — they were
development-process metadata, not user-facing docs. Their concrete output
lives in src/plugins/otel.ts and is exercised by the test suite. Add a
public docs/observability.md page covering span hierarchy, attributes,
cache-hit semantics, plugin composition, options, error semantics, and
explicit v1 deferrals. Wire the page into the README documentation index
and fix the design-doc link in the Observability section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SokratisVidros force-pushed the feat/otel-instrumentation branch from 126bbdd to b3b3244 on May 22, 2026 10:18