
First-class source attribution: total-by-default node annotations with edge-aware lineage #147

@xujustinj


Summary

Make source attribution / provenance a first-class part of the node contract: every node output carries annotations describing where its data came from, defaulting to a sound coarse mapping (every output field derives from all input fields and params) that the engine synthesizes automatically. Because the engine owns the edge graph, per-node annotations then compose into end-to-end workflow lineage for free.

Motivation

We want to answer "where did this output value come from?" for any field a workflow produces — back through the nodes that touched it, to the workflow's inputs and any external sources (URLs, models, APIs, DBs).

A purely node-local approach (each node records its own provenance as a side effect) produces isolated annotation lists keyed by node id with no way to compose them. The interesting capability — tracing a final output transitively across the DAG — is only possible in the engine, which already knows the edges between node outputs and downstream node inputs.

Proposed design

1. Annotation data model (provenance-only for v1)

A frozen Pydantic model, e.g. NodeOutputAnnotation:

  • output_path: path into the node's output (list[PathComponent], where PathComponent = str | int | SpanComponent) identifying which output field/sub-region the annotation describes.
  • sources: list of Source, each with:
    • root: a discriminated union of source roots — InputRoot(input=<field>), ParamRoot(param=<name>), and external roots like UrlRoot, ModelRoot, ApiRoot, DbRoot.
    • path: path into that source.
    • verbatim: bool, confidence: float | None.

PathComponent is an extensible discriminated union (by type), so domain-specific components (e.g. bounding boxes for document/OCR nodes) can be added downstream without touching core. Governance/classification concerns (data scope, PII categories, etc.) are intentionally out of scope for v1 — they can propagate later through the same lineage edges.
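A minimal sketch of how the model above might look. To stay dependency-free it uses frozen standard-library dataclasses as a stand-in for the frozen Pydantic model the proposal calls for. The `type` tag fields, the `start`/`end` shape of `SpanComponent`, and the `url` field on `UrlRoot` are assumptions for illustration, and only one external root is shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Optional, Union


@dataclass(frozen=True)
class SpanComponent:
    # Assumed shape: a half-open character range [start, end) in a string value.
    start: int
    end: int
    type: Literal["span"] = "span"


# A path step is a field name, a list index, or a span.
PathComponent = Union[str, int, SpanComponent]


@dataclass(frozen=True)
class InputRoot:
    input: str
    type: Literal["input"] = "input"


@dataclass(frozen=True)
class ParamRoot:
    param: str
    type: Literal["param"] = "param"


@dataclass(frozen=True)
class UrlRoot:
    url: str  # assumed field name; ModelRoot, ApiRoot, DbRoot would be analogous
    type: Literal["url"] = "url"


# The `type` tag plays the role of the discriminator in the union.
SourceRoot = Union[InputRoot, ParamRoot, UrlRoot]


@dataclass(frozen=True)
class Source:
    root: SourceRoot
    path: tuple[PathComponent, ...] = ()
    verbatim: bool = False
    confidence: Optional[float] = None


@dataclass(frozen=True)
class NodeOutputAnnotation:
    output_path: tuple[PathComponent, ...]
    sources: tuple[Source, ...]


# Characters 7..12 of the "greeting" output were copied verbatim from input "name".
annotation = NodeOutputAnnotation(
    output_path=("greeting", SpanComponent(start=7, end=12)),
    sources=(Source(root=InputRoot(input="name"), verbatim=True),),
)
```

Tuples are used instead of lists here so that annotations stay hashable and immutable; a Pydantic version could keep `list[...]` and rely on `frozen=True` in the model config.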

2. Total-by-default, engine-synthesized

Do not require node authors to write annotations. The engine knows each node's input_type, output_type, and param fields, so for any node that supplies nothing it synthesizes the coarse default:

every output_field ← (all input_fields ∪ all param_fields), verbatim=False, no confidence.

Honesty caveats to document:

  • The default is sound but imprecise: it never under-reports declared inputs, but it cannot infer spans, nor that a given output came from only a subset of inputs.
  • External roots cannot be defaulted — the engine can't know a node hit a URL/API/model. Precision and external-source citation are always opt-in.
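The coarse default can be sketched as a small pure function. This is an illustration only: `synthesize_default` and the simplified `Source` record (a `kind` plus a `name`) are hypothetical, and field names are passed in as plain lists rather than read from a node's input_type, output_type, and params.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    kind: str  # "input" or "param" (external roots are never defaulted)
    name: str
    verbatim: bool = False
    confidence: Optional[float] = None


def synthesize_default(input_fields, param_fields, output_fields):
    """Map every output field to all input fields and all param fields."""
    sources = tuple(
        [Source("input", name) for name in input_fields]
        + [Source("param", name) for name in param_fields]
    )
    # Every output field shares the same coarse source set.
    return {out: sources for out in output_fields}


defaults = synthesize_default(
    input_fields=["title", "body"],
    param_fields=["max_length"],
    output_fields=["summary", "word_count"],
)
```

Both `summary` and `word_count` end up citing `title`, `body`, and `max_length`, with `verbatim=False` and no confidence, which is the sound-but-imprecise behavior described above.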

3. Node contract (non-breaking)

Let run() return either its bare output Data (as today) or a small wrapper carrying (output, annotations). The executor (Node.execute, src/workflow_engine/core/node.py):

  1. bare output → synthesize the coarse default;
  2. wrapper → use the author's annotations, synthesizing defaults for any uncited output field.

This keeps every existing run() method working unchanged: each still returns bare output and simply starts receiving sound coarse provenance for free. Precision nodes (templating, extraction, etc.) opt into the wrapper to declare spans and external sources.

Decided: non-breaking union (return type becomes Output | AnnotatedOutput). Rationale: existing run() implementations keep working unchanged, and node authors who don't care about provenance never have to think about the wrapper — they return bare output and get the sound coarse default for free. Keeping the simple path simple for casual node authors outweighs the slightly less crisp return type of always-return-the-wrapper.
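The two executor branches could be normalized roughly as follows. This is a sketch under simplifying assumptions, not the actual Node.execute code: outputs are modeled as plain dicts, sources as string labels such as "input:x", and `normalize` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedOutput:
    output: dict
    annotations: dict  # output field -> tuple of source labels


def normalize(result, input_fields, param_fields):
    """Return (output, annotations) for either kind of run() result."""
    if isinstance(result, AnnotatedOutput):
        output, cited = result.output, dict(result.annotations)
    else:
        output, cited = result, {}
    # Coarse default: all inputs plus all params.
    default = tuple(
        [f"input:{name}" for name in input_fields]
        + [f"param:{name}" for name in param_fields]
    )
    # Author annotations win; only uncited output fields get the default.
    for field in output:
        cited.setdefault(field, default)
    return output, cited


# Branch 1: bare output, so every field gets the coarse default.
_, bare_ann = normalize({"a": 1, "b": 2}, ["x"], ["p"])

# Branch 2: the author cites "a" precisely; "b" falls back to the default.
wrapped = AnnotatedOutput(
    output={"a": 1, "b": 2},
    annotations={"a": ("url:https://example.com",)},
)
_, wrapped_ann = normalize(wrapped, ["x"], ["p"])
```

Copying the author's annotation dict before filling in defaults keeps the wrapper the node returned unmodified.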

4. Edge-aware lineage composition (the payoff)

Add a resolver on the validated workflow / execution result that, given a node's InputRoot(input="x"), resolves "x" to the upstream (node_id, output_field) feeding that edge, and recurses — yielding transitive lineage from any final output back to workflow inputs and external roots. This also makes subworkflow expansion tractable: an expanded node's outer-output annotations are the inner-node annotations resolved against the outer node's input edges.
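A rough sketch of the recursion, using plain dicts in place of the engine's workflow and result types. The data shapes are assumptions: `edges` maps (node_id, input_field) to the upstream (node_id, output_field), `annotations` maps (node_id, output_field) to a list of (kind, name) roots, and an input with no incoming edge is treated as a workflow input. The node ids and URL are made up for the example.

```python
def resolve(node_id, output_field, annotations, edges):
    """Return the set of leaf sources that (node_id, output_field) derives from."""
    leaves = set()
    for kind, name in annotations[(node_id, output_field)]:
        if kind == "input" and (node_id, name) in edges:
            # Follow the edge upstream and recurse; this terminates on a DAG.
            up_node, up_field = edges[(node_id, name)]
            leaves |= resolve(up_node, up_field, annotations, edges)
        elif kind == "input":
            # Assumption: no incoming edge means the value is a workflow input.
            leaves.add(("workflow_input", name))
        elif kind == "param":
            leaves.add(("param", f"{node_id}.{name}"))
        else:
            # External roots (url, model, api, db) are leaves as-is.
            leaves.add((kind, name))
    return leaves


annotations = {
    ("fetch", "body"): [("url", "https://example.com"), ("input", "query")],
    ("render", "text"): [("input", "doc"), ("param", "template")],
}
edges = {("render", "doc"): ("fetch", "body")}

lineage = resolve("render", "text", annotations, edges)
```

Here `render.text` traces back through `fetch.body` to the external URL and the workflow input `query`, plus the `template` param on `render` itself. A production resolver would likely memoize per (node_id, output_field) so shared upstream nodes are not re-walked.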

Lifecycle / ownership notes

  • The executor should own annotation init/teardown around each node's execution rather than relying on a context hook the subclass must remember to call super() on.
  • Decide how annotations are surfaced to callers: extend the relevant hook signature and/or add an annotations field to WorkflowExecutionResult.
  • Decide behavior on short-circuit / cached runs (when a context returns a memoized output instead of running the node).

Suggested sequencing

  1. Core annotation model + NodeOutput/AnnotatedOutput wrapper + executor synthesis of the coarse default. No existing node changes.
  2. Edge-aware lineage resolver on the workflow/result (the payoff).
  3. Opt-in precision: convert a templating/extraction node to emit spans + external roots; expose annotations on WorkflowExecutionResult; allow downstream extension of the PathComponent union; revisit classification/governance propagation.

Open questions

  • Crisp but breaking vs. non-breaking return contract (see §3). Resolved: non-breaking union.
  • Should the engine validate the head of output_path against actual output field names?
  • How are annotations surfaced to external callers (hook signature vs. result field)?

Labels: enhancement
