Summary
Make source attribution / provenance a first-class part of the node contract: every node output carries annotations describing where its data came from, defaulting to a sound coarse mapping (every output field derives from all input fields and params) that the engine synthesizes automatically. Because the engine owns the edge graph, per-node annotations then compose into end-to-end workflow lineage for free.
Motivation
We want to answer "where did this output value come from?" for any field a workflow produces — back through the nodes that touched it, to the workflow's inputs and any external sources (URLs, models, APIs, DBs).
A purely node-local approach (each node records its own provenance as a side effect) produces isolated annotation lists keyed by node id with no way to compose them. The interesting capability — tracing a final output transitively across the DAG — is only possible in the engine, which already knows the edges between node outputs and downstream node inputs.
Proposed design
1. Annotation data model (provenance-only for v1)
A frozen Pydantic model, e.g. NodeOutputAnnotation:
- output_path: path into the node's output (list[PathComponent], where PathComponent = str | int | SpanComponent) identifying which output field/sub-region the annotation describes.
- sources: list of Source, each with:
  - root: a discriminated union of source roots: InputRoot(input=<field>), ParamRoot(param=<name>), and external roots like UrlRoot, ModelRoot, ApiRoot, DbRoot.
  - path: path into that source.
  - verbatim: bool, confidence: float | None.
PathComponent is an extensible discriminated union (by type), so domain-specific components (e.g. bounding boxes for document/OCR nodes) can be added downstream without touching core. Governance/classification concerns (data scope, PII categories, etc.) are intentionally out of scope for v1 — they can propagate later through the same lineage edges.
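To make the shape of the model concrete, here is a minimal sketch. It uses stdlib frozen dataclasses as a stand-in for the frozen Pydantic model so that it runs without dependencies. The field names on SpanComponent (start, end) and UrlRoot (url) are assumptions, as is the placeholder URL; only one external root is shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class SpanComponent:
    """Assumed shape: a half-open [start, end) region of a string or list."""
    start: int
    end: int


# A path step is a field name, a list index, or a span.
PathComponent = Union[str, int, SpanComponent]


@dataclass(frozen=True)
class InputRoot:
    input: str


@dataclass(frozen=True)
class ParamRoot:
    param: str


@dataclass(frozen=True)
class UrlRoot:
    url: str  # assumed field name; ModelRoot, ApiRoot, DbRoot would be analogous


SourceRoot = Union[InputRoot, ParamRoot, UrlRoot]


@dataclass(frozen=True)
class Source:
    root: SourceRoot
    path: Tuple[PathComponent, ...] = ()
    verbatim: bool = False
    confidence: Optional[float] = None


@dataclass(frozen=True)
class NodeOutputAnnotation:
    output_path: Tuple[PathComponent, ...]
    sources: Tuple[Source, ...]


# "The first 12 characters of output.summary were copied verbatim from
# characters 40..52 of input.document.text, with a URL as a secondary source."
annotation = NodeOutputAnnotation(
    output_path=("summary", SpanComponent(0, 12)),
    sources=(
        Source(
            root=InputRoot(input="document"),
            path=("text", SpanComponent(40, 52)),
            verbatim=True,
        ),
        Source(root=UrlRoot(url="https://example.com/style-guide"), confidence=0.5),
    ),
)
```

Tuples are used instead of lists here only so that the frozen instances stay hashable; a Pydantic version could keep list[PathComponent] as described above.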
2. Total-by-default, engine-synthesized
Do not require node authors to write annotations. The engine knows each node's input_type, output_type, and param fields, so for any node that supplies nothing it synthesizes the coarse default:
every output_field ← (all input_fields ∪ all param_fields), verbatim=False, no confidence.
Honesty caveats to document:
- The default is sound but imprecise: it never under-reports declared inputs, but cannot infer spans or that a given output came from only a subset of inputs.
- External roots cannot be defaulted — the engine can't know a node hit a URL/API/model. Precision and external-source citation are always opt-in.
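The coarse default described above can be sketched as a small pure function. This is an illustration under assumptions: synthesize_default is a hypothetical name, field names are passed in as plain strings rather than read from input_type/output_type, and stdlib dataclasses stand in for the Pydantic models.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class InputRoot:
    input: str


@dataclass(frozen=True)
class ParamRoot:
    param: str


@dataclass(frozen=True)
class Source:
    root: Union[InputRoot, ParamRoot]
    verbatim: bool = False
    confidence: Optional[float] = None


@dataclass(frozen=True)
class NodeOutputAnnotation:
    output_path: Tuple[str, ...]
    sources: Tuple[Source, ...]


def synthesize_default(
    input_fields: Sequence[str],
    param_fields: Sequence[str],
    output_fields: Sequence[str],
) -> List[NodeOutputAnnotation]:
    """Hypothetical helper: every output field derives from all inputs and params."""
    sources = tuple(Source(InputRoot(f)) for f in input_fields) + tuple(
        Source(ParamRoot(p)) for p in param_fields
    )
    # One annotation per top-level output field; no spans, no external roots.
    return [NodeOutputAnnotation((field,), sources) for field in output_fields]


defaults = synthesize_default(
    input_fields=["a", "b"],
    param_fields=["sep"],
    output_fields=["joined", "count"],
)
```

Note that "count" is attributed to "sep" even if the node never used it for that field. That is the imprecision the first caveat refers to: the mapping over-reports rather than under-reports.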
3. Node contract (non-breaking)
Let run() return either its bare output Data (as today) or a small wrapper carrying (output, annotations). The executor (Node.execute, src/workflow_engine/core/node.py):
- bare output → synthesize the coarse default;
- wrapper → use the author's annotations, synthesizing defaults for any uncited output field.
This keeps every existing run() method compiling and returning bare output — they simply start getting sound coarse provenance for free — while precision nodes (templating, extraction, etc.) opt into the wrapper to declare spans/external sources.
Decided: non-breaking union (return type becomes Output | AnnotatedOutput). Rationale: existing run() implementations keep working unchanged, and node authors who don't care about provenance never have to think about the wrapper. Keeping the simple path simple for casual node authors outweighs the crisper return type that an always-return-the-wrapper contract would provide.
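A minimal sketch of how the executor might normalize the two return shapes. Everything here is simplified for illustration: normalize is a hypothetical helper, a plain dict stands in for the output Data model, and sources are reduced to (kind, name) pairs rather than full Source objects.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple, Union

# Simplified stand-in: a source is a (kind, name) pair such as ("input", "text").
SourceRef = Tuple[str, str]


@dataclass(frozen=True)
class Annotation:
    output_path: Tuple[str, ...]
    sources: Tuple[SourceRef, ...]


@dataclass(frozen=True)
class AnnotatedOutput:
    """Assumed wrapper shape that a precision node returns instead of bare output."""
    output: Dict[str, Any]
    annotations: Tuple[Annotation, ...]


RunResult = Union[Dict[str, Any], AnnotatedOutput]


def normalize(
    result: RunResult,
    input_fields: Sequence[str],
    param_fields: Sequence[str],
) -> Tuple[Dict[str, Any], List[Annotation]]:
    """Return (output, annotations), filling uncited fields with the coarse default."""
    if isinstance(result, AnnotatedOutput):
        output, authored = result.output, list(result.annotations)
    else:
        output, authored = result, []

    # An output field counts as cited if any authored annotation starts at it.
    cited = {a.output_path[0] for a in authored if a.output_path}
    coarse = tuple(("input", f) for f in input_fields) + tuple(
        ("param", p) for p in param_fields
    )
    defaults = [Annotation((field,), coarse) for field in output if field not in cited]
    return output, authored + defaults


# Bare output: every field gets the coarse default.
_, bare = normalize({"text": "hi"}, ["name"], ["template"])

# Wrapper: "text" is cited by the author, "length" falls back to the default.
wrapped = AnnotatedOutput(
    output={"text": "hi", "length": 2},
    annotations=(Annotation(("text",), (("param", "template"),)),),
)
_, mixed = normalize(wrapped, ["name"], ["template"])
```

The isinstance check is the only place the union leaks into the executor; node authors on either path never see it.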
4. Edge-aware lineage composition (the payoff)
Add a resolver on the validated workflow / execution result that, given a node's InputRoot(input="x"), resolves "x" to the upstream (node_id, output_field) feeding that edge, and recurses — yielding transitive lineage from any final output back to workflow inputs and external roots. This also makes subworkflow expansion tractable: an expanded node's outer-output annotations are the inner-node annotations resolved against the outer node's input edges.
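The recursion can be sketched over plain dictionaries. This is an illustration under assumptions, not the engine's actual data structures: resolve_lineage and the WORKFLOW pseudo-node id are hypothetical, annotations are reduced to (kind, name) pairs per output field, and paths/spans are ignored.

```python
from typing import Dict, List, Optional, Set, Tuple

Root = Tuple[str, str]      # (kind, name), e.g. ("input", "text") or ("url", "...")
FieldRef = Tuple[str, str]  # (node_id, field)

# Hypothetical pseudo-node id marking an edge that comes from a workflow input.
WORKFLOW = "$workflow"


def resolve_lineage(
    target: FieldRef,
    annotations: Dict[str, Dict[str, List[Root]]],
    edges: Dict[FieldRef, FieldRef],
    _seen: Optional[Set[FieldRef]] = None,
) -> Set[Root]:
    """Collect the workflow inputs and external roots behind one output field."""
    node_id, field = target
    if node_id == WORKFLOW:
        return {("workflow_input", field)}

    seen = _seen if _seen is not None else set()
    if target in seen:
        # Already expanded on another branch; its leaves are in the final union.
        return set()
    seen.add(target)

    leaves: Set[Root] = set()
    for kind, name in annotations[node_id][field]:
        if kind == "input":
            # Follow the edge feeding this input to the upstream output field.
            leaves |= resolve_lineage(edges[(node_id, name)], annotations, edges, seen)
        elif kind == "param":
            leaves.add(("param", f"{node_id}.{name}"))
        else:
            leaves.add((kind, name))  # external root: url, model, api, db
    return leaves


annotations = {
    "fetch": {"body": [("input", "url"), ("url", "https://example.com/doc")]},
    "summarize": {"summary": [("input", "text"), ("model", "hypothetical-summarizer")]},
}
edges = {
    ("fetch", "url"): (WORKFLOW, "source_url"),
    ("summarize", "text"): ("fetch", "body"),
}
lineage = resolve_lineage(("summarize", "summary"), annotations, edges)
```

Here lineage contains the workflow input source_url plus the URL and model roots. The shared seen set keeps diamond-shaped graphs from being expanded twice.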
Lifecycle / ownership notes
- The executor should own annotation init/teardown around each node's execution rather than relying on a context hook that subclasses must remember to call super() on.
- Decide how annotations are surfaced to callers: extend the relevant hook signature and/or add an annotations field to WorkflowExecutionResult.
- Decide behavior on short-circuit / cached runs (when a context returns a memoized output instead of running the node).
Suggested sequencing
- Core annotation model + NodeOutput/AnnotatedOutput wrapper + executor synthesis of the coarse default. No existing node changes.
- Edge-aware lineage resolver on the workflow/result (the payoff).
- Opt-in precision: convert a templating/extraction node to emit spans + external roots; expose annotations on WorkflowExecutionResult; allow downstream extension of the PathComponent union; revisit classification/governance propagation.
Open questions
- Crisp but breaking vs. non-breaking return contract (see §3). Resolved: non-breaking union.
- Should the engine validate the head of output_path against actual output field names?
- How are annotations surfaced to external callers (hook signature vs. result field)?