First-class source attribution: total-by-default node annotations with edge-aware lineage #147

## Summary

Make **source attribution / provenance** a first-class part of the node contract: every node output carries annotations describing where its data came from, defaulting to a sound coarse mapping (*every output field derives from all input fields and params*) that the engine synthesizes automatically. Because the engine owns the edge graph, per-node annotations then **compose into end-to-end workflow lineage** for free.

## Motivation

We want to answer "where did this output value come from?" for any field a workflow produces — back through the nodes that touched it, to the workflow's inputs and any external sources (URLs, models, APIs, DBs).

A purely node-local approach (each node records its own provenance as a side effect) produces **isolated annotation lists keyed by node id with no way to compose them**. The interesting capability — tracing a final output transitively across the DAG — is only possible in the engine, which already knows the edges between node outputs and downstream node inputs.

## Proposed design

### 1. Annotation data model (provenance-only for v1)

A frozen Pydantic model, e.g. `NodeOutputAnnotation`:

- `output_path`: path into the node's output (`list[PathComponent]`, where `PathComponent = str | int | SpanComponent`) identifying which output field/sub-region the annotation describes.
- `sources`: list of `Source`, each with:
  - `root`: a discriminated union of source roots — `InputRoot(input=<field>)`, `ParamRoot(param=<name>)`, and external roots like `UrlRoot`, `ModelRoot`, `ApiRoot`, `DbRoot`.
  - `path`: path into that source.
  - `verbatim: bool`, `confidence: float | None`.

`PathComponent` is an extensible discriminated union (by `type`), so domain-specific components (e.g. bounding boxes for document/OCR nodes) can be added downstream without touching core. Governance/classification concerns (data scope, PII categories, etc.) are intentionally **out of scope for v1** — they can propagate later through the same lineage edges.
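
A minimal sketch of the shape described above, not the proposed implementation. The proposal calls for a frozen Pydantic model; this sketch uses stdlib frozen dataclasses so it stays dependency-free. The fields of `SpanComponent` and `UrlRoot` are assumptions, the other external roots are omitted for brevity, and tuples stand in for lists so instances remain hashable.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class SpanComponent:
    """Hypothetical sub-region of a string value (half-open character range)."""
    start: int
    end: int


# Field name, list index, or a richer component such as a span.
PathComponent = Union[str, int, SpanComponent]


@dataclass(frozen=True)
class InputRoot:
    input: str  # name of one of the node's input fields


@dataclass(frozen=True)
class ParamRoot:
    param: str  # name of one of the node's params


@dataclass(frozen=True)
class UrlRoot:
    url: str  # assumed field; the proposal does not specify external root fields


SourceRoot = Union[InputRoot, ParamRoot, UrlRoot]


@dataclass(frozen=True)
class Source:
    root: SourceRoot
    path: Tuple[PathComponent, ...] = ()
    verbatim: bool = False
    confidence: Optional[float] = None


@dataclass(frozen=True)
class NodeOutputAnnotation:
    output_path: Tuple[PathComponent, ...]
    sources: Tuple[Source, ...]


# Example: the first 42 characters of `summary` were copied verbatim from
# `document.body`, and the field as a whole also drew on an external URL.
annotation = NodeOutputAnnotation(
    output_path=("summary", SpanComponent(0, 42)),
    sources=(
        Source(root=InputRoot(input="document"), path=("body",), verbatim=True),
        Source(root=UrlRoot(url="https://example.com/spec")),
    ),
)
```

In a Pydantic version, each root and each `PathComponent` variant would presumably carry a literal `type` field to serve as the discriminator.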

### 2. Total-by-default, engine-synthesized

Do **not** require node authors to write annotations. The engine knows each node's `input_type`, `output_type`, and param fields, so for any node that supplies nothing it synthesizes the coarse default:

> every `output_field` ← (all `input_fields` ∪ all `param_fields`), `verbatim=False`, no confidence.

Honesty caveats to document:

- The default is **sound but imprecise**: it never under-reports declared inputs, but cannot infer spans or that a given output came from only a subset of inputs.
- **External roots cannot be defaulted** — the engine can't know a node hit a URL/API/model. Precision and external-source citation are always opt-in.
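
The coarse default can be sketched as a pure function over field names. This is an illustration under simplifying assumptions: `synthesize_default` is a hypothetical name, the field-name lists would in practice be derived from the node's `input_type`, `output_type`, and param fields, and the small dataclasses stand in for the annotation model.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class InputRoot:
    input: str


@dataclass(frozen=True)
class ParamRoot:
    param: str


@dataclass(frozen=True)
class Source:
    root: Union[InputRoot, ParamRoot]
    verbatim: bool = False
    confidence: Optional[float] = None


@dataclass(frozen=True)
class NodeOutputAnnotation:
    output_path: Tuple[str, ...]
    sources: Tuple[Source, ...]


def synthesize_default(
    input_fields: Sequence[str],
    param_fields: Sequence[str],
    output_fields: Sequence[str],
) -> List[NodeOutputAnnotation]:
    """Every output field derives from all input fields and all params."""
    sources = tuple(Source(InputRoot(name)) for name in input_fields) + tuple(
        Source(ParamRoot(name)) for name in param_fields
    )
    # One annotation per top-level output field; no spans, no external roots.
    return [
        NodeOutputAnnotation(output_path=(field,), sources=sources)
        for field in output_fields
    ]


defaults = synthesize_default(
    input_fields=["text", "language"],
    param_fields=["template"],
    output_fields=["rendered", "token_count"],
)
```

Each of the two resulting annotations cites the same three sources, which is exactly the imprecision the caveats describe: `token_count` is reported as depending on `template` whether or not it actually does.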

### 3. Node contract (non-breaking)

Let `run()` return *either* its bare output `Data` (as today) **or** a small wrapper carrying `(output, annotations)`. The executor (`Node.execute`, `src/workflow_engine/core/node.py`):

1. bare output → synthesize the coarse default;
2. wrapper → use the author's annotations, synthesizing defaults for any uncited output field.

This keeps every existing `run()` method working unchanged: it still returns bare output and simply starts getting sound coarse provenance for free. Precision nodes (templating, extraction, etc.) opt into the wrapper to declare spans and external sources.

> **Decided: non-breaking union** (return type becomes `Output | AnnotatedOutput`). Rationale: existing `run()` implementations keep working unchanged, and node authors who don't care about provenance never have to think about the wrapper — they return bare output and get the sound coarse default for free. Keeping the simple path simple for casual node authors outweighs the slightly less crisp return type of always-return-the-wrapper.
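
A rough sketch of how the executor might normalize the two return shapes. Everything here is simplified for illustration: a plain `dict` stands in for the output `Data`, sources are flattened to strings such as `"input:text"`, and `normalize` and this `Annotation` class are hypothetical names rather than existing engine APIs.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Tuple, Union


@dataclass(frozen=True)
class Annotation:
    output_field: str
    sources: Tuple[str, ...]  # simplified: "input:<name>", "param:<name>", "url:<...>"


@dataclass(frozen=True)
class AnnotatedOutput:
    output: Dict[str, Any]
    annotations: Tuple[Annotation, ...]


def normalize(
    result: Union[Dict[str, Any], AnnotatedOutput],
    input_fields: Sequence[str],
    param_fields: Sequence[str],
) -> Tuple[Dict[str, Any], List[Annotation]]:
    """Accept bare or wrapped output; return output plus total annotations."""
    if isinstance(result, AnnotatedOutput):
        output, authored = result.output, list(result.annotations)
    else:
        output, authored = result, []

    # Coarse default: all inputs and all params.
    coarse = tuple(f"input:{n}" for n in input_fields) + tuple(
        f"param:{n}" for n in param_fields
    )
    # Fill in the default only for output fields the author did not cite.
    cited = {a.output_field for a in authored}
    filled = [Annotation(field, coarse) for field in output if field not in cited]
    return output, authored + filled


# Case 1: an existing node returning bare output.
_, bare_annotations = normalize({"a": 1, "b": 2}, ["x"], ["p"])

# Case 2: a precision node that cites only field "a".
wrapped = AnnotatedOutput(
    output={"a": 1, "b": 2},
    annotations=(Annotation("a", ("url:https://example.com",)),),
)
_, mixed_annotations = normalize(wrapped, ["x"], ["p"])
```

In the second case the author's annotation for `a` is kept as written and `b` receives the coarse default, so the result is total either way.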

### 4. Edge-aware lineage composition (the payoff)

Add a resolver on the validated workflow / execution result that, given a node's `InputRoot(input="x")`, resolves `"x"` to the upstream `(node_id, output_field)` feeding that edge, and recurses — yielding transitive lineage from any final output back to workflow inputs and external roots. This also makes **subworkflow expansion** tractable: an expanded node's outer-output annotations are the inner-node annotations resolved against the outer node's input edges.
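
The recursion can be sketched over two lookup tables. This is a sketch under stated assumptions, not the engine's actual data structures: `annotations` maps `(node_id, output_field)` to a list of `(kind, name)` roots, `edges` maps `(node_id, input_field)` to the upstream `(node_id, output_field)`, and an input with no incoming edge is treated as a workflow input. Paths, spans, and subworkflow expansion are left out.

```python
from typing import Dict, List, Optional, Set, Tuple

Key = Tuple[str, str]   # (node_id, field)
Root = Tuple[str, str]  # (kind, name), e.g. ("input", "text") or ("url", "...")


def resolve_lineage(
    node_id: str,
    output_field: str,
    annotations: Dict[Key, List[Root]],
    edges: Dict[Key, Key],
    _seen: Optional[Set[Key]] = None,
) -> Set[Root]:
    """Return the leaf roots (workflow inputs, external roots) behind a field."""
    seen = set() if _seen is None else _seen
    key = (node_id, output_field)
    if key in seen:  # already expanded on another branch of the DAG
        return set()
    seen.add(key)

    leaves: Set[Root] = set()
    for kind, name in annotations.get(key, []):
        if kind != "input":
            # Params and external roots terminate the walk.
            leaves.add((kind, name))
            continue
        upstream = edges.get((node_id, name))
        if upstream is None:
            # No incoming edge: assume this input is fed by the workflow itself.
            leaves.add(("workflow_input", f"{node_id}.{name}"))
        else:
            leaves |= resolve_lineage(
                upstream[0], upstream[1], annotations, edges, seen
            )
    return leaves


# Hypothetical two-node workflow: fetch -> summarize.
annotations = {
    ("fetch", "page"): [("input", "url"), ("url", "https://example.com")],
    ("summarize", "summary"): [("input", "text"), ("model", "hypothetical-model")],
}
edges = {("summarize", "text"): ("fetch", "page")}

lineage = resolve_lineage("summarize", "summary", annotations, edges)
```

Here `lineage` contains the workflow input `fetch.url`, the URL cited by `fetch`, and the model cited by `summarize`. The shared `seen` set keeps diamond-shaped graphs from being expanded twice; the top-level result is still complete because every reachable field contributes its leaves the first time it is visited.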

## Lifecycle / ownership notes

- The executor should own annotation init/teardown around each node's execution, rather than relying on a context hook whose subclass overrides must remember to call `super()`.
- Decide how annotations are surfaced to callers: extend the relevant hook signature and/or add an `annotations` field to `WorkflowExecutionResult`.
- Decide behavior on short-circuit / cached runs (when a context returns a memoized output instead of running the node).

## Suggested sequencing

1. Core annotation model + `NodeOutput`/`AnnotatedOutput` wrapper + executor synthesis of the coarse default. No existing node changes.
2. Edge-aware lineage resolver on the workflow/result. **(the payoff)**
3. Opt-in precision: convert a templating/extraction node to emit spans + external roots; expose annotations on `WorkflowExecutionResult`; allow downstream extension of the `PathComponent` union; revisit classification/governance propagation.

## Open questions

- ~~Crisp-breaking vs. non-breaking return contract (see §3).~~ **Resolved: non-breaking union.**
- Should the engine validate the head of `output_path` against actual output field names?
- How are annotations surfaced to external callers (hook signature vs. result field)?

