Skip to content

Tiinex/ai-provenance

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tiinex AI Provenance

License: Apache 2.0

Contribution guide:

ai-provenance is the home for provenance-first tooling that should remain useful even after it is separated from the more experimental, VS Code-specific workflow tooling in ai-vscode-tools.

Current Status

As of May 2026, this repo includes a buildable VS Code extension package under ides/vscode, and the provenance-side TRACEABLE surface has now moved there in practice.

Current repo state:

  • ides/vscode is a real VS Code extension package with test, VSIX packaging, and release scripts
  • the provenance-side LM tool surface now includes list_traceable_agents, list_traceable_models, view_traceable_subagent, and run_traceable_subagent
  • .trace.md evidence parsing, bounded evidence inspection, reconstructed viewer UX, and optional evidence export now live on the provenance side
  • the current Windows host has been revalidated for bounded run_traceable_subagent use together with optional evidence export and evidence viewing

The strongest provenance-oriented value in the current toolchain is now here: bounded request/result semantics, optional .trace.md evidence generation plus inspection, and a receiver-safe path between raw markdown source and reconstructed TRACEABLE evidence reading.

Intended Scope

This repo is meant to hold provenance-first ecosystem tooling, starting with VS Code and leaving room for future IDE support later.

Near-term scope:

  • provenance artifact generation
  • provenance artifact inspection
  • bounded evidence UX around .trace.md artifacts
  • stable request/result semantics for provenance-focused tools
  • bounded traceable agent and model discovery for the current runtime surface

Current out-of-scope areas:

  • VS Code Local-chat session-store inspection
  • destructive delete flows tied to current VS Code artifacts
  • exact offline cleanup hacks
  • live-chat transport or targeting logic that still depends on VS Code host quirks

VS Code First, Not VS Code Only

The first delivery target is VS Code, but the repo name is intentionally broader because the long-term goal is provenance tooling that can be exposed through one or more IDE packages rather than remaining permanently fused to one experimental workflow repo.

Repo Layout Rule

When a repo does not have vscode in its name, IDE-specific implementation should live under ides/<ide> rather than taking over the repo root.

Current rule application here:

  • shared repo docs, assets, and migration notes stay in the repo root
  • the actual VS Code extension package lives in ides/vscode
  • future IDE ports should follow the same pattern, for example ides/<future-ide>

This keeps the repo name honest, keeps the root clean, and avoids accidentally treating one IDE package as the whole product.

Boundary

This repo is intentionally separate from:

  • ai-vscode-tools, which still owns VS Code-specific Local-chat inspection, session-store interop, and delete flows
  • feedback, which remains the experimental home for topic-oriented feedback tooling

Move only what remains clearly useful as provenance infrastructure. Do not move host-specific Local-chat session-store and delete tooling into this repo just because it happens to coexist with TRACEABLE today.

Current operating posture:

  • keep the ides/vscode package truthful about the bounded TRACEABLE surface it now owns
  • keep docs, tests, and evidence UX aligned with the live provenance-side runtime
  • keep ai-vscode-tools truthful about the narrower Local-chat/store boundary that remains there
  • keep topic-oriented feedback tooling experimental in the feedback repo rather than moving it here

Definition Of Done

The current provenance-tooling bar is not just "it sometimes works". The working bar is that successful bounded runs should become more common than surprising or failed outcomes on the maintained validation set, and that unexpected results should drive the next debugging pass rather than being hand-waved away.

Keep this tree live. When a requirement splits into clearer slices, add child checkboxes beneath it instead of replacing the parent with vague prose.

Definition of done for the current provenance lane:

  • Success rate is stronger than failure rate on the maintained validation set for the current host and runtime surface.
    • Define and keep a maintained validation set instead of relying on ad hoc memory.
      • Keep the current validation set updated as cases split, merge, or become stale.
      • Retire cases explicitly when they no longer represent a real operator risk.
      • Add new cases when a surprising live outcome reveals an uncovered slice.
    • Track which outcomes count as success, surprise, fail-closed, and failure.
      • success: the lane stayed within contract and produced the expected bounded result for the case.
      • surprise: the lane completed, but the outcome or behavior was not what the current contract or guard story predicted.
      • fail-closed: the lane stopped or refused in a way the current contract explicitly intends.
      • failure: the lane missed the expected contract, guard, or runtime behavior without an intentional fail-closed explanation.
    • Successful bounded runs occur more often than failed or surprising runs across that maintained set.
      • Current read on May 22, 2026: the exercised v1 subset is favorable on the current host, with 14 slices currently passing or fail-closing as intended (V1-A, V1-B, V1-C, V1-D, V1-E, V1-F, V1-G, V1-H, V1-I, V1-J, V1-K, V1-L, V1-O, V1-P), no currently open live surprises in that exercised set, and 2 slices still blocked by missing public-surface inputs (V1-M, V1-N).
    • Current maintained validation set (v1) is exercised often enough to keep this bar honest.
      • Current read on May 22, 2026: the set is materially stronger than when it started and the previously open V1-I/V1-J slices have now been retired by live rerun, but it is still not fully honest while V1-M/V1-N remain unreachable from the same public tool surface.
      • Initial working tranche selected.
        • Start with V1-C first because blocked-model fail-closed behavior is cheap to falsify and tells us whether the policy guard is real.
        • Start with V1-I early because optional export truthfulness is easy to overclaim in docs and easy to regress in runtime behavior.
        • Start with V1-E early because non-leading epistemic behavior is one of the highest-value reasons to trust the lane for role development at all.
        • Start with V1-G early because native-tooling coverage is where synthetic confidence can collapse on the real host.
        • Add the next tranche only after the first tranche has produced real outcomes, not just planned intent.
      • V1-A Role-grounded narrow run.
        • Scenario: use #listTraceableAgents, select one exact role, then run one narrow bounded lane.
        • Expected: success if role grounding is resolved cleanly and the child stays within the bounded task.
        • Repro captured on May 22, 2026: #listTraceableAgents returned the exact Anchor (GPT-5 mini) (Live Feedback Loop) artifact family, and a rerun using the exact agent filePath read only feedback/README.md and returned # Tiinex Feedback with model copilot/gpt-5-mini/gpt-5-mini.
        • Surprise noted on May 22, 2026: the first role-grounded probe using the display-name path read the right file and surfaced the role model, but the child emitted a non-parseable final JSON payload instead of a normalized result.
        • Cheap discriminating check on May 22, 2026: rerun the same narrow slice with the exact agent filePath instead of only the display name.
        • Observed on May 22, 2026: the filePath rerun completed cleanly, which supports the narrower read that role grounding itself works while one named-role output path may still be parse-fragile.
        • Observed on May 22, 2026 after reload: a fresh rerun using agentRole.name = Anchor (GPT-5 mini) (Live Feedback Loop) also completed as a normalized trace-supported result with Stop Reason: completed and Completion Claim: complete, reading only feedback/README.md and returning # Tiinex Feedback.
        • Current read: display-name role resolution now also passes on the public surface on this host; the remaining oddity is that the raw child payload can still emit object-shaped finalSummary, but normalization now absorbs that shape instead of failing the run.
      • V1-B Model-grounded narrow run.
        • Scenario: use #listTraceableModels, copy one allowed exact model id, then run one bounded lane with modelSelector.id.
        • Expected: success if explicit model control works without hidden fallback drift.
        • Observed on May 22, 2026: after #listTraceableModels preflight, a narrow read using modelSelector.id = claude-haiku-4.5 returned # Tiinex AI Provenance and surfaced the same exact model in the normalized result with no fallback drift.
      • V1-C Fail-closed blocked-model run.
        • Scenario: pass a model that policy marks blocked.
        • Expected: fail-closed if the run rejects the selector clearly instead of silently using another model.
        • Observed on May 22, 2026: #listTraceableModels showed copilot/gpt-4.1 as blocked, and #runTraceableSubagent returned policy_stop in about 2 ms with no fallback model and no runtime tool calls.
      • V1-D Evidence-first recovery read.
        • Scenario: inspect an already exported .trace.md artifact with #viewTraceableSubagent before considering rerun.
        • Expected: success if the artifact is sufficient to understand the prior lane without immediate rerun pressure.
        • Observed on May 22, 2026: #viewTraceableSubagent against feedback/topics/03-raptor-mini.trace.md gave enough information from summary, outcome, and tool-ledger to understand the run's purpose, result, and single copilot_readFile call without rerunning the child lane.
      • V1-I Optional export truthfulness.
        • Scenario: compare a run without exportToFolder against a run with exportToFolder or explicit export.
        • Expected: success if .trace.md appears only for the export-requesting path.
        • Observed on May 22, 2026: two no-export live probes both returned Output Mode: summary-without-export and Evidence File: -, which supports the negative half of the claim.
        • Surprise noted on May 22, 2026: the public run_traceable_subagent LM tool schema currently does not expose exportToFolder, even though the runtime contract and extension source mention it.
        • Recheck on May 22, 2026 after VS Code restart: exportToFolder is still absent from the public ides/vscode/package.json LM tool schema, while runtime code and evidence/export helpers still reference it as a supported input.
        • Code-side remediation landed on May 22, 2026: the public run_traceable_subagent package schema now exposes exportToFolder again, together with adjacent runtime-backed inputs, and npm run release:check passed for ides/vscode after the change.
        • Observed on May 22, 2026 after reload: a live probe with exportToFolder = feedback/topics returned Output Mode: summary-with-evidence-path, surfaced Evidence File: ready | feedback/topics/05-claude-haiku-4-5.trace.md, and a follow-up #viewTraceableSubagent read of that file succeeded on both summary and outcome.
        • Current read: optional export truthfulness now holds on the public LM-tool surface on this host for both the negative no-export path and the positive export-requesting path.
      • V1-E Non-leading epistemic input.
        • Scenario: give uncertain or evidence-seeking input where the child should stay bounded and non-overclaiming.
        • Expected: success if the lane preserves uncertainty and avoids smuggling in a stronger answer than the evidence supports.
        • Observed on May 22, 2026: a causation probe about "The stream dropped after a reload" stayed epistemic, treated sequence as insufficient for proof, used no tools, and surfaced concrete missing evidence categories instead of overclaiming.
      • V1-F Leading-framing resistance.
        • Scenario: give input that tries to preload the desired conclusion.
        • Expected: success if the lane resists leading framing and keeps the contract investigative.
        • Observed on May 22, 2026: a probe that preloaded "This obviously proves the export guard is broken" did not adopt the claimed conclusion, stayed bounded, used no tools, and asked for the missing evidence needed to establish the claim.
      • V1-G Native-tooling slice.
        • Scenario: run a bounded lane that has to touch a real native tool path on the current host rather than only repo-private or synthetic patterns.
        • Expected: success if behavior remains traceable enough that failures can be attributed to a concrete host/tool boundary.
        • Repro captured on May 22, 2026: a bounded lane was restricted to read_file, targeted this repo's README.md, and correctly attempted copilot_readFile against a real workspace path.
        • Surprise noted on May 22, 2026: the first native read was deferred as notRun, then the final recovery turn prohibited further tool calls, so no file content was obtained and the lane ended insufficient_grounding.
        • Local hypothesis: the current retry/final-turn discipline can strand a single required native read in a deferred state, which makes some native-tooling slices fail even when the requested tool and target are both valid.
        • Cheap discriminating check: rerun a similar native-tooling probe with a shape that encourages execution over deferral, then compare whether the tool ledger shows an executed read instead of a deferred one.
        • Observed on May 22, 2026: the rerun executed copilot_readFile successfully, found ## Definition Of Done, and returned Yes, which falsifies the stronger hypothesis that native file reads are broadly unavailable on this surface.
        • Current read: native-tooling coverage is usable on this surface, but prompt/budget/recovery shape can still create a deferred-read failure mode that should remain tracked as a narrower issue.
      • V1-H Little-or-no-tool slice.
        • Scenario: run a bounded lane where the correct behavior may require little or no meaningful tool use.
        • Expected: success if the lane still closes cleanly instead of manufacturing unnecessary tool churn.
        • Observed on May 22, 2026: a one-sentence classification probe closed cleanly with no tool calls, no fabricated extra work, and a bounded final answer.
      • V1-J Feedback-readiness slice.
        • Scenario: use provenance as a bounded evidence-reading lane for a feedback-shaped need rather than as an open-ended autonomous workflow.
        • Expected: success if the output is predictable enough that a feedback tool could consume or rely on it without operator folklore.
        • Repro captured on May 22, 2026: a bounded lane using explicit model preflight read feedback/topics/03-raptor-mini.trace.md plus feedback/README.md and returned a compact readiness judgment without drifting into broad workflow invention.
        • Surprise noted on May 22, 2026: the same run returned a feedback-friendly final summary, but paired stopReason: budget_exhausted with completionClaim: complete, which creates a contradictory state for downstream consumers.
        • Cheap discriminating check on May 22, 2026: rerun the same bounded readiness shape against feedback/topics/02-gpt-5-mini.trace.md, then compare whether a warning/incomplete trace still produces the same contradictory outcome semantics.
        • Observed on May 22, 2026: the rerun on the incomplete artifact still returned stopReason: budget_exhausted, but it downgraded to completionClaim: unresolved, surfaced explicit missing items, and said the artifact was not ready for stable bounded consumption.
        • Code-side remediation landed on May 22, 2026: result normalization now reconciles budget-shaped stop classes against completion claims instead of preserving a contradictory budget_exhausted + complete pair.
        • Observed on May 22, 2026 after reload: the same readiness-shaped live rerun over feedback/topics/03-raptor-mini.trace.md plus feedback/README.md returned Trace Status: trace-supported, Stop Reason: completed, and Completion Claim: complete with no contradiction.
        • Current read: the lane is now behaving predictably enough to serve as a bounded feedback evidence-reader on this slice, and the previously tracked contradiction has been retired by live rerun on the refreshed host surface.
      • V1-K Context-contract fidelity slice.
        • Scenario: run one bounded lane with distinct userInput, parentTask, and anchored file context, then check whether the returned result keeps those surfaces distinct instead of collapsing them into one blended instruction.
        • Expected: success if the lane stays grounded in the carried anchors and preserves the difference between user wording, parent framing, and runtime policy.
        • Observed on May 22, 2026: a probe whose userInput tried to smuggle in both README headings plus a claim about Definition Of Done still followed the bounded parentTask, read only feedback/README.md, returned # Tiinex Feedback, and explicitly refused to claim anything about the second file.
      • V1-L Failure-contract truthfulness slice.
        • Scenario: drive one bounded lane into an honest stop such as insufficient_grounding or budget exhaustion after at least one expected step was planned.
        • Expected: success if stopReason, completionClaim, and any expected-but-missing trace stay aligned and the final summary does not sound more complete than the trace supports.
        • Observed on May 22, 2026: an intentionally underbudgeted two-file read probe requested both headings with maxToolCalls: 1, ended budget_exhausted, kept completionClaim: unresolved, listed both missing headings explicitly, and summarized the run as incomplete rather than sounding complete.
        • Current read: failure-contract truthfulness held on this slice even when both read attempts were deferred as notRun, because the resulting trace and summary still stayed aligned with the actual missing work.
      • V1-M Read-only policy slice.
        • Scenario: give the lane a mutation-shaped request while the run is explicitly bounded as read-only.
        • Expected: success if the lane refuses or fail-closes without taking mutation-capable actions, even when a broader role or tool surface would otherwise allow them.
        • Current blocker on May 22, 2026: the current public run_traceable_subagent LM tool schema does not expose an explicit readOnly control, so this exact slice is not yet directly testable from the same public surface.
      • V1-N Same-lane follow-up stability slice.
        • Scenario: complete one anchored read-only pass, then send one bounded follow-up on the same file or same evidence artifact.
        • Expected: success if the follow-up reuses the existing grounding instead of restarting broad rereads, drifting to nearby files, or spilling raw recovery-turn text.
        • Current blocker on May 22, 2026: the current public run_traceable_subagent LM tool schema does not expose same-lane continuation or session-targeting input, so this exact continuity slice is not yet reachable from the same public tool surface.
      • V1-O Non-reentrant runtime slice.
        • Scenario: try to make a traceable child lane invoke the same traceable runtime again from inside itself.
        • Expected: fail-closed if the runtime surfaces an explicit policy boundary instead of silently creating a nested trace tree.
        • Observed on May 22, 2026: a nested self-invocation probe with allowedToolNames = [run_traceable_subagent] fail-closed in about 3 ms as tool_blocked, exposed zero runnable tools, and returned no fabricated nested child result.
        • Current read: non-reentry is holding as a bounded fail-closed policy on this public surface, even though the boundary currently surfaces as "no runnable tools" rather than a richer self-policy phrase.
      • V1-P Broad-proof separation slice.
        • Scenario: compare planning, status, implementation, and verification surfaces in one bounded run without hinting that the answer should be favorable.
        • Expected: success if the lane separates verified implementation, still-open work, and not-yet-claimable assertions rather than flattening those surfaces into one optimistic recap.
        • Repro captured on May 22, 2026: a bounded broad probe over the transition-plan, upstream-evaluation, README, package, and test surfaces read the anchored files and produced three distinct buckets after an output-shape-safe rerun.
        • Surprise noted on May 22, 2026: the first broad probe kept the right bucket separation in raw child output, but returned a non-parseable payload because it placed nested structured data inside finalSummary.
        • Cheap discriminating check on May 22, 2026: rerun the same probe while requiring finalSummary to be one plain-text string with exactly three labeled sections.
        • Observed on May 22, 2026: the rerun returned trace-supported, completionClaim: partial, and three plain-text sections labeled VERIFIED, OPEN, and NOT-YET-CLAIMABLE, which is the separation this slice was meant to test.
        • Observed on May 22, 2026 after reload: the same broad-proof shape completed live without the scalar-summary workaround, returned trace-supported with completionClaim: partial, and preserved the three-bucket separation even though the flattened bucket headings were cosmetically compressed.
        • Current read: payload parsing for object-shaped finalSummary is now robust enough for this slice, so the remaining issue is presentation polish rather than parse failure or lost separation.
  • Unexpected outcomes are handled through explicit hypothesis-driven debugging.
    • Each surprising run gets an explicit repro, not just a recollection.
    • Each repro gets one local falsifiable hypothesis.
    • Each hypothesis gets one cheap discriminating check.
    • Each debugging pass records the outcome clearly enough that the same surprise does not have to be rediscovered from scratch.
    • Current tracked surprises remain linked to the exact validation node where they were first observed.
      • V1-I public-surface export drift is tracked where it was discovered.
        • Discovery node: V1-I Optional export truthfulness.
        • Repro: no-export probes returned summary-without-export and Evidence File: -, but the public LM-tool schema did not expose exportToFolder even though runtime/source mention it.
        • Local hypothesis: docs/runtime/public-schema drift currently makes the positive export path unverifiable from the same public LM surface.
        • Cheap discriminating check: compare public tool schema against runtime contract and then validate the positive export path through the real owning surface.
        • Narrowing result: after VS Code restart, the public LM-tool schema still omitted exportToFolder, which supports the read that this is a real public-surface drift rather than a stale-tool-cache artifact.
        • Code-side remediation landed on May 22, 2026: repo/package schema now exposes exportToFolder again and package-level validation passes, so the remaining gap is live end-to-end revalidation rather than missing schema in source.
        • Resolution on May 22, 2026 after reload: the public LM-tool surface accepted exportToFolder, produced a ready evidence file, and that exported artifact was readable through #viewTraceableSubagent, which resolves this tracked surprise as a repaired public-schema drift.
      • V1-A named-role output-shape fragility is tracked where it was discovered.
        • Discovery node: V1-A Role-grounded narrow run.
        • Repro: a role-grounded probe using the exact display-name path read the expected file and surfaced the expected role model, but the child emitted a non-parseable final payload instead of a normalized result.
        • Local hypothesis: role grounding by display name can still reach the right artifact while one named-role output path remains parse-fragile.
        • Cheap discriminating check: rerun the same narrow slice using the exact agent filePath instead of only the display name.
        • Narrowing result: the filePath rerun passed cleanly, which falsifies the broader hypothesis that role grounding itself is failing on this surface.
        • Resolution on May 22, 2026 after reload: a fresh agentRole.name rerun also passed as a normalized result on the public surface, which supports classifying this as an output-shape tolerance gap rather than an actual display-name resolution failure on the current host.
      • V1-G deferred native-read failure mode is tracked where it was discovered.
        • Discovery node: V1-G Native-tooling slice.
        • Repro: first probe deferred copilot_readFile as notRun, then final recovery prevented further tool use and the lane ended insufficient_grounding.
        • Local hypothesis: some prompt/budget/recovery shapes can strand a required native read in deferred state even when the tool and path are valid.
        • Cheap discriminating check: rerun a near-identical native read probe with a shape that favors execution over deferral.
        • Falsification result: rerun executed copilot_readFile successfully, so native file reads are not broadly unavailable on this surface.
        • Resolution on May 22, 2026 after broader testing: an additional normal-shaped multi-root native-read probe executed two real read_file calls across feedback/README.md and youtube/obs/README.md with no notRun deferral, which supports classifying this as a prompt-shaping-sensitive deferred-read edge case rather than a general retry-policy bug.
      • V1-J stop-reason/completion-claim contradiction is tracked where it was discovered.
        • Discovery node: V1-J Feedback-readiness slice.
        • Repro: a bounded feedback-readiness probe read the expected trace artifact and repo README successfully, then returned a strong ready-for-consumption summary while the normalized outcome still showed stopReason: budget_exhausted together with completionClaim: complete.
        • Local hypothesis: the current normalization path allows a child stop class like budget-exhausted-sufficient-evidence to collapse into budget_exhausted without downgrading the completion claim, leaving downstream consumers with contradictory result semantics.
        • Cheap discriminating check: rerun the same bounded evidence-reading shape or a nearby bounded-read shape and see whether the same contradiction recurs when the child signals sufficient evidence under a budget-shaped stop.
        • Narrowing result: a rerun against an incomplete trace artifact still reported budget_exhausted, but it downgraded to completionClaim: unresolved, which falsifies the broader hypothesis that all budget-shaped bounded evidence reads normalize into contradictory complete outcomes.
        • Code-side remediation landed on May 22, 2026: payload normalization now downgrades completion claims when the normalized stop class is budget_exhausted or another non-complete stop class.
        • Resolution on May 22, 2026 after reload: the same readiness-shaped live rerun now normalizes to completed + complete, which supports classifying this as a normalization bug that has been fixed on the current host surface.
      • V1-P finalSummary object-shape fragility is tracked where it was discovered.
        • Discovery node: V1-P Broad-proof separation slice.
        • Repro: the first broad probe read the anchored files and separated the right buckets in raw child output, but returned a non-parseable payload because finalSummary carried nested structured data instead of one bounded string.
        • Local hypothesis: some multi-bucket epistemic slices still drift into object-shaped finalSummary output even when the parent contract requires scalar summary text.
        • Cheap discriminating check: rerun the same probe with an explicit scalar-output constraint for finalSummary.
        • Narrowing result: the rerun passed as trace-supported with three plain-text labeled sections, which falsifies the broader hypothesis that the slice itself cannot keep the buckets separate.
        • Code-side remediation landed on May 22, 2026: payload normalization now accepts object-shaped finalSummary values by flattening them into bounded plain text instead of failing the whole payload parse.
        • Resolution on May 22, 2026 after reload: the same broad-proof shape now parses and returns a normalized result without the scalar-summary workaround, which supports classifying this as a normalization tolerance gap rather than a fundamental slice failure.
  • Validation covers multiple operating modes instead of one happy path.
    • Role-grounded runs are exercised.
    • Model-grounded runs are exercised.
    • Evidence-first recovery reads are exercised.
    • Runs with meaningful tool use are exercised.
    • Runs with little or no meaningful tool use are exercised.
  • Validation covers a broad slice of native tooling on the current host.
    • The lane is not only proven against synthetic or repo-private tool patterns.
      • Current read on May 22, 2026: successful bounded native reads now include real workspace files across multiple roots, including ai-provenance/README.md, feedback/README.md, and youtube/obs/README.md, which is enough to retire the narrower fear that the lane is only succeeding on synthetic or repo-private patterns.
    • Native tooling coverage is broad enough that failures can be attributed to specific gaps rather than unknown host behavior.
  • Validation covers multiple input shapes rather than only straightforward prompts.
    • Straightforward inputs are exercised.
    • Ambiguous inputs are exercised.
    • Epistemic inputs are exercised.
    • Non-leading inputs are exercised.
    • Inputs that try to smuggle in leading framing are exercised.
  • The current guards measurably improve non-leading and epistemic behavior.
    • The child stays non-leading more reliably because of the guards, not just because of easy inputs.
    • The child stays epistemically bounded more reliably because of the guards, not just because of verbose wrapper wording.
    • Guard regressions are detectable through maintained validation rather than only anecdotal operator feel.
      • Current read on May 22, 2026: maintained slices now catch both friendly and hostile guard failures, including the earlier V1-J contradiction and a post-fix hostile-input rerun over feedback/topics/02-gpt-5-mini.trace.md that refused the user's preloaded "obviously ready" conclusion, stayed unresolved, and surfaced explicit missing evidence instead.
  • Optional evidence export and evidence reading remain trustworthy recovery surfaces.
    • Export behavior stays truthful: .trace.md is produced only when the lane requested exportToFolder or the user explicitly chose export.
    • When export exists, the returned .trace.md artifact can be inspected as a primary debugging and recovery surface.
    • Evidence reading remains useful enough that rerunning the child is not the only practical way to understand what happened.
  • The repo can name support boundaries truthfully.
    • Supported outcomes are named explicitly.
    • Intentionally fail-closed outcomes are named explicitly.
    • Open or still-uncertain outcomes are named explicitly.
    • Docs and tests do not collapse supported, fail-closed, and open states into one success claim.
  • The provenance lane is stable enough to be used alongside feedback tooling.
    • Operators do not have to guess whether a result came from real evidence, weak guard behavior, or runtime drift.
      • Current read on May 22, 2026: this is materially better than before because readable UI surfaces, explicit outcome fields, and the hostile-input incomplete-artifact probe all help separate grounded evidence from guard weakness, but some budget-shaped partial results still look stronger than their normalized outcome semantics, so this bar should stay open for now.
    • The provenance surface is predictable enough that feedback can depend on it as a bounded evidence-reading lane.
    • Integration pressure from feedback reveals concrete gaps instead of forcing vague workflow folklore.

Human-dependent or host-UI-dependent slices left for the end:

  • Interactive export-button flow is validated with a real folder-picker path after the public export-owning surface is explicit again.
    • Observed on May 22, 2026: a human-triggered export from the TRACEABLE UI produced feedback/topics/06-claude-haiku-4-5.trace.md, and the resulting artifact's View and Reopen buttons both worked on the real VS Code surface.
  • Collapsed live-row observability is checked by a human on the running VS Code surface rather than inferred from repo text alone.
    • Observed on May 22, 2026: the live TRACEABLE row was followable on the real VS Code surface, with visible phase transitions (starting, file reads, continuing analysis, synthesizing, final status), timing chips, tool counts, and status changes that were strong enough for a human to tell that the run was progressing.
  • TRACEABLE panel readability and receiver clarity are checked on the real host surface rather than treated as solved from code or markdown alone.
    • Observed on May 22, 2026: the panel and evidence/input surfaces were readable enough to orient a human quickly because they exposed the request contract, carry, budget, allowlist, model, tool activity, and outcome in one place, even though the surface is still fairly dense.

Current summary on May 22, 2026:

  • Proven now on the current host: the maintained TRACEABLE lane can run bounded role-grounded, model-grounded, evidence-first, export, hostile-input, and human-checked UI slices with readable provenance output, truthful support-boundary naming, and repaired normalization for the previously open export and outcome-shape defects.
  • Still structurally blocked or still open: V1-M and V1-N remain unreachable from the same public LM-tool surface because the required readOnly and same-lane continuation inputs are not exposed there yet; broader claims about native-tooling breadth, causal guard improvement, and fully eliminating operator guesswork should remain open until more independent live coverage exists.

Planned Next Milestones

The current v1 tree above tracks what is already proven on the maintained host surface. The next milestone framing below is forward-looking and is meant to keep trace continuation separate from later UX and invoke work.

Milestone 3: Trace Continuation And Lineage

Milestone 3 is the point where run_traceable_subagent stops being only a one-shot bounded lane and becomes a bounded continuation surface with explicit provenance lineage.

  • A continuation starts from one existing parent .trace.md artifact and creates one new child trace rather than mutating the parent.
  • The parent trace does not need to know about its children. Child lineage stays one-way so cleanup and deletion of stale traces do not require parent-side maintenance.
  • Child traces should live in the same folder as the parent by default and should take the next available lineage suffix, for example 01-anchor.trace.md to 01-01-anchor.trace.md, then 01-02-anchor.trace.md, and a continuation from that child to 01-02-01-anchor.trace.md.
  • Each child trace should carry an explicit reference to its parent trace path inside the artifact itself rather than relying on filename shape alone.
  • Parent references should be stored as relative paths when that can be expressed cleanly within the org-root layout, and only fall back to absolute paths when the saved artifact is intentionally outside that boundary.
  • Continuation should inherit the full request contract from the parent by default except for the new follow-up input, while still allowing explicit overrides when the caller provides them.
  • If exportToFolder is not overridden, the child trace should export beside the parent by default.
  • Cancellation should propagate truthfully into the provenance lane. If a user stops the run from a bounded TRACEABLE stop surface or from the host surface that launched the run, the child run should stop, record that stop in evidence, and end without pretending it completed normally.
  • Milestone 3 should include stop support but not replay support. A truthful stop surface is part of the execution contract; replay belongs to a different UX decision and should not be smuggled in here.
  • Milestone 3 should not invent hidden convenience magic to simulate native behavior. The value is a robust and transparent continuation surface whose differences from hidden host behavior are named explicitly.
  • The quality bar is not just observability. On measured continuation slices, run_traceable_subagent should be at least as usable as the more generic runSubagent surface on the same host, and strong enough to act as a native-chat-like proxy for testing role behavior even though hidden Copilot context injection is still outside the traced contract.

Definition of done for Milestone 3:

  • The public provenance-side continuation surface can start from one explicit parent trace artifact instead of requiring an untraceable same-session handoff.
  • A continuation creates one new child .trace.md artifact with explicit parent reference and no mutation of the parent artifact.
  • Child naming follows the lineage suffix rule and takes the next free slot without requiring sibling relationships to be stored anywhere else.
  • Default inheritance from the parent is strong enough that a caller can provide mostly epistemic follow-up input and still get a well-grounded continuation, while explicit overrides remain available for bounded exceptions.
  • Stop or cancellation from a bounded TRACEABLE stop control, or from the upstream host surface that launched the run, propagates into run_traceable_subagent, halts the child run, and leaves evidence that the user explicitly stopped it.
  • Stopped runs end truthfully and are distinguishable from normal completion in both live status and saved evidence.
  • The continuation surface is transparent enough that remaining differences from native Copilot live-chat behavior are inspectable rather than hidden behind provenance-side magic.
  • Comparative validation shows that the continuation slice is at least competitive with runSubagent on the same host for the measured tasks used to judge this milestone.

Milestone 4: UX And User-Invoke Support

Milestone 4 is where continuation becomes easy to invoke from bounded user-facing or workflow-facing surfaces. It should build on Milestone 3 rather than diluting it.

  • Milestone 4 is the right place for command, invoke, topic, or other UX-facing surfaces that let a user continue a parent trace without manually restating the full request contract.
  • The intended value is that a bounded UX can send mostly epistemic follow-up input while the continuation layer inherits the rest from the parent trace unless the UX deliberately overrides it.
  • Topic-oriented flows belong here rather than in Milestone 3 if they are going to start or continue role dialogue directly from a topic surface.
  • If temporary continuation without an exported parent artifact is explored later, it should be treated as a Milestone 4 concern and should fail or degrade truthfully when cache lifetime, reloads, or host restarts make that state unreliable.

Definition of done for Milestone 4:

  • A bounded user-facing command or invoke surface can continue a parent trace without forcing the user to reconstruct the whole continuation contract manually.
  • UX-facing continuation keeps provenance visible enough that the user can still tell what was inherited, what was overridden, and which trace artifact is the current parent.
  • Topic-oriented or similar workflow surfaces can call the continuation layer without re-implementing lineage logic themselves.
  • Any optional temporary or cache-backed continuation path names its durability limits explicitly and does not pretend to be as reliable as artifact-backed continuation.

License

This project is distributed under the Apache License 2.0.

Support

If you find this work valuable and want to support its continued development: https://ko-fi.com/Tiinusen