From f8c79dc1e1620cdc92767005c1efcb6b50a2f6d9 Mon Sep 17 00:00:00 2001 From: "Marcus R. Brown" Date: Tue, 23 Jun 2026 22:46:45 -0700 Subject: [PATCH] docs: capture three learnings from the latest review-heavy PRs Add learnings on consolidating workflows without losing their invariants and caller contracts, writing requirements docs that survive verification, and safely deleting a superseded signal path, then tighten the cross-link graph among the privacy-gate docs so the related patterns reference each other. Closes #3582 Closes #3583 Closes #3584 Closes #3585 --- ...ning-and-fail-soft-telemetry-2026-06-04.md | 1 + ...mint-time-permission-scoping-2026-06-22.md | 3 + ...te-promotion-leak-prevention-2026-06-04.md | 17 +- ...-privacy-gates-shared-module-2026-06-22.md | 4 +- ...ts-doc-survives-verification-2026-06-24.md | 172 ++++++++++++++++++ ...-page-structured-attribution-2026-06-04.md | 3 +- .../doc-drift-cleanup-pattern-2026-04-18.md | 1 + ...y-workflow-side-privacy-gate-2026-05-16.md | 1 + ...ous-rollout-tracker-workflow-2026-06-17.md | 10 + ...-superseded-workflow-removal-2026-06-24.md | 141 ++++++++++++++ ...onsolidation-invariant-trace-2026-06-24.md | 152 ++++++++++++++++ 11 files changed, 501 insertions(+), 4 deletions(-) create mode 100644 docs/solutions/best-practices/requirements-doc-survives-verification-2026-06-24.md create mode 100644 docs/solutions/workflow-issues/safe-superseded-workflow-removal-2026-06-24.md create mode 100644 docs/solutions/workflow-issues/workflow-consolidation-invariant-trace-2026-06-24.md diff --git a/docs/solutions/best-practices/byte-exact-gateway-signing-and-fail-soft-telemetry-2026-06-04.md b/docs/solutions/best-practices/byte-exact-gateway-signing-and-fail-soft-telemetry-2026-06-04.md index 2090bf8ac..423ee551c 100644 --- a/docs/solutions/best-practices/byte-exact-gateway-signing-and-fail-soft-telemetry-2026-06-04.md +++ b/docs/solutions/best-practices/byte-exact-gateway-signing-and-fail-soft-telemetry-2026-06-04.md @@ -230,3 +230,4 @@ await emitPagesChanged(pagesChanged) - `docs/solutions/best-practices/diagnostic-patches-observability-discipline-2026-05-20.md` — stderr discipline; this applies the same discipline with the opposite operational goal (fail-soft, not fail-loud). - `docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md` — public-surface leak prevention; this adds the "don't render a placeholder public value" rule. - `docs/solutions/workflow-issues/github-actions-step-output-interpolation-2026-04-21.md` — pass step outputs through `env:`, never interpolate into `run:`. +- `docs/solutions/best-practices/requirements-doc-survives-verification-2026-06-24.md` — reconcile against the deployed verifier's code, not the design doc; the "live source over prose" discipline generalized to requirements documents. diff --git a/docs/solutions/best-practices/credential-mint-time-permission-scoping-2026-06-22.md b/docs/solutions/best-practices/credential-mint-time-permission-scoping-2026-06-22.md index 31dcdd69e..db7d1af33 100644 --- a/docs/solutions/best-practices/credential-mint-time-permission-scoping-2026-06-22.md +++ b/docs/solutions/best-practices/credential-mint-time-permission-scoping-2026-06-22.md @@ -97,3 +97,6 @@ Prose describes intent. Live data describes reality. Ground the review in realit - [Agent and automation steps need their GitHub token wired explicitly](../workflow-issues/required-github-token-for-agent-steps-2026-06-22.md) — the companion pattern: restrict capability by token scope, not by token absence; keep privileged operations in a separate step with its own credential. - [Survey workflow-side privacy gate](../security-issues/survey-workflow-side-privacy-gate-2026-05-16.md) — the same separate-step-with-its-own-credential pattern, where a privacy gate runs under a distinct token so the boundary is enforced by which step holds which credential. - [Diagnostic patches observability discipline](../best-practices/diagnostic-patches-observability-discipline-2026-05-20.md) — verify behavior against live state rather than assumptions; the same discipline applied to observability patches. +- [Writing a requirements doc that survives verification](requirements-doc-survives-verification-2026-06-24.md) — + validate a plan's security invariants against live data, not its prose; the same "ground the + review in reality" discipline applied to requirements documents. diff --git a/docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md b/docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md index 258c913e2..9ac0bd33b 100644 --- a/docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md +++ b/docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md @@ -1,7 +1,7 @@ --- title: Privacy Gate Design for Data→Main Promotion Leak Prevention date: 2026-06-04 -last_updated: 2026-06-04 +last_updated: 2026-06-24 verified: 2026-06-04 category: best-practices module: github-workflows @@ -194,3 +194,18 @@ return {ok: false, matchedFiles: redactedFiles} fallback path must reuse the exact fail-closed predicates of the main gate. - Issues: #3407 (wire the gate), #3408 (operator-actionable blocked output), #3429 (resolver PAT hygiene), #3430 (redact node_ids in failure output), #3424 (accepted commit-history exposure). + +## See also — privacy-gate correctness patterns + +- [Structured-first attribution for public-allowlist privacy gates](wiki-page-structured-attribution-2026-06-04.md) — + three-state frontmatter read (absent / present-but-malformed / present-with-URLs) and the + substring/prefix/truthy leak vectors that structured attribution closes. +- [Survey workflow-side privacy gate](../security-issues/survey-workflow-side-privacy-gate-2026-05-16.md) — + defense-in-depth at the dispatch boundary: the workflow is its own privacy boundary, not a + downstream consumer of someone else's gate. +- [Pure-core privacy gates with a shared module and mutation-proof tests](pure-core-privacy-gates-shared-module-2026-06-22.md) — + gate in the pure core before sensitive data enters shared state; one shared module so + chokepoints cannot diverge; mutation-proof tests that fail when the gate is removed. +- [Verify the whole public perimeter](../security-issues/verify-whole-public-perimeter-2026-06-22.md) — + enumerate every public surface before claiming a privacy invariant holds; a gate that covers + only the surfaces you thought of is not a gate. diff --git a/docs/solutions/best-practices/pure-core-privacy-gates-shared-module-2026-06-22.md b/docs/solutions/best-practices/pure-core-privacy-gates-shared-module-2026-06-22.md index a76805e01..ee39c3904 100644 --- a/docs/solutions/best-practices/pure-core-privacy-gates-shared-module-2026-06-22.md +++ b/docs/solutions/best-practices/pure-core-privacy-gates-shared-module-2026-06-22.md @@ -118,7 +118,7 @@ it('mutation proof: digest assembly drops private prose', () => { ## Related -- [Privacy-gate promotion leak prevention](../best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md) — the trusted-chokepoint and fail-closed-resolution patterns this gate extends into the pure core. -- [Wiki page structured attribution](../best-practices/wiki-page-structured-attribution-2026-06-04.md) — the present-but-empty vs absent distinction recurs here; encode it as a habit. +- [Privacy-gate promotion leak prevention](../best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md) — the trusted-chokepoint and fail-closed-resolution patterns this gate extends into the pure core. Its "See also — privacy-gate correctness patterns" section indexes the full cluster of gate docs, including this one. +- [Wiki page structured attribution](../best-practices/wiki-page-structured-attribution-2026-06-04.md) — the present-but-empty vs absent distinction recurs here; encode it as a habit. The three-state frontmatter read (absent / present-but-malformed / present-with-URLs) is the same discipline applied to structured provenance. - [Survey workflow-side privacy gate](../security-issues/survey-workflow-side-privacy-gate-2026-05-16.md) — verify privacy inside the trusted workflow before any public side effect. - [Private repo dispatch visibility gate](../security-issues/private-repo-dispatch-visibility-gate-2026-05-08.md) — the fail-closed predicate and opaque-identifier redaction this gate builds on. diff --git a/docs/solutions/best-practices/requirements-doc-survives-verification-2026-06-24.md b/docs/solutions/best-practices/requirements-doc-survives-verification-2026-06-24.md new file mode 100644 index 000000000..c6e5af6f1 --- /dev/null +++ b/docs/solutions/best-practices/requirements-doc-survives-verification-2026-06-24.md @@ -0,0 +1,172 @@ +--- +title: Writing a requirements doc that survives verification +date: 2026-06-24 +last_updated: 2026-06-24 +problem_type: best_practice +category: best-practices +component: development_workflow +module: github-workflows +severity: high +verified: 2026-06-24 +tags: + - requirements + - verification + - closed-schema + - dual-source + - sequencing + - planning +applies_when: + - writing or reviewing a requirements or brainstorm document + - the document touches a closed schema or a dual-source constraint + - a success criterion removes infrastructure while a replacement decision is still open + - a reviewer dissents and the dissent is at risk of being erased rather than recorded +--- + +# Writing a requirements doc that survives verification + +## Context + +Requirements docs that hand-wave get falsified in review. The strong ones make checkable claims: +every cited file, step, and contract can be verified against the live tree. The weak ones +describe intent in prose that sounds correct until someone opens the actual source. + +Two failure modes are especially costly because they are unrecoverable if violated: + +1. **Closed-schema dual-source misses.** A schema that is closed by construction — for example, + an event-gateway `EventType` union plus a `VALID_EVENT_TYPES` runtime set — requires updating + both sources in lockstep. A control-plane POST that sends an event type not present in both + sources becomes a hard `400`. The validator enforces completeness; it does not forgive a + partial update. + +2. **Removal-before-replacement gaps.** A success criterion that removes infrastructure (e.g., + a webhook poster) while an open question leaves the replacement's dormant-vs-live decision + unsettled can strand a daily run with no coverage in the window between removal and + replacement going live. + +Neither failure announces itself during planning. Both surface in production. + +## Guidance + +### 1. Make checkable claims + +Every cited file, step, and contract in the requirements doc must exist in the live tree. Verify +each one before the doc is considered complete: + +- "The gateway validates `EventType` against `VALID_EVENT_TYPES`" → open the gateway source and + confirm the union and the runtime set are both present and in sync. +- "The webhook poster runs daily at 06:00 UTC" → open the workflow file and confirm the cron. +- "Redacted entries use `node_id` as the stable join key" → read the live metadata and confirm + the shape. + +Prose describes intent. Live source describes reality. Ground the doc in reality. + +**Before (hand-wavy):** +> The gateway will validate the event type before posting. + +**After (checkable):** +> The gateway validates `event_type` against the `EventType` union in `src/types.ts` and the +> `VALID_EVENT_TYPES` set in `src/validate.ts`. Both must be updated when a new event type is +> added; a partial update produces a hard `400`. + +### 2. Name closed-schema dual-source constraints explicitly + +When a schema is closed by construction, the doc must name both sources and state the lockstep +requirement. "Update the schema" is not enough — it must say "update both `EventType` in +`src/types.ts` and `VALID_EVENT_TYPES` in `src/validate.ts`." + +The dual-source update pair is the unit of correctness. A reviewer who sees only one source +updated should block the change. + +```ts +// Both must be updated together — the validator enforces completeness +export type EventType = 'survey.completed' | 'invitation.accepted' | 'repo.archived' + +export const VALID_EVENT_TYPES = new Set([ + 'survey.completed', + 'invitation.accepted', + 'repo.archived', +]) +``` + +### 3. Record dissent rather than erasing it + +When a reviewer raises an objection that is heard but not adopted, record it as +advisory-but-not-adopted in the doc. The road-not-taken is part of the design record. A future +reader who encounters the same objection will know it was considered, not overlooked. + +> **Reviewer note (advisory, not adopted):** Suggested deferring removal of the webhook poster +> until the gateway replacement is confirmed live. Accepted the risk given the short window; +> tracked as a sequencing dependency in the open questions. + +Erasing the dissent makes the doc look more confident than it is and loses the context that +would help a future reader understand why the sequencing was chosen. + +### 4. Surface sequencing gaps between removal and replacement + +A success criterion that removes infrastructure while a replacement decision is still open is a +sequencing gap. Name it explicitly in the open questions section: + +> **Open question:** Is the gateway replacement live before the webhook poster is removed, or +> does a window exist where neither fires? If a window exists, what is the fallback for daily +> runs in that window? + +A gap that is named is a gap that can be closed. A gap that is implicit becomes a production +incident. + +## Why This Matters + +- **Closed-schema dual-source misses** produce hard `400`s that are unrecoverable without a + code change and redeploy. The validator enforces completeness; it does not forgive a partial + update. +- **Removal-before-replacement gaps** strand daily runs with no coverage. The window may be + short, but it is real, and it is invisible in the requirements doc unless named. +- **Erased dissent** removes the design record that would help a future reader understand why + a risky sequencing was chosen. The road-not-taken is part of the architecture. + +## When to Apply + +Apply this discipline when writing or reviewing any requirements or brainstorm document, +especially ones that: + +- Touch a closed schema or a dual-source constraint. +- Include a success criterion that removes infrastructure. +- Have a reviewer dissent that is at risk of being resolved by deletion. + +## Examples + +### Dual-source update pair + +A new event type `repo.archived` must appear in both sources: + +```ts +// src/types.ts — add to the union +export type EventType = 'survey.completed' | 'invitation.accepted' | 'repo.archived' + +// src/validate.ts — add to the runtime set +export const VALID_EVENT_TYPES = new Set([ + 'survey.completed', + 'invitation.accepted', + 'repo.archived', // ← must be added here too; omitting produces a hard 400 +]) +``` + +### Sequencing gap made explicit + +> **Success criterion:** Remove the legacy webhook poster from `owner/example-repo`. +> +> **Open question (sequencing):** The gateway replacement must be confirmed live and posting +> correctly before the webhook poster is removed. If the poster is removed first, the daily +> `survey.completed` event has no sender for the duration of the gap. Resolution: gate the +> removal PR on a confirmed gateway post in the prior 24 hours. + +## See also + +- [Byte-exact HMAC signing and fail-soft telemetry](byte-exact-gateway-signing-and-fail-soft-telemetry-2026-06-04.md) — + reconcile against the deployed verifier's code, not the design doc; the same "live source + over prose" discipline applied to a signing contract. +- [A second credential gives rotation isolation, not permission isolation](credential-mint-time-permission-scoping-2026-06-22.md) — + validate a plan's security invariants against live data, not its prose. +- [Verify the whole public perimeter](../security-issues/verify-whole-public-perimeter-2026-06-22.md) — + enumerate every public surface before claiming a privacy or security invariant holds. +- [Observability before structural change](observability-before-structural-change-2026-06-09.md) — + confirm the current behavior is observable before removing or replacing the path that produces it. diff --git a/docs/solutions/best-practices/wiki-page-structured-attribution-2026-06-04.md b/docs/solutions/best-practices/wiki-page-structured-attribution-2026-06-04.md index 273360673..df6bc6b50 100644 --- a/docs/solutions/best-practices/wiki-page-structured-attribution-2026-06-04.md +++ b/docs/solutions/best-practices/wiki-page-structured-attribution-2026-06-04.md @@ -140,7 +140,8 @@ if (structuredSources !== null) { - `docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md` — the companion promotion-diff gate; same fail-closed-under-uncertainty principle, different surface - (content scan vs slug attribution). + (content scan vs slug attribution). See also its "See also — privacy-gate correctness patterns" + index for the full cluster of gate docs. - `docs/solutions/security-issues/private-repo-dispatch-visibility-gate-2026-05-08.md` — the fail-closed-on-unknown predicate this attribution model extends. - `docs/solutions/security-issues/survey-workflow-side-privacy-gate-2026-05-16.md` — verifying diff --git a/docs/solutions/documentation-gaps/doc-drift-cleanup-pattern-2026-04-18.md b/docs/solutions/documentation-gaps/doc-drift-cleanup-pattern-2026-04-18.md index 56f2d1b7f..1dfe3700f 100644 --- a/docs/solutions/documentation-gaps/doc-drift-cleanup-pattern-2026-04-18.md +++ b/docs/solutions/documentation-gaps/doc-drift-cleanup-pattern-2026-04-18.md @@ -167,3 +167,4 @@ git log --oneline -15 # recent-change context - `.agents/skills/generating-project-docs/SKILL.md` — the skill this pattern operationalizes - `docs/solutions/runtime-errors/octokit-invitation-method-names-2026-04-17.md` — sibling discipline doc +- `docs/solutions/workflow-issues/safe-superseded-workflow-removal-2026-06-24.md` — the workflow-file analog: retiring a superseded workflow requires the same "no remaining callers, shared dependencies preserved" confirmation before deletion. diff --git a/docs/solutions/security-issues/survey-workflow-side-privacy-gate-2026-05-16.md b/docs/solutions/security-issues/survey-workflow-side-privacy-gate-2026-05-16.md index abc0c0b24..1210dded8 100644 --- a/docs/solutions/security-issues/survey-workflow-side-privacy-gate-2026-05-16.md +++ b/docs/solutions/security-issues/survey-workflow-side-privacy-gate-2026-05-16.md @@ -262,3 +262,4 @@ The breadth of the recheck gate matters as much as its existence. Routing the re - Related (diagnostic discipline): `docs/solutions/best-practices/diagnostic-patches-observability-discipline-2026-05-20.md` — the three-PR investigation that surfaced the stderr-swallowing, printf-marker, and App-token issues fixed inline above - Compliance: `docs/solutions/workflow-issues/github-actions-step-output-interpolation-2026-04-21.md` — env-var-only shell expansion pattern used in every new `run:` block - Compliance: `docs/solutions/runtime-errors/autonomous-pipeline-silent-failures-2026-04-19.md` — aggregate status semantics in `Survey Repo`; same principle of "downstream steps must check upstream conclusions, not assume agent success" +- Related (promotion-boundary gate): `docs/solutions/best-practices/privacy-gate-promotion-leak-prevention-2026-06-04.md` — the trusted-chokepoint and fail-closed-resolution patterns that extend this workflow-side gate to the `data→main` promotion boundary. diff --git a/docs/solutions/workflow-issues/autonomous-rollout-tracker-workflow-2026-06-17.md b/docs/solutions/workflow-issues/autonomous-rollout-tracker-workflow-2026-06-17.md index 7f3ae6610..6f4e4f63a 100644 --- a/docs/solutions/workflow-issues/autonomous-rollout-tracker-workflow-2026-06-17.md +++ b/docs/solutions/workflow-issues/autonomous-rollout-tracker-workflow-2026-06-17.md @@ -115,3 +115,13 @@ Bad split: - reusable tracker calls inherit every repository secret - tracker runs have no concurrency group - daily-digest announce steps run during custom tracker prompts + +## Related + +- [Safe workflow consolidation: trace every invariant and caller](workflow-consolidation-invariant-trace-2026-06-24.md) — + when collapsing two scheduled workflows into one, trace each safety invariant and remaining + caller explicitly; the concurrency-group and dispatch-entry-point patterns here are the + invariants that consolidation must preserve. +- [Safely deleting a superseded workflow or signal path](safe-superseded-workflow-removal-2026-06-24.md) — + the inverse lesson: before retiring a duplicate execution path, confirm no remaining callers, + full replacement coverage, shared dependencies preserved, and no dangling declarations. diff --git a/docs/solutions/workflow-issues/safe-superseded-workflow-removal-2026-06-24.md b/docs/solutions/workflow-issues/safe-superseded-workflow-removal-2026-06-24.md new file mode 100644 index 000000000..0b115f6aa --- /dev/null +++ b/docs/solutions/workflow-issues/safe-superseded-workflow-removal-2026-06-24.md @@ -0,0 +1,141 @@ +--- +title: Safely deleting a superseded workflow or signal path +date: 2026-06-24 +last_updated: 2026-06-24 +problem_type: best_practice +category: workflow-issues +component: development_workflow +module: github-actions-workflows +severity: medium +verified: 2026-06-24 +tags: + - github-actions + - workflow-removal + - signal-path + - shared-dependencies + - duplicate-paths +applies_when: + - removing a workflow file that has been superseded by a replacement + - a duplicate execution path (e.g. an old webhook poster running in parallel with its gateway replacement) is being retired + - a shared script or dependency is referenced by the workflow being removed and by other workflows + - secrets or step outputs tied to the deleted path may leave dangling declarations +--- + +# Safely deleting a superseded workflow or signal path + +## Context + +Removing a duplicate or superseded execution path — for example, an old webhook poster still +running in parallel with its gateway replacement, double-posting every event — is a correctness +fix. But the removal is only safe when four conditions are confirmed. Skipping any one of them +turns a safe wiring change into a capability regression or a dangling dependency. + +The most commonly missed check is the shared-dependency check: a script or helper referenced by +the workflow being removed may also be referenced by another workflow. Removing the workflow +without confirming this leaves the other workflow broken at runtime, not at merge time. + +## Guidance + +Removal is safe only when you confirm all four of the following. + +### 1. No remaining callers + +The removed workflow's only invoker was a job deleted in the same diff. Confirm with an explicit +grep before merging: + +```bash +grep -r 'uses:.*' .github/workflows/ +grep -r '' .github/workflows/ +``` + +An empty result is the proof. A non-empty result means the interface or script must survive for +its remaining caller — either kept in place or extracted into a shared location. + +### 2. The replacement fires on every condition the old path handled + +The replacement must cover every trigger condition the old path handled, so each event posts +exactly once. Map the old path's triggers (`schedule`, `workflow_dispatch`, `push`, etc.) against +the replacement's triggers and confirm coverage. A trigger present in the old path but absent +from the replacement leaves a gap. + +### 3. Shared dependencies survive for their other consumers + +A script, action, or helper referenced by the removed workflow may also be referenced by another +workflow. Preserving a shared script that another workflow references is a wiring change, not a +capability deletion — the script stays, only the caller is removed. + +This is the check most likely to be missed. The removed workflow's job list looks self-contained; +the shared dependency is invisible until the other workflow fails at runtime. + +### 4. Secrets and outputs leave nothing dangling + +A step-level secret or output whose only consumer was the deleted job should be cleaned up in +the same diff. Confirm that no orphaned secret declarations or output references remain in the +surviving workflow after the deletion. + +### Separate correctness from tidiness + +A benign leftover — for example, job-level outputs whose only consumer was the deleted job — is +later-pass cleanup, not a blocker. Do not hold the removal PR for cosmetic tidiness. Ship the +correctness fix; schedule the cleanup separately. + +## Why This Matters + +The "shared dependencies must survive for other consumers" check is easy to miss because the +removed workflow looks self-contained from the inside. The dependency is only visible from the +other workflow's perspective. A removal that passes this check is a safe wiring change; one that +misses it is a capability regression that surfaces at runtime with no obvious connection to the +removal. + +The "replacement fires on every condition" check closes the double-posting problem without +creating a coverage gap. Without it, the removal trades a duplicate-post bug for a missed-post +bug. + +## When to Apply + +Apply this checklist when deleting any workflow file or retiring any duplicate signal or +execution path. + +## Examples + +### Four-point removal checklist + +Before merging the removal PR, verify each item: + +- [ ] **No remaining callers** — `grep -r 'uses:.*'` and + `grep -r ''` both return zero results, or the interface/script is preserved + for any remaining caller. +- [ ] **Replacement coverage** — every trigger condition in the removed workflow (`schedule`, + `workflow_dispatch`, event filters) is present in the replacement. +- [ ] **Shared dependencies preserved** — any script, action, or helper referenced by the + removed workflow and by another workflow is kept in place; only the caller is removed. +- [ ] **No dangling declarations** — step-level secrets and outputs whose only consumer was the + deleted job are cleaned up in the same diff, or explicitly deferred as later-pass cleanup. + +### Shared-dependency wiring change + +```yaml +# owner/example-repo — .github/workflows/announce.yaml (surviving workflow) +# The old poster workflow also called scripts/post-event.ts. +# Removing the poster workflow does NOT remove scripts/post-event.ts — +# announce.yaml still calls it. Only the caller is removed. +- name: Post event to gateway + run: node scripts/post-event.ts + env: + GATEWAY_SECRET: ${{ secrets.GATEWAY_SECRET }} +``` + +## See also + +- [Autonomous rollout tracker workflows](autonomous-rollout-tracker-workflow-2026-06-17.md) — + the inverse lesson: don't create duplicate update paths; a dedicated caller workflow owns + mutations so two agents don't race on the same state. +- [Agent and automation steps need their GitHub token wired explicitly](required-github-token-for-agent-steps-2026-06-22.md) — + a related case where a removal (of a token) silently broke a step; the same "confirm no + remaining consumers" discipline applies. +- [Loose-then-tight schema migration pattern](../best-practices/loose-then-tight-schema-migration-pattern-2026-05-05.md) — + removal-leads-replacement sequencing: confirm the replacement is live before retiring the + old path. +- [Inventory-driven doc drift cleanup pattern](../documentation-gaps/doc-drift-cleanup-pattern-2026-04-18.md) — + the file-level analog: retiring a superseded file requires the same "no remaining references" + confirmation before deletion. diff --git a/docs/solutions/workflow-issues/workflow-consolidation-invariant-trace-2026-06-24.md b/docs/solutions/workflow-issues/workflow-consolidation-invariant-trace-2026-06-24.md new file mode 100644 index 000000000..c7f4c1e24 --- /dev/null +++ b/docs/solutions/workflow-issues/workflow-consolidation-invariant-trace-2026-06-24.md @@ -0,0 +1,152 @@ +--- +title: 'Safe workflow consolidation: trace every invariant and caller' +date: 2026-06-24 +last_updated: 2026-06-24 +problem_type: best_practice +category: workflow-issues +component: development_workflow +module: github-actions-workflows +severity: medium +verified: 2026-06-24 +tags: + - github-actions + - workflow-consolidation + - concurrency + - invariants + - dispatch-ergonomics + - safety +applies_when: + - two scheduled workflows are being collapsed into one + - a retired workflow provided a workflow_dispatch entry point or concurrency group + - a shared workflow_call interface is consumed by another workflow + - notification copy in the surviving workflow may no longer accurately describe its broadened scope +--- + +# Safe workflow consolidation: trace every invariant and caller + +## Context + +Collapsing two scheduled workflows into one — for example, merging a daily oversight pass and a +separate autoheal pass — is a surface-area reduction on paper. In practice, the retired workflow +carried safety and ergonomics that lived nowhere else. If those are not explicitly traced into the +surviving workflow, they vanish silently: no CI failure, no test red, no lint error. + +The risk is asymmetric. A merge that looks clean can silently drop: + +- **Dispatch ergonomics** — a `workflow_dispatch` manual entry point and its defaulted prompt + that operators relied on for ad-hoc runs. +- **Notification accuracy** — copy that described the retired workflow's narrower scope now + undersells the surviving workflow's broadened responsibility. +- **Concurrency isolation** — a dedicated `concurrency.group` in the retired workflow fenced its + heal pass from interactive runs on the same branch. Without it, scheduled runs keyed on + `github.run_id` no longer fence the heal pass from a concurrent manual dispatch. +- **Remaining callers of a shared interface** — a `workflow_call` interface must stay alive if + another workflow still consumes it (e.g., an `apply-branding` workflow that calls the + consolidated workflow). Retiring the interface breaks the caller silently. + +None of these failures surface in CI. They surface in production, under operator load, or during +an incident when the manual dispatch entry point is gone. + +## Guidance + +When consolidating workflows, surface area should go **down**, and you prove safety survives by +**tracing** — not assuming. + +### 1. Carry safety invariants verbatim, not by re-derivation + +Each safety invariant in the retired workflow — serial mutations, trusted-author gate, scope cap, +never-touch-main, never-weaken-guards — must be carried into the surviving workflow verbatim. +Re-deriving them from memory introduces drift. Copy the exact condition, then annotate why it +exists. + +### 2. Enumerate every remaining caller of any shared interface + +Before retiring a `workflow_call` interface, grep for every caller across the repository: + +```bash +grep -r 'uses:.*' .github/workflows/ +``` + +A caller that still references the interface will break silently at runtime, not at merge time. +If a caller exists, the interface must survive — either in the consolidated workflow or as a +thin shim that delegates to it. + +### 3. Preserve the dispatch entry point and its defaulted prompt + +A `workflow_dispatch` trigger with a defaulted prompt is an operator affordance. If the retired +workflow had one, the surviving workflow must carry it forward. Operators who relied on the +manual entry point for ad-hoc runs will find it gone with no error — just a missing option in +the Actions UI. + +### 4. Update notification copy to reflect broadened scope + +When a workflow absorbs a second workflow's responsibility, its notification copy (run name, +step summaries, issue comments) must be updated to describe the combined scope. Copy that +accurately described the retired workflow's narrower scope now undersells the surviving +workflow's responsibility and misleads operators reading the run log. + +### 5. Preserve or re-establish the concurrency group + +A dedicated `concurrency.group` in the retired workflow fenced its execution from concurrent +runs on the same branch. Without it, scheduled runs keyed on `github.run_id` are unique per +run — they do not fence the heal pass from a concurrent manual dispatch. If the retired +workflow had a named concurrency group, carry it forward explicitly: + +```yaml +concurrency: + group: -${{ github.ref }} + cancel-in-progress: false +``` + +## Why This Matters + +A consolidation that drops dispatch ergonomics, notification accuracy, or concurrency isolation +produces a workflow that is harder to operate and easier to race — without any signal that +something was lost. The invariant-trace discipline makes the loss visible at review time, not +at incident time. + +The "remaining callers" check is the easiest to miss: a `workflow_call` interface that looks +unused from inside the consolidated workflow may be the only entry point for a downstream +automation. An empty grep result is the proof; anything else is a blocker. + +## When to Apply + +Apply this checklist any time you merge or retire a scheduled workflow, or collapse two +automation passes into one. + +## Examples + +### Consolidation checklist + +Before merging the retirement PR, verify each item: + +- [ ] **Invariants verbatim** — each safety condition from the retired workflow is present in + the surviving workflow, copied not re-derived. +- [ ] **Remaining callers** — `grep -r 'uses:.*'` returns zero results, or + the interface is preserved for any caller that remains. +- [ ] **Dispatch entry point** — if the retired workflow had `workflow_dispatch`, the surviving + workflow carries it forward with the same (or updated) defaulted prompt. +- [ ] **Notification copy** — run name, step summaries, and issue comments accurately describe + the surviving workflow's combined scope. +- [ ] **Concurrency group** — if the retired workflow had a named `concurrency.group`, it is + carried forward or a new one is established for the consolidated pass. + +--- + +## Operational pitfall: the cheapest class of CI red + +A formatter-only lint failure — for example, Prettier wanting single quotes where double quotes +were used — is the cheapest CI red. Run the formatter or `eslint --fix` locally before pushing +so a mechanical reflow does not burn a review round. This is a sibling lesson observed in the +same class of consolidation changes, not a co-equal concern. + +## See also + +- [Autonomous rollout tracker workflows](autonomous-rollout-tracker-workflow-2026-06-17.md) — + the concurrency-group and dedicated-caller patterns that consolidation must preserve. +- [Agent and automation steps need their GitHub token wired explicitly](required-github-token-for-agent-steps-2026-06-22.md) — + a related case where a well-intentioned security change silently broke a step; the same + "trace the invariant" discipline applies. +- [A second credential gives rotation isolation, not permission isolation](../best-practices/credential-mint-time-permission-scoping-2026-06-22.md) — + verify security invariants against live source rather than prose; the same discipline applied + to consolidation safety.