Add comprehensive documentation for stage definitions, artifact binding, error handling, and parallel execution#1086

Merged: xbrianh merged 3 commits into main from docs-parallel-artifact-clarifications on Jun 3, 2026

@xbrianh xbrianh commented Jun 3, 2026

Summary

Documents and clarifies previously missing or ambiguous behaviors for parallel pipeline execution, artifact binding, and stage definitions:

  • Parallel group execution: Added three-phase execution model (fan-out, concurrent, fan-in) and clarified failure handling—if any child bails, siblings continue until group completion and subsequent stages are skipped.
  • State isolation: Documented that each parallel child gets its own state directory, subprocess, and artifact bindings; parent state is not modified until fan-in completes.
  • Resume targeting: Clarified that gremlins resume accepts both group names and individual child names, with different semantics (re-spawn all children vs. resume only that child).
  • Base ref propagation: Documented automatic --base-ref propagation to child processes, ensuring consistent branching.
  • Stage definitions: Added worked example showing YAML anchor (&review-base) pattern for reusing stage configurations within a pipeline.
  • Artifact binding: Added examples and formal documentation for session:// scheme and how artifacts flow between stages via in: and out: maps.
  • Skip semantics: Clarified that skip creates a new sibling attempt with a fresh ID, avoiding confusing "Abandon" phrasing.

Also simplified the gremlins clean argument parser by removing verbose help text.

xbrianh and others added 2 commits June 3, 2026 00:02
…ng, error handling, and parallel execution

This commit addresses the gap between the README and internal documentation by adding:

1. **Stage definitions** subsection — explains how to define and reuse stage patterns
   with stage-definitions: while maintaining the call-site-owns-IO principle for
   artifact bindings

2. **Artifact binding** subsection — documents in:/out: artifact declaration,
   artifact schemes (git://, session://), dotted-key resolution, and how definitions
   and call-sites compose

3. **Error handling and recovery** section — covers:
   - Bail semantics and common bail classes (reviewer_requested_changes, security, secrets, other)
   - Three recovery options: resume (re-spawn from bail), ack (assert externally fixed),
     skip (abandon and retry)
   - Parallel group failure handling and resume targeting
   - Boss-chain recovery workflows when children bail

4. **Enhanced Parallel groups** documentation — clarifies:
   - Three-phase execution: fan-out, concurrent run, fan-in
   - State isolation per-child and failure handling behavior
   - Resume targeting for groups and individual children
   - Automatic base_ref propagation to child processes

5. **Launch flags** clarification — documents base_ref propagation in parallel pipelines

These additions reflect recent commits (f6d0c8f, 41e6075, 5aaf066) that improved
parallel stage handling, bail semantics, and base_ref propagation, ensuring the
README documents current behavior.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Line 322-323: Clarified that subsequent stages are skipped when a group bails; operator uses CLI commands (resume/ack), not stages
- Line 392: Added YAML anchor (&review-base) to make merge-key example syntactically valid
- Lines 425, 430: Changed artifact binding example to use session://findings scheme consistently, eliminating confusing <name> shorthand syntax
- Lines 502-505: Clarified skip semantics to avoid contradictory "Abandon" phrasing; lead with accurate description of creating sibling attempt with fresh ID
Owner Author

@xbrianh xbrianh left a comment

Three factual errors in the new documentation sections — all fixable without ambiguity. The parallel execution behavior section also understates the effect of cancel_on_bail.

Comment thread README.md Outdated

**Artifact schemes:**
- `git://ref:path` — Git artifact: a file at `path` in the commit `ref` (e.g., `git://HEAD:report.json`)
- `session://key` — Session artifact: a value stored in the gremlin's state (used for intermediate data)
Owner Author

session:// is not a valid URI scheme. Uri.parse() rejects it — _BUILTIN_SCHEMES is {"file", "git", "gh"} (uri.py:7). Using session://findings in a pipeline YAML would raise ValueError: unknown scheme 'session' at runtime.

Session artifacts use the file://session/<name> form, which FileArtifactResolver handles (schemes.py:23-24). The example at lines 425 and 430 should use file://session/findings, and this bullet should read:

- `file://session/<name>` — file written to the gremlin's artifact directory during the run
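
For readers following along, the rejection described above can be sketched roughly as below. This is an illustration only, not the code in uri.py: the `parse_scheme` helper is hypothetical, and only the `_BUILTIN_SCHEMES` set and the `ValueError` wording are taken from this comment.

```python
# Hypothetical sketch; the real Uri.parse() in uri.py may be structured differently.
_BUILTIN_SCHEMES = {"file", "git", "gh"}  # the set quoted in the review comment


def parse_scheme(uri: str) -> str:
    """Return the scheme of `uri`, raising ValueError for unknown schemes."""
    scheme, sep, _rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI: {uri!r}")
    if scheme not in _BUILTIN_SCHEMES:
        raise ValueError(f"unknown scheme {scheme!r}")
    return scheme
```

Under these assumptions, `parse_scheme("file://session/findings")` is accepted, while `parse_scheme("session://findings")` raises `ValueError`.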

Owner Author

Fixed: changed to file://session/findings and updated the artifact schemes documentation to reflect valid URI formats.

Comment thread README.md Outdated
```

**Artifact schemes:**
- `git://ref:path` — Git artifact: a file at `path` in the commit `ref` (e.g., `git://HEAD:report.json`)
Owner Author

git://ref:path doesn't exist. GitResolver.read() handles three path prefixes (schemes.py:58-74):

  • git://range/<base>..<head> — commit log between two refs
  • git://ref/<name> — returns the ref name string
  • git://commit/<sha> — returns the SHA string

There is no file-at-ref support. git://HEAD:report.json would raise ValueError: unrecognised git URI path. The bullet and example should be corrected to reflect the actual supported forms.
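
A small sketch of the three-way dispatch may help when correcting the examples. It assumes nothing beyond the three prefixes listed above; `classify_git_uri` is a hypothetical helper and does not reproduce `GitResolver.read()`.

```python
# Hypothetical sketch of the prefix dispatch; not the actual GitResolver.read().
def classify_git_uri(uri: str) -> tuple[str, str]:
    """Split a git:// URI into (kind, argument) for the three supported forms."""
    prefix = "git://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a git URI: {uri!r}")
    path = uri[len(prefix):]
    for kind in ("range", "ref", "commit"):
        head = kind + "/"
        # Require a non-empty argument after the prefix.
        if path.startswith(head) and len(path) > len(head):
            return kind, path[len(head):]
    raise ValueError(f"unrecognised git URI path: {path!r}")
```

With this sketch, `git://range/main..HEAD` classifies as a range, and `git://HEAD:report.json` falls through to the `ValueError` branch.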

Owner Author

Fixed: corrected to document the actual supported paths: git://ref/<ref>, git://commit/<sha>, and git://range/<base>..<head>.

Comment thread README.md Outdated
- The group halts after all currently-running children finish (not mid-run)
- The bail reason is attributed to the child stage name
- `gremlins resume <id>` re-spawns all children that haven't landed
- `gremlins resume <id> <child-name>` resumes only that child
Owner Author

gremlins resume <id> <child-name> is not a real CLI form. resume_main() (cli/resume.py:15) only accepts a single positional gremlin_id. There is no second positional argument for a child name.

To resume a specific child, you use its actual gremlin ID, which is constructed as <parent-id>--<group-name>--<child-key> (visible in fleet view). The line should be corrected to describe this, e.g.:

- `gremlins resume <parent-id>--<group>--<child-name>` resumes only that child (use its full child gremlin ID from fleet view)
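
To make the ID shape concrete, here is a minimal sketch that builds a child gremlin ID and the matching resume invocation. The helper names are hypothetical; only the `--` separator, the field order, and the single positional argument to `gremlins resume` come from this comment. It also assumes none of the parts contain `--` themselves.

```python
# Hypothetical helpers; separator and field order are taken from the comment above.
SEP = "--"


def child_gremlin_id(parent_id: str, group: str, child_key: str) -> str:
    """Join parent ID, group name, and child key into a child gremlin ID."""
    return SEP.join((parent_id, group, child_key))


def resume_argv(parent_id: str, group: str, child_key: str) -> list[str]:
    """Build the argv for resuming one child by its full gremlin ID."""
    return ["gremlins", "resume", child_gremlin_id(parent_id, group, child_key)]
```

For an illustrative parent ID `abc123`, group `reviews`, and child `review-detail`, this yields `gremlins resume abc123--reviews--review-detail`.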

Owner Author

Fixed: corrected to show the proper CLI form using full child gremlin ID: gremlins resume <parent-id>--<group-name>--<child-key>

Comment thread README.md Outdated
3. **Fan-in** — when all children finish or any child bails, the group halts

If any child fails (raises `Bail`), the pipeline halts after the group finishes —
siblings are not cancelled mid-run. Subsequent stages are skipped; the operator
Owner Author

"Siblings are not cancelled mid-run" is only true for the default case (cancel_on_bail: false). The ParallelStage constructor accepts cancel_on_bail: true which calls _cancel_siblings() when any child bails, cancelling outstanding tasks immediately.

Similarly, bail_policy: "all" changes when a group bails — the group only raises Bail when all children bail, not just any one. Both options are parsed in with_dict() and should be mentioned here.
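
The two options can be illustrated with a small asyncio sketch. Everything here is an assumption made for illustration: the real `ParallelStage` runs subprocesses, and `run_group`, `child`, the local `Bail` class, and the outcome labels are hypothetical. How the two options interact (for example, whether a cancelled child counts toward `bail_policy: "all"`) is also assumed, not confirmed.

```python
import asyncio


class Bail(Exception):
    """Local stand-in for the pipeline's Bail exception (assumed shape)."""


async def child(delay, fail=False):
    """Hypothetical child stage: sleeps, then optionally bails."""
    await asyncio.sleep(delay)
    if fail:
        raise Bail("child bailed")


async def run_group(children, *, cancel_on_bail=False, bail_policy="any"):
    """Run {name: coroutine} concurrently.

    Returns (outcome, group_bailed), where outcome maps each child name to
    "ok", "bail", or "cancelled".
    """
    tasks = {name: asyncio.create_task(coro) for name, coro in children.items()}
    outcome = {}

    async def watch(name, task):
        try:
            await task
            outcome[name] = "ok"
        except Bail:
            outcome[name] = "bail"
            if cancel_on_bail:
                # Assumed analogue of _cancel_siblings(): stop unfinished siblings.
                for other in tasks.values():
                    if other is not task and not other.done():
                        other.cancel()
        except asyncio.CancelledError:
            outcome[name] = "cancelled"

    await asyncio.gather(*(watch(n, t) for n, t in tasks.items()))
    bailed = [n for n, state in outcome.items() if state == "bail"]
    if bail_policy == "all":
        # Assumption: cancelled children do not count as bailed.
        group_bailed = bool(tasks) and len(bailed) == len(tasks)
    else:  # "any": one bailing child is enough
        group_bailed = bool(bailed)
    return outcome, group_bailed
```

With the defaults, a bailing child leaves its sibling to finish and the group still bails; with `cancel_on_bail=True` the sibling is cancelled at once; with `bail_policy="all"` a single bail does not fail the group.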

Copilot AI left a comment

Pull request overview

This PR updates the project’s README to document pipeline behaviors (parallel execution, stage definitions, artifact binding, and failure recovery) so operators can better understand how Gremlins runs and how to recover from bails.

Changes:

  • Expands README documentation for parallel groups (execution phases, isolation, resume targeting, base-ref propagation).
  • Adds new README sections describing stage-definitions reuse patterns and artifact in:/out: binding.
  • Adds a new README section for error handling/recovery commands (resume/ack/skip) and parallel/boss recovery flows.


Comment thread README.md
Comment on lines +316 to +320
**Execution and failure:** The parallel group executes in three phases:
1. **Fan-out** — each child stage starts independently as a subprocess
2. **Concurrent execution** — all children run simultaneously (up to `max_concurrent`)
3. **Fan-in** — when all children finish or any child bails, the group halts
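
The three phases quoted above can be sketched with asyncio, using coroutines as stand-ins for the child subprocesses. This is a minimal illustration under that assumption: `run_parallel` is a hypothetical name, and only the phase names and the `max_concurrent` bound come from the README text.

```python
import asyncio


async def run_parallel(children, max_concurrent):
    """Run zero-argument coroutine functions, at most `max_concurrent` at a time.

    Returns {name: result}; a child that raises is recorded as its exception.
    """
    gate = asyncio.Semaphore(max_concurrent)

    async def run_one(name, fn):
        async with gate:  # concurrent phase, bounded by max_concurrent
            try:
                return name, await fn()
            except Exception as exc:  # keep the failure for fan-in
                return name, exc

    # Fan-out: schedule one task per child.
    tasks = [asyncio.create_task(run_one(n, fn)) for n, fn in children.items()]
    # Fan-in: wait until every child has finished, then collect results.
    return dict(await asyncio.gather(*tasks))
```

Here every child is allowed to finish before fan-in, which matches the default behaviour described in this thread.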

Owner Author

Fixed: clarified that siblings continue running by default, only cancelled immediately with cancel_on_bail: true. Also documented bail_policy: all option.

Comment thread README.md Outdated
Comment on lines +325 to +328
**State isolation:** Each child gets its own state directory and subprocess.
Client overrides, worktree paths, and artifact bindings are isolated per-child.
Children run in parallel without blocking each other, and parent-stage state is
not modified by child progress until fan-in completes.
Owner Author

Fixed: corrected state isolation description to clarify that parent state.json IS updated during concurrent phase (active_children snapshot); only artifact binding copy is deferred until fan-in.

Comment thread README.md Outdated
Comment on lines +330 to +333
**Resume targeting:** `gremlins resume` accepts both the group name (`reviews`)
and individual child names (`review-detail`). Resuming a group re-spawns all
children that haven't landed; resuming a child resumes only that child from its
last recorded stage.
Owner Author

Fixed: corrected resume targeting to use the proper full child gremlin ID form visible in fleet view, not an abbreviated syntax.

Comment thread README.md Outdated
Comment on lines +408 to +411
Definitions provide `type`, `options`, and `prompt`. Call-sites own the
`name:`, `in:`, and `out:` keys. When a definition is used, the preprocessor
merges the call-site's keys onto the definition's dict, letting you reuse
common configurations while varying per-call bindings.
Owner Author

Fixed: clarified that call-sites override prompt/options via YAML anchors (as in the example) or template placeholders in multi-stage recipes. Documented that in: can be declared in definitions and merges with call-site values; out: is forbidden.
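
The merge rules in this reply can be sketched as a small function. It is an assumption-laden illustration rather than the preprocessor's code: `merge_stage` is a hypothetical name, and letting call-site `in:` entries win over definition entries on key collisions is assumed, since the thread only says the two merge.

```python
# Hypothetical sketch of definition/call-site composition; not the real preprocessor.
def merge_stage(definition: dict, call_site: dict) -> dict:
    """Overlay call-site keys on a stage definition.

    - `out:` in a definition is rejected (call-sites own outputs).
    - `in:` maps are merged; call-site entries win on collision (assumed).
    - Every other call-site key replaces the definition's value.
    """
    if "out" in definition:
        raise ValueError("stage definitions may not declare 'out:'")
    merged = {**definition, **call_site}
    if "in" in definition or "in" in call_site:
        merged["in"] = {**definition.get("in", {}), **call_site.get("in", {})}
    return merged
```

For example, a definition carrying `type` and a default `prompt` can be reused by two call-sites that each supply their own `name`, `prompt`, and `out:` bindings.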

Comment thread README.md
Comment on lines +422 to +426
options:
  cmds: ["python scan.py > report.json"]
out:
  findings: session://findings

Owner Author

Fixed: replaced invalid session:// scheme with file://session/, corrected exec command to write to `` directory, and documented proper artifact URI formats.

Comment thread README.md
Comment on lines +502 to +505
**`gremlins skip <id>`** — Create a new sibling attempt with the same parameters
and a fresh ID, leaving the failed gremlin in place. Use this for transient
failures (timeouts, CI hangs) that won't self-resolve. Both attempts are visible
in the fleet; the new attempt begins from the start.
Comment thread README.md Outdated
Comment on lines +509 to +513
When a child in a parallel group bails:
- The group halts after all currently-running children finish (not mid-run)
- The bail reason is attributed to the child stage name
- `gremlins resume <id>` re-spawns all children that haven't landed
- `gremlins resume <id> <child-name>` resumes only that child
Comment thread README.md
Comment on lines +515 to +516
If the cause was a transient failure affecting multiple children, `skip` the entire
group and re-launch the pipeline to restart all children.
Comment thread README.md
Comment on lines +400 to +404
- name: review-detail
  <<: *review-base
  prompt: [gremlins:code_style.md, detail_review.md]
- name: review-security
  <<: *review-base
Comment thread README.md
Comment on lines +335 to +337
**Base ref propagation:** The `--base-ref` flag is automatically propagated from
the parent to all child processes, ensuring consistent branching across the group.
Child worktrees are derived from the parent's base_ref as recorded in state.
… bail semantics

- Correct parallel group fan-in semantics: siblings continue until completion, not immediate halt
- Add cancel_on_bail and bail_policy options documentation
- Fix state isolation description: parent state.json IS updated during concurrent phase
- Correct resume targeting: use full child gremlin ID, not abbreviated form
- Fix stage-definitions documentation: clarify prompt/options override via anchors, in/out behavior
- Replace invalid session:// scheme with correct file://session/ URIs
- Fix artifact URI schemes: document git://ref/, git://commit/, git://range/ formats
- Correct artifact binding syntax: {var} not {{var}}, in values are keys not URIs
- Fix bail semantics: describe as convention via BAIL markers, not runtime-enforced codes
- Clarify bail persistence: bail_<attempt>.json files, not state.json
- Update parallel group failure handling: correct CLI forms and add cancel_on_bail context

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@xbrianh xbrianh merged commit 980c99b into main Jun 3, 2026
1 check passed
@xbrianh xbrianh deleted the docs-parallel-artifact-clarifications branch June 3, 2026 06:34