Add comprehensive documentation for stage definitions, artifact binding, error handling, and parallel execution #1086
…ng, error handling, and parallel execution
This commit addresses the gap between the README and internal documentation by adding:
1. **Stage definitions** subsection — explains how to define and reuse stage patterns
with stage-definitions: while maintaining the call-site-owns-IO principle for
artifact bindings
2. **Artifact binding** subsection — documents in:/out: artifact declaration,
artifact schemes (git://, session://), dotted-key resolution, and how definitions
and call-sites compose
3. **Error handling and recovery** section — covers:
- Bail semantics and common bail classes (reviewer_requested_changes, security, secrets, other)
- Three recovery options: resume (re-spawn from bail), ack (assert externally fixed),
skip (abandon and retry)
- Parallel group failure handling and resume targeting
- Boss-chain recovery workflows when children bail
4. **Enhanced Parallel groups** documentation — clarifies:
- Three-phase execution: fan-out, concurrent run, fan-in
- State isolation per-child and failure handling behavior
- Resume targeting for groups and individual children
- Automatic base_ref propagation to child processes
5. **Launch flags** clarification — documents base_ref propagation in parallel pipelines
These additions reflect recent commits (f6d0c8f, 41e6075, 5aaf066) that improved
parallel stage handling, bail semantics, and base_ref propagation, ensuring the
README documents current behavior.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Lines 322-323: Clarified that subsequent stages are skipped when a group bails; the operator uses CLI commands (resume/ack), not stages
- Line 392: Added YAML anchor (&review-base) to make the merge-key example syntactically valid
- Lines 425, 430: Changed the artifact binding example to use the session://findings scheme consistently, eliminating the confusing <name> shorthand syntax
- Lines 502-505: Clarified skip semantics to avoid contradictory "Abandon" phrasing; lead with an accurate description of creating a sibling attempt with a fresh ID
xbrianh left a comment
Three factual errors in the new documentation sections — all fixable without ambiguity. The parallel execution behavior section also understates the effect of cancel_on_bail.
| **Artifact schemes:** | ||
| - `git://ref:path` — Git artifact: a file at `path` in the commit `ref` (e.g., `git://HEAD:report.json`) | ||
| - `session://key` — Session artifact: a value stored in the gremlin's state (used for intermediate data) |
session:// is not a valid URI scheme. Uri.parse() rejects it — _BUILTIN_SCHEMES is {"file", "git", "gh"} (uri.py:7). Using session://findings in a pipeline YAML would raise ValueError: unknown scheme 'session' at runtime.
Session artifacts use the file://session/<name> form, which FileArtifactResolver handles (schemes.py:23-24). The example at lines 425 and 430 should use file://session/findings, and this bullet should read:
- `file://session/<name>` — file written to the gremlin's artifact directory during the run
Fixed: changed to file://session/findings and updated the artifact schemes documentation to reflect valid URI formats.
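For readers following the thread, the scheme check described above can be sketched roughly as below. This is a minimal illustration, not the project's `Uri.parse()`: the function name `check_scheme` is hypothetical, and only the scheme set and the error message are taken from the review comment.

```python
from urllib.parse import urlsplit

# Assumption: mirrors the set quoted in the review (uri.py:7).
BUILTIN_SCHEMES = {"file", "git", "gh"}


def check_scheme(uri: str) -> str:
    """Return the URI scheme, or raise ValueError if it is not built in."""
    scheme = urlsplit(uri).scheme
    if scheme not in BUILTIN_SCHEMES:
        # Same wording as the runtime error quoted in the review.
        raise ValueError(f"unknown scheme {scheme!r}")
    return scheme
```

Under these assumptions, `check_scheme("file://session/findings")` passes, while `check_scheme("session://findings")` raises `ValueError`.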
| ``` | ||
| **Artifact schemes:** | ||
| - `git://ref:path` — Git artifact: a file at `path` in the commit `ref` (e.g., `git://HEAD:report.json`) |
git://ref:path doesn't exist. GitResolver.read() handles three path prefixes (schemes.py:58-74):
- `git://range/<base>..<head>`: commit log between two refs
- `git://ref/<name>`: returns the ref name string
- `git://commit/<sha>`: returns the SHA string
There is no file-at-ref support. git://HEAD:report.json would raise ValueError: unrecognised git URI path. The bullet and example should be corrected to reflect the actual supported forms.
Fixed: corrected to document the actual supported paths: git://ref/<ref>, git://commit/<sha>, and git://range/<base>..<head>.
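The three supported path forms can be illustrated with a small classifier. This is a sketch only: `classify_git_uri` is a hypothetical helper rather than `GitResolver.read()`, and it classifies the URI without touching a repository.

```python
from typing import Tuple


def classify_git_uri(uri: str) -> Tuple[str, ...]:
    """Split a git:// artifact URI into (kind, *parts) for the three known forms."""
    prefix = "git://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a git URI: {uri}")
    path = uri[len(prefix):]
    kind, _, rest = path.partition("/")
    if kind == "range" and ".." in rest:
        base, _, head = rest.partition("..")
        return ("range", base, head)
    if kind in ("ref", "commit") and rest:
        return (kind, rest)
    # Anything else, including the file-at-ref form, is rejected.
    raise ValueError(f"unrecognised git URI path: {path}")
```

With this sketch, `git://range/main..HEAD` yields `("range", "main", "HEAD")`, and `git://HEAD:report.json` raises `ValueError`, matching the behavior the review describes.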
| - The group halts after all currently-running children finish (not mid-run) | ||
| - The bail reason is attributed to the child stage name | ||
| - `gremlins resume <id>` re-spawns all children that haven't landed | ||
| - `gremlins resume <id> <child-name>` resumes only that child |
gremlins resume <id> <child-name> is not a real CLI form. resume_main() (cli/resume.py:15) only accepts a single positional gremlin_id. There is no second positional argument for a child name.
To resume a specific child, you use its actual gremlin ID, which is constructed as <parent-id>--<group-name>--<child-key> (visible in fleet view). The line should be corrected to describe this, e.g.:
- `gremlins resume <parent-id>--<group>--<child-name>` resumes only that child (use its full child gremlin ID from fleet view)
Fixed: corrected to show the proper CLI form using full child gremlin ID: gremlins resume <parent-id>--<group-name>--<child-key>
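As a quick illustration of the ID layout, the child ID can be assembled as below. The helper name `child_gremlin_id` and the parent ID `abc123` are hypothetical, and treating the child key as equal to the child stage name is an assumption; only the `<parent-id>--<group-name>--<child-key>` shape comes from the review.

```python
def child_gremlin_id(parent_id: str, group: str, child_key: str) -> str:
    """Join the three parts with the '--' separator described in the review."""
    return "--".join((parent_id, group, child_key))


def resume_command(parent_id: str, group: str, child_key: str) -> str:
    """Build the single-positional resume command for one child."""
    return f"gremlins resume {child_gremlin_id(parent_id, group, child_key)}"
```

For a parent `abc123` with group `reviews` and child `review-detail`, this gives `gremlins resume abc123--reviews--review-detail`.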
| 3. **Fan-in** — when all children finish or any child bails, the group halts | ||
| If any child fails (raises `Bail`), the pipeline halts after the group finishes — | ||
| siblings are not cancelled mid-run. Subsequent stages are skipped; the operator |
"Siblings are not cancelled mid-run" is only true for the default case (cancel_on_bail: false). The ParallelStage constructor accepts cancel_on_bail: true which calls _cancel_siblings() when any child bails, cancelling outstanding tasks immediately.
Similarly, bail_policy: "all" changes when a group bails — the group only raises Bail when all children bail, not just any one. Both options are parsed in with_dict() and should be mentioned here.
Pull request overview
This PR updates the project’s README to document pipeline behaviors (parallel execution, stage definitions, artifact binding, and failure recovery) so operators can better understand how Gremlins runs and how to recover from bails.
Changes:
- Expands README documentation for parallel groups (execution phases, isolation, resume targeting, base-ref propagation).
- Adds new README sections describing `stage-definitions` reuse patterns and artifact `in:`/`out:` binding.
- Adds a new README section for error handling/recovery commands (`resume`/`ack`/`skip`) and parallel/boss recovery flows.
| **Execution and failure:** The parallel group executes in three phases: | ||
| 1. **Fan-out** — each child stage starts independently as a subprocess | ||
| 2. **Concurrent execution** — all children run simultaneously (up to `max_concurrent`) | ||
| 3. **Fan-in** — when all children finish or any child bails, the group halts | ||
Fixed: clarified that siblings continue running by default, only cancelled immediately with cancel_on_bail: true. Also documented bail_policy: all option.
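The two options discussed in this thread can be modelled with a small asyncio sketch. It is not the `ParallelStage` implementation: `run_group`, the local `Bail` stand-in, and the default policy name `"any"` are assumptions made for illustration, and how the real code combines the two options is not covered by the thread.

```python
import asyncio


class Bail(Exception):
    """Stand-in for the pipeline's Bail exception (assumption)."""


async def run_group(children, cancel_on_bail=False, bail_policy="any"):
    """Run child coroutine functions concurrently and summarise the outcome.

    children maps a child name to a zero-argument coroutine function.
    """
    names = {asyncio.create_task(fn()): name for name, fn in children.items()}
    pending = set(names)
    bailed, cancelled = [], []
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.cancelled():
                cancelled.append(names[task])
            elif isinstance(task.exception(), Bail):
                bailed.append(names[task])
                if cancel_on_bail:
                    # Cancel outstanding siblings immediately.
                    for other in pending:
                        other.cancel()
    if bail_policy == "all":
        group_bails = len(bailed) == len(children)
    else:
        group_bails = bool(bailed)
    return {
        "bailed": sorted(bailed),
        "cancelled": sorted(cancelled),
        "group_bails": group_bails,
    }
```

In this model, the default settings let siblings run to completion before the group reports a bail, `cancel_on_bail=True` cancels siblings still pending when a child bails, and `bail_policy="all"` reports a group bail only when every child bailed.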
| **State isolation:** Each child gets its own state directory and subprocess. | ||
| Client overrides, worktree paths, and artifact bindings are isolated per-child. | ||
| Children run in parallel without blocking each other, and parent-stage state is | ||
| not modified by child progress until fan-in completes. |
Fixed: corrected state isolation description to clarify that parent state.json IS updated during concurrent phase (active_children snapshot); only artifact binding copy is deferred until fan-in.
| **Resume targeting:** `gremlins resume` accepts both the group name (`reviews`) | ||
| and individual child names (`review-detail`). Resuming a group re-spawns all | ||
| children that haven't landed; resuming a child resumes only that child from its | ||
| last recorded stage. |
Fixed: corrected resume targeting to use the proper full child gremlin ID form visible in fleet view, not an abbreviated syntax.
| Definitions provide `type`, `options`, and `prompt`. Call-sites own the | ||
| `name:`, `in:`, and `out:` keys. When a definition is used, the preprocessor | ||
| merges the call-site's keys onto the definition's dict, letting you reuse | ||
| common configurations while varying per-call bindings. |
Fixed: clarified that call-sites override prompt/options via YAML anchors (as in the example) or template placeholders in multi-stage recipes. Documented that in: can be declared in definitions and merges with call-site values; out: is forbidden.
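The merge rules described here can be sketched as a plain dict merge. This is an illustration under assumptions, not the preprocessor's code: `apply_definition` is a hypothetical name, and letting the call-site win when both sides declare the same `in:` key is a guess, since the thread only says the two merge.

```python
def apply_definition(definition: dict, call_site: dict) -> dict:
    """Merge call-site keys onto a stage definition.

    Definitions may declare `in` but never `out`; call-sites own `name`,
    `in`, and `out`.
    """
    if "out" in definition:
        raise ValueError("stage definitions must not declare 'out'")
    merged = {**definition, **call_site}
    if "in" in definition or "in" in call_site:
        # Assumption: call-site bindings override definition bindings per key.
        merged["in"] = {**definition.get("in", {}), **call_site.get("in", {})}
    return merged
```

A call-site that adds its own `in:` entry keeps the definition's entries as well, while a definition carrying `out:` is rejected before any merge happens.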
| options: | ||
| cmds: ["python scan.py > report.json"] | ||
| out: | ||
| findings: session://findings | ||
Fixed: replaced invalid session:// scheme with file://session/, corrected exec command to write to `` directory, and documented proper artifact URI formats.
| **`gremlins skip <id>`** — Create a new sibling attempt with the same parameters | ||
| and a fresh ID, leaving the failed gremlin in place. Use this for transient | ||
| failures (timeouts, CI hangs) that won't self-resolve. Both attempts are visible | ||
| in the fleet; the new attempt begins from the start. |
| When a child in a parallel group bails: | ||
| - The group halts after all currently-running children finish (not mid-run) | ||
| - The bail reason is attributed to the child stage name | ||
| - `gremlins resume <id>` re-spawns all children that haven't landed | ||
| - `gremlins resume <id> <child-name>` resumes only that child |
| If the cause was a transient failure affecting multiple children, `skip` the entire | ||
| group and re-launch the pipeline to restart all children. |
| - name: review-detail | ||
| <<: *review-base | ||
| prompt: [gremlins:code_style.md, detail_review.md] | ||
| - name: review-security | ||
| <<: *review-base |
| **Base ref propagation:** The `--base-ref` flag is automatically propagated from | ||
| the parent to all child processes, ensuring consistent branching across the group. | ||
| Child worktrees are derived from the parent's base_ref as recorded in state. |
… bail semantics
- Correct parallel group fan-in semantics: siblings continue until completion, not immediate halt
- Add cancel_on_bail and bail_policy options documentation
- Fix state isolation description: parent state.json IS updated during concurrent phase
- Correct resume targeting: use full child gremlin ID, not abbreviated form
- Fix stage-definitions documentation: clarify prompt/options override via anchors, in/out behavior
- Replace invalid session:// scheme with correct file://session/ URIs
- Fix artifact URI schemes: document git://ref/, git://commit/, git://range/ formats
- Correct artifact binding syntax: {var} not {{var}}, in values are keys not URIs
- Fix bail semantics: describe as convention via BAIL markers, not runtime-enforced codes
- Clarify bail persistence: bail_<attempt>.json files, not state.json
- Update parallel group failure handling: correct CLI forms and add cancel_on_bail context
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Summary
Documents and clarifies previously missing or ambiguous behaviors for parallel pipeline execution, artifact binding, and stage definitions:
- `gremlins resume` accepts both group names and individual child names, with different semantics (re-spawn all children vs. resume only that child).
- `--base-ref` propagation to child processes, ensuring consistent branching.
- YAML anchor (`&review-base`) pattern for reusing stage configurations within a pipeline.
- `session://` scheme and how artifacts flow between stages via `in:` and `out:` maps.

Also simplified the `gremlins clean` argument parser to remove verbose help text.