xbrianh/gremlins

gremlins

Background coding-agent pipelines that plan, implement, review, and land work end-to-end. Given a goal or GitHub issue, a gremlin runs the full plan → implement → review-code → address-code cycle unattended, writing artifacts to the per-user state directory resolved by platformdirs.user_state_dir("gremlins") and optionally opening a pull request. A fleet manager tracks running, stalled, and finished gremlins and provides stop / land / close operations.

Status: brand-new and a bit janky. This is a fresh project, actively shaped by daily use. Expect rough edges — stream timeouts, the occasional merge conflict from parallel gremlins, a few stages still finding their final shape. Bug reports, ideas, and PRs are all welcome.


Using gremlins with a coding assistant

Paste the output of gremlins prompt-for-assistant into a fresh Claude Code session (or any compatible assistant) to configure it as a competent gremlins collaborator.

The workflow: you discuss the work with the assistant, it captures discrete units as GitHub issues or plan files, launches gremlins in the background to implement them, and lands each finished gremlin before starting dependent work. You stay at the strategic level — deciding what to build and in what order — while gremlins handle the implementation cycle unattended. The assistant maintains a queue of running, pending, and blocked work and surfaces it on request.


Using gremlins across multiple repos

When you run gremlins launch, the launcher captures the current working directory's repo root via git rev-parse --show-toplevel and stores it as project_root in the gremlin's state.json. That value pins the worktree base, child process cwd, and pipeline discovery for that gremlin's lifetime.

To work on a different repo: cd there, then gremlins launch. There is no --project-root flag; the cwd at launch time is the contract.

Fleet view (gremlins) shows gremlins from all repos by default. Pass --here to filter to the current repo's project_root.

Pipeline discovery walks from the launching cwd, so .gremlins/pipelines/ overrides in each repo apply to gremlins launched from that repo.

Queue caveat: there is one global queue and the runner's cwd is frozen at gremlins queue run --detach time. To queue work against a different repo, prefix the command with cd:

gremlins queue add "cd /path/to/other-repo && gremlins launch gh --plan '#42' --wait"
gremlins queue add "cd /path/to/other-repo && gremlins land <id>"

State isolation: each gremlin's state lives under its own directory (resolved via platformdirs.user_state_dir("gremlins")/<id>/), so two repos can have running gremlins simultaneously without interference.


Runtime CLI prerequisites

Dev install

uv venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows
uv pip install -e ".[dev]"

Make targets

| Target | What it runs |
| --- | --- |
| make test | pytest |
| make lint | ruff check . |
| make format | ruff format --check . (check only; does not rewrite files) |
| make typecheck | pyright |
| make check | lint + format + typecheck |

CLI subcommands

Invoked as python -m gremlins.cli <subcommand> or gremlins <subcommand> after install. The authoritative list and per-subcommand description lives in the dispatch table in gremlins/cli/__init__.py.

| Subcommand | Purpose |
| --- | --- |
| launch <name> | Launch a background gremlin by pipeline name (gremlins launch --list to see available) |
| resume | Re-spawn an existing gremlin from its recorded stage |
| stop | Send SIGTERM to a running gremlin and wait for it to exit |
| land | Land a finished gremlin onto the current branch |
| rm | Delete a dead gremlin's state dir, worktree, and branch |
| close | Mark a dead gremlin as closed |
| log | Tail the gremlin's log file |
| ack | Acknowledge a gremlin waiting for human input |
| skip | Skip a gremlin waiting for human input |
| queue | Manage the gremlin launch queue |
| prompt-for-assistant | Print the assistant setup prompt to stdout |

_run-pipeline is an internal spawn boundary; not for direct use.

queue sub-subcommands

| Sub-subcommand | Description |
| --- | --- |
| add [--run] <command> | Add a command to the queue; --run also starts the runner if idle |
| list [--watch] [--json] | List queued items |
| run [--once] [--poll-interval SEC] [--detach] | Start the queue runner |
| requeue [--done] | Move failed (and optionally done) items back to pending |
| clear [--failed\|--done\|--pending\|--purge\|--item STEM] | Remove items from the queue |
| set-state <state> --item STEM | Manually transition a queue item to a different state |
| stop | Stop the detached runner |

Launch flags

Per-pipeline flags

Flags vary by pipeline. The first stage's __init__ signature defines the accepted flags; gremlins launch <name> --help prints the full list.

Common infrastructure flags (accepted by all pipelines):

| Flag | Default | Description |
| --- | --- | --- |
| --plan <path-or-ref> | | Path to a plan/spec file, or a GitHub issue ref (42, #42, owner/repo#42, or issue URL) |
| --description <text> | | Human-readable description stored in state |
| --parent <id> | | Parent gremlin ID (used by boss to track child ownership) |
| --print-id | false | Print the gremlin ID to stdout after launch |
| -c/--instructions <text> | | Instructions string (mutually exclusive with --plan) |
| --base-ref <ref> | HEAD | Git ref to branch the worktree from; ignored for gh pipelines (always anchors to origin default branch). In parallel pipelines, automatically propagated to all child processes. |
| --spec <path> | | Path to a coding-style spec file passed into stages |
| --bypass | false | Skip permission checks; run in bypass mode |
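The four issue-ref shapes accepted by --plan can be told apart with a few patterns. The following is a minimal sketch, not the launcher's actual parser: `IssueRef` and `parse_issue_ref` are hypothetical names, and the assumption that issue URLs look like https://github.com/OWNER/REPO/issues/N is ours.

```python
import re
from typing import NamedTuple, Optional


class IssueRef(NamedTuple):
    owner: Optional[str]  # None when the ref is relative to the current repo
    repo: Optional[str]
    number: int


# One pattern per documented shape: 42 / #42, owner/repo#42, issue URL (assumed form).
_PATTERNS = [
    re.compile(r"^#?(?P<number>\d+)$"),
    re.compile(r"^(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)#(?P<number>\d+)$"),
    re.compile(
        r"^https?://github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)"
        r"/issues/(?P<number>\d+)/?$"
    ),
]


def parse_issue_ref(value: str) -> Optional[IssueRef]:
    """Return an IssueRef if value looks like an issue ref, else None (treat as a path)."""
    text = value.strip()
    for pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            groups = match.groupdict()
            return IssueRef(groups.get("owner"), groups.get("repo"), int(groups["number"]))
    return None
```

In this sketch anything that does not match, such as plan.md, falls through to None and would be handled as a file path. A file literally named 42 would be read as an issue ref here; how the real launcher resolves that ambiguity is not stated above.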

Pipeline configuration

Gremlins runs a sequence of stages defined in a YAML file. The bundled pipelines work out of the box; a project-local YAML can override any of them.

Discovery order

--pipeline <name|path> resolves as follows:

  1. A value with a .yaml suffix or more than one path component is loaded directly as a filesystem path.
  2. Otherwise ./.gremlins/pipelines/<name>.yaml is checked first (project-local override).
  3. Then gremlins/pipelines/<name>.yaml (bundled) is checked.

The pipeline name is the first non-flag argument to gremlins launch. Run gremlins launch --list to see all available pipeline names.
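The discovery order above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind resolve_pipeline_path: `find_pipeline` is a hypothetical helper, and the bundled directory is passed in explicitly rather than located inside the installed package.

```python
from pathlib import Path
from typing import Optional


def find_pipeline(name_or_path: str, project_dir: Path, bundled_dir: Path) -> Optional[Path]:
    """Apply the three discovery rules in order; return None if nothing matches."""
    candidate = Path(name_or_path)
    # Rule 1: a .yaml suffix or more than one path component means "use as a path".
    if candidate.suffix == ".yaml" or len(candidate.parts) > 1:
        return candidate
    # Rules 2 and 3: project-local override first, then the bundled pipelines.
    for directory in (project_dir / ".gremlins" / "pipelines", bundled_dir):
        path = directory / f"{name_or_path}.yaml"
        if path.is_file():
            return path
    return None
```

Because the project-local directory is checked first, dropping a local.yaml into .gremlins/pipelines/ shadows the bundled pipeline of the same name without any extra flag.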

Selecting a pipeline

gremlins launch local   # bundled local.yaml
gremlins launch gh      # bundled gh.yaml

Schema reference

Top-level keys:

name: my-pipeline         # optional; defaults to the file stem

default_client: claude:sonnet   # optional; provider:model string

prompt_dir: ../prompts          # optional; relative to YAML, defaults to the YAML's directory

stages:
  - name: plan
    type: plan
    client: copilot:gpt-5.4     # optional; overrides default_client for this stage
    prompt: gremlins:plan.md    # `gremlins:NAME` -> bundled prompts; bare NAME -> prompt_dir
    options: {}

| Key | Description |
| --- | --- |
| name | Pipeline display name; defaults to the file stem |
| default_client | provider:model string used for stages without an explicit client: |
| prompt_dir | Directory that bare-name prompt: paths resolve against, relative to the YAML file. Defaults to the YAML's directory. |
| stages | Ordered list of stage entries or parallel groups |

Per-stage keys:

| Key | Description |
| --- | --- |
| name | Unique stage identifier; used for resume targeting |
| type | Registered stage type (see Available stage types) |
| client | provider:model string; overrides default_client for this stage |
| prompt | Path or list of paths. gremlins:NAME resolves from the bundled package prompts; a bare NAME resolves from the pipeline's prompt_dir. |
| options | Free-form dict passed to the stage |

provider:model format:

Providers: claude (default), copilot, openai, xai, anthropic. The model part is optional — claude: and claude:sonnet are both valid. Examples: claude:sonnet, copilot:gpt-5.4, openai:gpt-4o. Per-stage client: in YAML takes precedence over the CLI --client flag; default_client: at the pipeline level does not.

Parallel-group form:

- name: reviews
  parallel:
    - name: review-detail
      type: review-code
      client: claude:sonnet
    - name: review-security
      type: review-code
      client: claude:sonnet
  max_concurrent: 2         # optional; defaults to all children at once

| Key | Description |
| --- | --- |
| name | Group identifier |
| parallel | List of child stage entries (no nesting allowed) |
| max_concurrent | Max simultaneously running children (optional) |

Client specifiers

Clients are specified as provider:model inline strings, either at the pipeline level (default_client:) or per stage (client:). The model part is optional.

default_client: claude:sonnet     # all stages default to this
stages:
  - name: plan
    type: plan
  - name: implement
    type: implement
    client: copilot:gpt-5.4       # this stage uses copilot instead

Providers: claude, copilot, openai, xai, anthropic. The CLI --client provider:model flag overrides the pipeline-level default_client: but yields to per-stage client: settings.
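The precedence rule (per-stage client: first, then the CLI --client flag, then default_client:) amounts to "first value that is set wins". A minimal sketch follows; `parse_client` and `effective_client` are hypothetical names, and treating a string with no colon as a bare provider is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple


def parse_client(spec: str) -> Tuple[str, Optional[str]]:
    """Split provider:model; the model part is optional, so 'claude:' yields (claude, None)."""
    provider, _, model = spec.partition(":")
    # Assumption: a spec with no colon at all is read as a provider with no model.
    return provider, (model or None)


def effective_client(
    stage_client: Optional[str],
    cli_client: Optional[str],
    default_client: Optional[str],
) -> Optional[str]:
    """Per-stage client beats the CLI flag, which beats the pipeline default."""
    for candidate in (stage_client, cli_client, default_client):
        if candidate:
            return candidate
    return None
```

For example, with default_client: claude:sonnet and --client openai:gpt-4o on the command line, a stage that sets client: copilot:gpt-5.4 keeps copilot, while every other stage runs on openai:gpt-4o.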

prompt: field

prompt: gremlins:plan.md                                  # single bundled file
prompt: [gremlins:code_style.md, plan.md]                 # mix bundled and local; concatenated with \n\n

Each entry is one of:

  • gremlins:NAME — resolved from the bundled prompts shipped with the package. Use this for prompts owned by gremlins (code_style.md, plan_gh.md, etc.).
  • bare NAME — resolved from the pipeline's top-level prompt_dir: (relative to the YAML file; defaults to the YAML's own directory). Use this for prompts you author and check in alongside your pipeline.

Lists are joined with \n\n before being passed to the stage. There is no search fallback between the two — the prefix is the contract, so a custom YAML reads as self-describing about which prompts come from the package vs which must be provided locally.

By convention, project-local prompts live in ./.gremlins/prompts/ (a peer of ./.gremlins/pipelines/, not nested under it) and pipelines set prompt_dir: ../prompts.
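The prefix rule and the \n\n join can be sketched as follows. This is an illustration only: `resolve_prompt_entry` and `load_prompt` are hypothetical names, and the bundled prompt directory is passed in explicitly instead of being located inside the installed package.

```python
from pathlib import Path
from typing import List, Union

BUNDLED_PREFIX = "gremlins:"


def resolve_prompt_entry(entry: str, bundled_dir: Path, prompt_dir: Path) -> Path:
    """The prefix alone decides where an entry comes from; there is no fallback search."""
    if entry.startswith(BUNDLED_PREFIX):
        return bundled_dir / entry[len(BUNDLED_PREFIX):]
    return prompt_dir / entry


def load_prompt(prompt: Union[str, List[str]], bundled_dir: Path, prompt_dir: Path) -> str:
    """Accept a single entry or a list, and join the file contents with a blank line."""
    entries = [prompt] if isinstance(prompt, str) else list(prompt)
    texts = [
        resolve_prompt_entry(entry, bundled_dir, prompt_dir).read_text(encoding="utf-8")
        for entry in entries
    ]
    return "\n\n".join(texts)
```

A missing local file raises FileNotFoundError in this sketch instead of silently falling back to a bundled prompt of the same name, which mirrors the "prefix is the contract" rule.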

options: field

A free-form dict passed verbatim to the stage. Selected options by stage (see gremlins/stages/AGENTS.md for the full list):

verify — runs a list of shell commands with an agent fix-loop:

options:
  cmds: ["make check", "make test"]  # commands to run (joined with &&)
  max_attempts: 3                    # fix-loop retries (default: 3)

For local stages, model options (plan_model, impl_model, address_model, test_fix_model, detail) can also be set here to override the CLI defaults.
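The verify fix-loop described above has roughly the following shape. This is a sketch, not the stage's implementation: `verify`, `run_shell`, and the `fix` callback are hypothetical, and reading max_attempts as the total number of command runs (rather than retries after the first run) is an assumption.

```python
import subprocess
from typing import Callable, Sequence


def run_shell(command: str) -> bool:
    """Run one shell command line and report whether it exited with status 0."""
    return subprocess.run(command, shell=True).returncode == 0


def verify(
    cmds: Sequence[str],
    max_attempts: int = 3,
    fix: Callable[[int], None] = lambda attempt: None,
    run: Callable[[str], bool] = run_shell,
) -> bool:
    """Run the joined commands; after each failure ask the fixer to repair, then retry."""
    command = " && ".join(cmds)  # later commands only run if earlier ones pass
    for attempt in range(1, max_attempts + 1):
        if run(command):
            return True
        if attempt < max_attempts:
            fix(attempt)  # stand-in for the agent that edits the worktree
    return False
```

Joining with && means a failing make check short-circuits make test, so the fixer sees the first failure rather than a pile of downstream ones.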

Available stage types

| Type | Description |
| --- | --- |
| plan | Produces an implementation plan |
| implement | Applies the plan to the working tree |
| review-code | Runs a code review and writes findings to disk |
| verify | Runs check and test commands with an agent fix-loop |
| exec | Runs shell commands with in:/out: artifact bindings |
| agent | Resolves in: artifacts, renders prompt, invokes agent, verifies out: artifacts |
| handoff | Runs the handoff agent once per boss loop iteration |
| loop | Iterates body stages until a termination predicate or max iterations |
| sequence | Runs body stages sequentially using child state |
| github-open-pull-request | Opens a pull request on GitHub |
| github-request-copilot-review | Requests a Copilot review on the open PR |
| github-wait-copilot | Polls until Copilot posts its review |
| github-wait-ci | Polls PR CI checks until they pass or exhaust attempts |

Parallel groups

Wrap sibling stages in a parallel: list to run them concurrently:

default_client: claude:sonnet

stages:
  - name: plan
    type: plan

  - name: reviews
    parallel:
      - name: review-detail
        type: review-code
      - name: review-security
        type: review-code
    max_concurrent: 2

  - name: address-code
    type: agent

Execution and failure: The parallel group executes in three phases:

  1. Fan-out — each child stage starts independently as a subprocess
  2. Concurrent execution — all children run simultaneously (up to max_concurrent)
  3. Fan-in — all children finish or one bails; siblings continue running until group completion

If any child fails (raises Bail), the pipeline halts after the group finishes — siblings are not cancelled mid-run by default. This can be changed with cancel_on_bail: true to cancel outstanding tasks immediately. The bail is evaluated via bail_policy (default: any, meaning one failed child halts the group; set bail_policy: all to halt only when all children bail). Subsequent stages are skipped; the operator can resume or ack the group via CLI.
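The default behavior (siblings keep running, the bail is evaluated once the group finishes) can be modeled with a thread pool. This is a sketch only: gremlins runs children as subprocesses, whereas this example uses threads, and `run_group` plus the local `Bail` class are hypothetical stand-ins. cancel_on_bail is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Tuple


class Bail(Exception):
    """Stand-in for the Bail exception named above."""


def run_group(
    children: Dict[str, Callable[[], None]],
    max_concurrent: Optional[int] = None,
    bail_policy: str = "any",
) -> Tuple[Dict[str, Optional[str]], bool]:
    """Fan out, wait for every child, then decide whether the group halts the pipeline."""
    if bail_policy not in ("any", "all"):
        raise ValueError(f"unknown bail_policy: {bail_policy!r}")
    workers = max_concurrent or len(children) or 1  # default: all children at once
    bails: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in children.items()}
        # Fan-in: collect every result; a bailing child does not cancel its siblings.
        for name, future in futures.items():
            try:
                future.result()
                bails[name] = None
            except Bail as exc:
                bails[name] = str(exc)
    bailed = [detail is not None for detail in bails.values()]
    halt = any(bailed) if bail_policy == "any" else (bool(bailed) and all(bailed))
    return bails, halt
```

With bail_policy "any", one failing reviewer is enough to halt the pipeline after the group; with "all", the pipeline continues unless every child bailed.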

State isolation: Each child gets its own state directory and subprocess. Client overrides, worktree paths, and artifact bindings are isolated per-child. Children run in parallel without blocking each other. Parent state.json is updated during the concurrent phase (e.g., active_children snapshot); copying child artifact bindings into the parent registry is deferred until fan-in completes.

Resume targeting: Use the full child gremlin ID (form: <parent-id>--<group-name>--<child-key>, visible in fleet view) to resume a specific child. Resuming the parent group ID re-spawns all children that haven't landed.
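The child ID form can be built and taken apart with plain string operations. A small sketch follows, with hypothetical helper names, assuming that group names and child keys never contain a double hyphen themselves.

```python
from typing import Tuple

SEPARATOR = "--"


def child_id(parent_id: str, group_name: str, child_key: str) -> str:
    """Compose <parent-id>--<group-name>--<child-key>."""
    return SEPARATOR.join((parent_id, group_name, child_key))


def split_child_id(full_id: str) -> Tuple[str, str, str]:
    """Split from the right so a parent ID that itself contains '--' stays intact."""
    parts = full_id.rsplit(SEPARATOR, 2)
    if len(parts) != 3:
        raise ValueError(f"not a child gremlin ID: {full_id!r}")
    parent_id, group_name, child_key = parts
    return parent_id, group_name, child_key
```

Splitting from the right matters for nested cases: if a parent is itself a child of some other group, its ID already contains the separator.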

Base ref propagation: The --base-ref flag is automatically propagated from the parent to all child processes, ensuring consistent branching across the group. Child worktrees are derived from the parent's base_ref as recorded in state.

Worked example: project-local override

Create .gremlins/pipelines/local.yaml to override the bundled local pipeline. This example uses Opus for plan/implement/address stages and adds a verify stage before review-code:

name: local

stages:
  - { type: plan,         options: { plan_model: opus } }
  - { type: implement,    options: { impl_model: opus } }
  - { type: verify,       options: { cmds: ["pytest"] } }
  - { type: review-code }
  - { name: address-code, type: agent, options: { address_model: opus } }

Add a prompt: key to any stage to supply a custom prompt; bare names resolve against prompt_dir, which defaults to the YAML file's directory.

Worked example: parallel reviewers

Run two review-code passes in parallel, then address both:

name: local

default_client: claude:sonnet

stages:
  - { type: plan }
  - { type: implement }

  - name: reviews
    parallel:
      - name: review-detail
        type: review-code
      - name: review-security
        type: review-code
    max_concurrent: 2

  - { name: address-code, type: agent }

Note: review-code does not currently support per-stage prompt overrides via YAML — both passes use the built-in detail lens.

Stage definitions

YAML stage-definitions: lets you name and reuse stage patterns within a pipeline:

stage-definitions:
  review-base: &review-base
    type: review-code
    client: claude:sonnet
    prompt: gremlins:code_style.md

stages:
  - { type: plan }
  - { type: implement }
  - name: review-detail
    <<: *review-base
    prompt: [gremlins:code_style.md, detail_review.md]
  - name: review-security
    <<: *review-base
    prompt: security_review.md

Definitions provide base type, options, and prompt. Call-sites can override prompt and options via YAML anchors (as shown above) or via template placeholders in multi-stage recipes. Call-sites own the name:, in:, and out: keys; out: is forbidden inside a definition, but in: can be declared and will be merged with call-site in: values. For single-stage definitions, only name, in, and out keys can be safely overridden; to vary prompt or options, use anchors.
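The ownership rules for name:, in:, and out: can be sketched as a merge over plain dicts. This is illustrative only: `apply_definition` is a hypothetical helper, and letting the call-site win when both sides bind the same in: key is an assumption.

```python
from typing import Any, Dict


def apply_definition(definition: Dict[str, Any], call_site: Dict[str, Any]) -> Dict[str, Any]:
    """Combine a stage definition with the call-site keys it is allowed to set."""
    if "out" in definition:
        raise ValueError("out: is forbidden inside a stage definition")
    # Start from the definition's base type, options, and prompt.
    merged = {key: value for key, value in definition.items() if key != "in"}
    # name: and out: belong to the call-site.
    for key in ("name", "out"):
        if key in call_site:
            merged[key] = call_site[key]
    # in: is merged; on a key clash the call-site value wins (assumption).
    merged_in = {**definition.get("in", {}), **call_site.get("in", {})}
    if merged_in:
        merged["in"] = merged_in
    return merged
```

The sketch deliberately ignores prompt and options at the call-site, matching the note that those are varied through YAML anchors before this kind of merge would run.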

Artifact binding

Stages can bind artifacts via in: and out: maps. These define what data flows between stages in the pipeline:

stages:
  - name: scan
    type: exec
    options:
      cmds: ["python scan.py > $ARTIFACTS/report.json"]
    out:
      report: file://session/report

  - name: analyze
    type: agent
    in:
      report: report
    prompt: |
      The scanning report is in {report}.
      Propose fixes.

Artifact URI schemes:

  • file://session/<name> — Session artifact: a file created under the gremlin's $ARTIFACTS directory
  • git://ref/<ref> — Git ref name (e.g., git://ref/main returns the string main)
  • git://commit/<sha> — Commit SHA (e.g., git://commit/abc123def returns the full SHA)
  • git://range/<base>..<head> — Commit range/log between two refs
  • gh://pulls/<number>/head — GitHub PR head ref (and other gh:// schemes for GitHub data)
  • Artifact resolvers support three base schemes: file://, git://, and gh://

Artifact binding semantics:

  • in: values are registry key paths (e.g., report or report.critical?default) with optional dotted attribute access and ?default fallback
  • out: values are URI strings that name what the stage produces; downstream stages reference the key name (not the URI) in their in: maps
  • Prompt/option substitution uses {var} tokens (not {{var}}); artifacts bound via in: become available for substitution
  • in: can be declared in a stage definition and will be merged with call-site in: values; out: cannot appear inside a definition
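The in: lookup and {var} substitution described above can be sketched as follows. The registry is modeled here as nested dicts, dotted access as nested key lookup, and the text after ? as a literal fallback string; all three are assumptions, and `resolve_binding` and `render` are hypothetical names.

```python
import re
from typing import Any, Mapping


def resolve_binding(spec: str, registry: Mapping[str, Any]) -> Any:
    """Resolve 'key', 'key.attr', or 'key.attr?fallback' against the registry."""
    path, has_default, default = spec.partition("?")
    value: Any = registry
    for part in path.split("."):
        if isinstance(value, Mapping) and part in value:
            value = value[part]
        elif has_default:
            return default  # missing segment, but a fallback was supplied
        else:
            raise KeyError(spec)
    return value


def render(template: str, bindings: Mapping[str, str], registry: Mapping[str, Any]) -> str:
    """Replace {var} tokens with the values bound via in:; unknown tokens are left as-is."""
    values = {name: resolve_binding(spec, registry) for name, spec in bindings.items()}
    return re.sub(
        r"\{(\w+)\}",
        lambda match: str(values.get(match.group(1), match.group(0))),
        template,
    )
```

Leaving unknown tokens untouched keeps literal braces in prompts (for example, JSON snippets) from raising errors, which is one reason a regex is used here instead of str.format.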

Stage definitions and bundled recipes

Some stage types are not built-in — they are provided as bundled YAML recipes and must be wired in via stage-definitions: before use:

stage-definitions:
  github-push-to-pr-branch: gremlins:github_push_to_pr_branch

stages:
  - { name: push, type: github-push-to-pr-branch }

gremlins:NAME resolves the recipe from the bundled package (gremlins/recipes/stages/NAME.yaml). A bare path resolves relative to the pipeline file.

Bundled pipelines

The canonical reference pipelines:

Error handling and recovery

Gremlins can fail or get stuck during execution. Understanding how to recover is essential for running long-running pipelines.

Bail semantics

When a stage detects an unrecoverable condition (e.g., a code review requests changes, secrets are detected, or a merge conflict blocks progress), it raises a Bail exception with a detail string.

By convention, agent-based stages emit a BAIL: <class>: <detail> marker at the end of their output. The <class> token is conventionally one of:

  • reviewer_requested_changes — code review found issues that must be addressed
  • security — security review detected problems
  • secrets — credentials or sensitive data detected in the code
  • other — stage-specific or unknown failure condition
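Picking the marker out of agent output might look like the sketch below. It assumes the marker sits on the last non-empty line of the output; `BailMarker` and `parse_bail` are hypothetical names, not part of gremlins.

```python
import re
from typing import NamedTuple, Optional


class BailMarker(NamedTuple):
    bail_class: str
    detail: str


# Matches "BAIL: <class>: <detail>" on a single line.
_BAIL_RE = re.compile(r"^BAIL:\s*(?P<cls>[A-Za-z_]+)\s*:\s*(?P<detail>.*)$")


def parse_bail(output: str) -> Optional[BailMarker]:
    """Return the bail marker from the end of the output, or None if there is none."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    match = _BAIL_RE.match(lines[-1])
    if not match:
        return None
    return BailMarker(match["cls"], match["detail"].strip())
```

Only the final line is inspected, so a BAIL: string quoted earlier in the output (for instance, inside a reviewed diff) is not mistaken for a real bail in this sketch.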

The bail detail is written to a per-attempt bail_<attempt>.json file in the gremlin's state directory and is visible in the fleet view. When a stage bails, the entire pipeline halts — subsequent stages do not run, but the gremlin's state is preserved for recovery.

Recovering from gremlin failures

When a gremlin bails and halts, you have three recovery options:

gremlins resume <id> — Re-spawn the bailed gremlin from the stage where it bailed. Use this when the cause has been fixed externally (e.g., a code review fix has been merged, or a merge conflict has been resolved). The gremlin will restart from the bailed stage with the current worktree state.

gremlins ack <id> — Acknowledge the gremlin without re-running. Use this when the bailed condition is acceptable (e.g., the review found minor style issues that don't block landing, or external work was already completed). The gremlin marks the bailed stage as complete and proceeds to subsequent stages.

gremlins skip <id> — Create a new sibling attempt with the same parameters and a fresh ID, leaving the failed gremlin in place. Use this for transient failures (timeouts, CI hangs) that won't self-resolve. Both attempts are visible in the fleet; the new attempt begins from the start.

Handling parallel group failures

When a child in a parallel group bails:

  • The group halts after all currently-running children finish (not mid-run), unless cancel_on_bail: true
  • The bail reason is attributed to the child stage name
  • gremlins resume <parent-id> re-spawns all children that haven't landed
  • gremlins resume <parent-id>--<group-name>--<child-key> resumes only that child (use the full child ID from fleet view)

If the cause was a transient failure affecting multiple children, skip the entire group and re-launch the pipeline to restart all children.

Boss-chain recovery

When a boss gremlin spawns child gremlins (gremlins launch ... --parent <boss-id>), the boss halts if a child bails. At this point:

  • The child's gremlin ID is visible in the fleet view as a child of the boss
  • Recover the child (resume, ack, or skip) independently
  • Once the child lands or is abandoned, resume the boss (gremlins resume <boss-id>)

The boss resumes from its child-spawn stage and proceeds with the next iteration (re-planning, re-implementing, or wrapping up, depending on the pipeline).

What can a gremlin do to my machine?

Gremlins operate in one of two permission modes:

Default mode (no flags): The agent is restricted to an allowlist of tools (Read, Edit, Write, Bash, Grep, Glob) and its Bash commands are path-scoped to the gremlin's git worktree. It can read and modify files inside that worktree; direct path references outside it are blocked. This is a best-effort token check, not a full sandbox: indirect references (heredocs, computed paths) may not be caught.
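To make "best-effort token check" concrete, here is a minimal sketch of the idea, not the check gremlins actually performs. `out_of_scope_tokens` is a hypothetical helper that only looks at tokens written as absolute paths.

```python
import shlex
from pathlib import Path
from typing import List


def out_of_scope_tokens(command: str, worktree: Path) -> List[str]:
    """Return absolute-path tokens in a shell command that point outside the worktree."""
    root = worktree.resolve()
    offending = []
    for token in shlex.split(command):
        if not token.startswith("/"):
            continue  # relative paths, flags, and plain words are not inspected here
        target = Path(token).resolve()
        if target != root and root not in target.parents:
            offending.append(token)
    return offending
```

The gaps are easy to see: a path assembled from variables, produced by command substitution, or reached through ../ never appears as a literal absolute token, so a check of this kind cannot catch it.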

Bypass mode (--bypass, GREMLINS_BYPASS_PERMISSIONS=1, project .gremlins/permissions.yaml bypass_permissions: true, or user config ~/.config/gremlins/config.toml bypass_permissions = true): All permission checks are disabled. The agent can use any tool and reference any path. Use this when the task genuinely requires broader access (e.g. a pipeline that modifies system config).

The three opt-in paths for bypass are:

  1. gremlins launch <pipeline> --bypass — single-launch override
  2. GREMLINS_BYPASS_PERMISSIONS=1 in the environment
  3. bypass_permissions: true in .gremlins/permissions.yaml (project) or bypass_permissions = true in ~/.config/gremlins/config.toml (user)
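Since each opt-in path enables bypass on its own, the decision reduces to a logical OR. A minimal sketch, assuming the two config files have already been parsed into dicts; `bypass_enabled` is a hypothetical name.

```python
from typing import Any, Mapping, Optional


def bypass_enabled(
    cli_flag: bool,
    environ: Mapping[str, str],
    project_config: Optional[Mapping[str, Any]] = None,
    user_config: Optional[Mapping[str, Any]] = None,
) -> bool:
    """Bypass is on if any one of the opt-in paths asks for it."""
    return bool(
        cli_flag  # gremlins launch <pipeline> --bypass
        or environ.get("GREMLINS_BYPASS_PERMISSIONS") == "1"
        or (project_config or {}).get("bypass_permissions")  # .gremlins/permissions.yaml
        or (user_config or {}).get("bypass_permissions")  # ~/.config/gremlins/config.toml
    )
```

One consequence worth noting: in this model a project-level bypass_permissions: true applies to every launch from that repo, with no per-launch flag needed.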

Honest disclaimer: The allowlist limits reach — what paths and tools the agent can invoke. It does not limit impact within reach. A gremlin with write access to your worktree can make any change inside it. Review landed commits before merging.

Backend differences: On openai: and xai: backends, gremlins owns the tool layer and enforces the allowlist directly. On the anthropic: backend, enforcement is coarser — the SDK loop uses vendor-defined tools and the path scoping is advisory. On claude: and copilot: subprocess backends, the gremlins-layer permission block is not translated into CLI flags or settings — the underlying CLI reads the operator's ambient config and enforces whatever the operator has configured there. See "Backend config inheritance" below.

Backend config inheritance

The claude: backend is a thin wrapper around claude -p. It does not materialize a per-gremlin config dir, and it does not set CLAUDE_CONFIG_DIR for the subprocess. Whatever the operator has configured for their interactive Claude session is exactly what the subprocess sees:

  • Settings: ~/.claude/settings.json (plus any project-level .claude/settings.json the CLI discovers) is read by the CLI directly. The gremlins-layer allowed_tools / disallowed_tools block has no effect on claude: runs; configure tool permissions via your own Claude settings or use the anthropic: backend.

    Gremlin worktrees — where the claude: subprocess does its file edits — live under a stable, gremlins-scoped prefix in the system temp directory. Discover it at runtime:

    python -c "from gremlins import paths; print(paths.work_root())"
    

    On Linux/macOS this is /tmp/gremlins; the OS reclaims orphaned worktrees on reboot. A single permissions.allow rule in ~/.claude/settings.json covers every worktree path:

    {
      "permissions": {
        "allow": [
          "Edit(<work_root>/**)",
          "Write(<work_root>/**)",
          "Read(<work_root>/**)"
        ]
      }
    }

    Replace <work_root> with the actual output of the command above.

  • MCP servers and hooks — inherited from the user's Claude config.

  • Auth — subscription auth follows ~/.claude/.credentials.json (or the macOS keychain) exactly as it would for an interactive session.

  • Permission mode — the only thing the wrapper still controls per call: --permission-mode bypassPermissions when bypass is enabled, otherwise default.

True process isolation: use an SDK backend

If you need per-gremlin tool allow-lists, hermetic config, or a clean separation between gremlins and your interactive Claude session, use one of the SDK-backed providers instead:

  • anthropic:<model-id> uses claude-agent-sdk with setting_sources=[] (no ambient settings, no MCP, no hooks). Requires ANTHROPIC_API_KEY. allowed_tools from the native block is enforced by the SDK.
  • openai:<model-id> / xai:<model-id> use the openai-agents SDK with the in-tree GREMLINS_TOOLS list. Per-gremlin allowed_tools filters that list. Requires OPENAI_API_KEY / XAI_API_KEY.

Set via pipeline YAML:

default_client: anthropic:claude-sonnet-4-6
# or per-stage:
stages:
  - name: implement
    client: anthropic:claude-sonnet-4-6

Subscription auth is not available on the SDK backends — that is Anthropic policy, not a gremlins limitation.

Local environment overrides

If .gremlins/env exists in the project root, gremlins sources it through bash at startup and merges any new or changed variables into the process environment before any stage runs. All subprocesses (plan, implement, verify, review) inherit the result automatically.

Security warning: because .gremlins/env is executed as a bash script, it can run arbitrary code. Do not run gremlins in a repository unless you have reviewed the contents of .gremlins/env and trust them.

The file is sourced via bash, so it can use command substitution, conditionals, and anything bash supports:

export VIRTUAL_ENV=$(poetry env info --path)
export PATH="$VIRTUAL_ENV/bin:$PATH"
export TEST_DATABASE_URL=postgresql://localhost/mydb_test

Add .gremlins/env to your ~/.gitignore_global or project .gitignore.
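One way to "source through bash and merge" is to let bash source the file, dump its environment NUL-separated, and keep only what changed. This is a sketch under assumptions, not the gremlins loader: the helper names are hypothetical, and it relies on env -0 being available (GNU coreutils and recent BSD/macOS env support it).

```python
import subprocess
from pathlib import Path
from typing import Dict, Mapping


def parse_env_dump(dump: bytes) -> Dict[str, str]:
    """Parse NUL-separated KEY=VALUE entries as printed by `env -0`."""
    env: Dict[str, str] = {}
    for entry in dump.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def changed_vars(before: Mapping[str, str], after: Mapping[str, str]) -> Dict[str, str]:
    """Keep only variables that are new or whose value differs."""
    return {key: value for key, value in after.items() if before.get(key) != value}


def source_env_file(path: Path, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Source the file in bash and return the variables it added or changed."""
    result = subprocess.run(
        ["bash", "-c", 'source "$1" >/dev/null && env -0', "bash", str(path)],
        env=dict(base_env),
        capture_output=True,
        check=True,
    )
    return changed_vars(base_env, parse_env_dump(result.stdout))
```

NUL separation is used because values may contain newlines. Bash also sets a few variables of its own (such as SHLVL and _), so a real loader would likely filter those out before merging.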

Loader API

gremlins/pipeline/loader.py exposes:

  • load_pipeline(path) → Pipeline: parses a YAML file, resolves clients via CLIENT_FACTORIES, and validates every stage type against STAGE_REGISTRY (populated by importing gremlins.stages.all).
  • resolve_pipeline_path(name_or_path, base_dir) — resolves a name or path using the discovery order above.

Dataclasses: Pipeline, StageEntry (parallel groups have type="parallel" internally and carry a children list and optional max_concurrent).

Internals docs
