Warm-modules cache races between concurrent jobs corrupt node_modules #321

## Problem

When multiple workflows run concurrently (e.g. `agent-ci run --all`), parallel jobs that share a lockfile bind-mount the **same** host directory at `/home/runner/_work/<repo>/<repo>/node_modules`. Each container then runs its own `pnpm install --frozen-lockfile` against that shared mount, so the installs race one another and leave `node_modules` corrupted. `repairWarmCache` then treats the corrupted state as "warm" on every subsequent run, so the failure is sticky: it does not self-heal.

In our case the failure surfaces as:

```
> machinen@1.0.0 format:check /home/runner/_work/machinen/machinen
> oxfmt --check
node:internal/modules/cjs/loader:1386
throw err;
^
Error: Cannot find module '/home/runner/_work/machinen/machinen/node_modules/oxfmt/bin/oxfmt'
```

Inspecting the host warm-modules cache after the failure:

- `node_modules/oxfmt` → symlink to `.pnpm/oxfmt@0.33.0/node_modules/oxfmt` (intact).
- `.pnpm/oxfmt@0.33.0/node_modules/oxfmt/bin/` → exists as an **empty directory**. The `oxfmt` bin file inside it had been deleted by a concurrent `pnpm install` mid-flight, but the parent dir and the top-level symlink survived.
- `.modules.yaml` sentinel is still present, so `repairWarmCache` returns `"warm"` and reuses the broken cache on every subsequent run.
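
The signature above (top-level symlink intact, declared bin file missing) can be checked mechanically. The following is a minimal sketch, not agent-ci code: `missingBins` and `findBrokenPackages` are hypothetical names, and it assumes each package's `package.json` survived the race and declares its executables in the standard `bin` field.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Return the declared bin files that are missing on disk for one package.
// `pkgDir` may be a symlink (as in pnpm's layout); realpathSync follows it.
function missingBins(pkgDir) {
  let realDir;
  try {
    realDir = fs.realpathSync(pkgDir);
  } catch {
    return [pkgDir]; // dangling symlink: the package itself is gone
  }
  const manifestPath = path.join(realDir, "package.json");
  if (!fs.existsSync(manifestPath)) return [manifestPath];
  const { bin } = JSON.parse(fs.readFileSync(manifestPath, "utf8"));
  // `bin` is either a single path or a { command: relativePath } map.
  const targets = typeof bin === "string" ? [bin] : Object.values(bin ?? {});
  return targets
    .map((rel) => path.join(realDir, rel))
    .filter((file) => !fs.statSync(file, { throwIfNoEntry: false })?.isFile());
}

// Scan the top level of a node_modules directory, including @scope/ packages.
// Returns { "<package>": [missing paths...] } for every package with a problem.
function findBrokenPackages(nodeModulesDir) {
  const broken = {};
  for (const entry of fs.readdirSync(nodeModulesDir)) {
    if (entry.startsWith(".")) continue; // skip .pnpm, .bin, .modules.yaml
    const full = path.join(nodeModulesDir, entry);
    const pkgDirs = entry.startsWith("@")
      ? fs.readdirSync(full).map((sub) => path.join(full, sub))
      : [full];
    for (const dir of pkgDirs) {
      const missing = missingBins(dir);
      if (missing.length > 0) broken[path.relative(nodeModulesDir, dir)] = missing;
    }
  }
  return broken;
}
```

Pointed at the host warm-modules directory from this report, a check like this would be expected to list `oxfmt` with its missing `bin/oxfmt` path, while an intact install would yield an empty object.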

## Where

- `dist/runner/directory-setup.js:52` keys `warmModulesDir` by **repo + lockfile hash only** — no per-job/per-runner suffix:
  ```js
  const warmModulesDir = path.resolve(workDir, "cache", "warm-modules", repoSlug, lockfileHash);
  ```
- `dist/docker/container-config.js:92` bind-mounts that single dir into every container:
  ```js
  `${h(warmModulesDir)}:/home/runner/_work/${repoName}/${repoName}/node_modules`,
  ```
- `dist/cli.js:1168` serializes only the **first job of the first wave per workflow** when the cache is cold; cross-workflow concurrency (and warm-cache concurrency) is unprotected.
- `repairWarmCache` in `dist/output/cleanup.js` only checks for the presence of `.modules.yaml` / `.package-lock.json` / `.yarn-integrity` / `.cache`. A partially deleted `node_modules` with the sentinel intact is therefore reported as "warm" indefinitely.
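
To illustrate the missing per-job suffix, here is a sketch of how the cache key could be extended. `warmModulesDirFor` and the `jobId` parameter are hypothetical; the other path segments mirror the `path.resolve` call quoted above.

```javascript
const path = require("node:path");

// Hypothetical helper: same layout as the existing key, plus a per-job leaf.
// `jobId` is assumed to be unique among jobs that can run at the same time.
function warmModulesDirFor({ workDir, repoSlug, lockfileHash, jobId }) {
  // Keep the suffix to a single filesystem-safe path segment.
  const safeJobId = `job-${String(jobId).replace(/[^A-Za-z0-9._-]/g, "_")}`;
  return path.resolve(workDir, "cache", "warm-modules", repoSlug, lockfileHash, safeJobId);
}
```

Two concurrent jobs would then mount different host directories. The trade-off is that each job starts cold unless its directory is seeded from a shared copy.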

## Reproduce

1. Repo with multiple workflow files that each run `pnpm install --frozen-lockfile` and then a script that executes a hoisted bin (e.g. `oxfmt --check`).
2. `npx agent-ci run --all -q -p`.
3. Intermittently, one of the jobs fails with `Cannot find module .../node_modules/<bin-pkg>/bin/<bin>` even though the host warm-modules directory contains an intact symlink at the top level.
4. Subsequent re-runs keep failing identically until the warm-modules dir is manually deleted — `repairWarmCache` will not detect the corruption.

## Suggested fixes

- Per-job warm-modules dirs (key by `repoSlug + lockfileHash + jobId/runnerId`), and seed each from a shared read-only template — eliminates the race entirely.
- Or take a cross-process lock (e.g. a lock file on the host) on the warm-modules dir during `pnpm install` so concurrent containers serialize their installs.
- Strengthen `repairWarmCache` to detect partial corruption — e.g. resolve the top-level symlinks/bin shims under `node_modules/.bin/` and confirm they point at real files. The current sentinel-only check is too weak.
- At minimum, document that workflows sharing a lockfile must be run one at a time rather than with `--all`, and warn when multiple jobs would mount the same warm-modules path.
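
The locking option could look roughly like the sketch below. It is an assumption-laden illustration rather than a proposed patch: `withDirLock` is a hypothetical helper, it relies on `mkdir` failing with `EEXIST` as an atomic test-and-set, and it assumes agent-ci can hold the lock on the host for the duration of the container's install step.

```javascript
const fs = require("node:fs/promises");
const path = require("node:path");

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `fn` while holding an exclusive lock directory next to `targetDir`.
// mkdir without `recursive` rejects with EEXIST if the directory already exists,
// so only one process at a time can create it. The parent of `targetDir` must exist.
async function withDirLock(targetDir, fn, { retryMs = 100, timeoutMs = 10 * 60_000 } = {}) {
  const lockDir = `${path.resolve(targetDir)}.lock`;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      await fs.mkdir(lockDir);
      break; // lock acquired
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
      if (Date.now() > deadline) throw new Error(`timed out waiting for ${lockDir}`);
      await sleep(retryMs);
    }
  }
  try {
    return await fn();
  } finally {
    await fs.rmdir(lockDir); // release even if fn() throws
  }
}
```

This sketch does not handle stale locks left behind by a crashed process; a real implementation would need an owner marker or expiry on the lock directory.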

## Workarounds

- Run workflows one at a time (`agent-ci run --workflow <file>`).
- After any failure, manually `rm -rf` the warm-modules dir for the repo. Note there are **two** possible cache roots (`/tmp/agent-ci/cli/cache/warm-modules/<slug>` and `/tmp/agent-ci/agent-ci/cache/warm-modules/<slug>`) depending on which agent-ci binary is invoked — both must be cleared.
- Remove `confirm-modules-purge=false` from `.npmrc` so pnpm fails loudly instead of silently mutating shared state.
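
For the manual cleanup workaround, a small script can clear the repo's entry under both cache roots in one step. This is a convenience sketch: `clearWarmModules` is a hypothetical name, and the default roots are simply the two paths listed above.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// The two cache roots named above; which one is used depends on the agent-ci binary.
const DEFAULT_ROOTS = [
  "/tmp/agent-ci/cli/cache/warm-modules",
  "/tmp/agent-ci/agent-ci/cache/warm-modules",
];

// Delete the warm-modules directory for `slug` under every root that has one.
// Returns the list of directories that were actually removed.
function clearWarmModules(slug, roots = DEFAULT_ROOTS) {
  const removed = [];
  for (const root of roots) {
    const dir = path.join(root, String(slug));
    const rel = path.relative(root, dir);
    // Refuse anything that resolves to the root itself or escapes it.
    if (rel === "" || rel === ".." || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel)) {
      throw new Error(`refusing to remove ${dir}`);
    }
    if (fs.existsSync(dir)) {
      fs.rmSync(dir, { recursive: true, force: true });
      removed.push(dir);
    }
  }
  return removed;
}
```

Calling `clearWarmModules("<slug>")` with the repo's slug removes whichever of the two directories exist and leaves the rest of each cache root untouched.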

## Versions

- `@redwoodjs/agent-ci` 0.12.4
- pnpm 10.30.3
- Node 22.22.2 (in container)
