## Problem
When running multiple workflows concurrently (e.g. `agent-ci run --all`), parallel jobs that share a lockfile end up bind-mounting the same host directory at `/home/runner/_work/<repo>/<repo>/node_modules`. Each container then runs its own `pnpm install --frozen-lockfile` against that shared mount, racing each other. The result is a corrupted `node_modules`. `repairWarmCache` then keeps treating that corrupted state as "warm" on every subsequent run, so the failure is sticky: it does not self-heal.
In our case the failure surfaces as:

```
> machinen@1.0.0 format:check /home/runner/_work/machinen/machinen
> oxfmt --check

node:internal/modules/cjs/loader:1386
  throw err;
  ^

Error: Cannot find module '/home/runner/_work/machinen/machinen/node_modules/oxfmt/bin/oxfmt'
```
Inspecting the host warm-modules cache after the failure:

- `node_modules/oxfmt` → symlink to `.pnpm/oxfmt@0.33.0/node_modules/oxfmt` (intact).
- `.pnpm/oxfmt@0.33.0/node_modules/oxfmt/bin/` → exists as an empty directory. The `oxfmt` bin file inside it had been deleted by a concurrent `pnpm install` mid-flight, but the parent directory and the top-level symlink survived.
- `.modules.yaml` sentinel is still present, so `repairWarmCache` returns "warm" and reuses the broken cache on every subsequent run.
## Where

`dist/runner/directory-setup.js:52` keys `warmModulesDir` by repo + lockfile hash only, with no per-job/per-runner suffix:

```js
const warmModulesDir = path.resolve(workDir, "cache", "warm-modules", repoSlug, lockfileHash);
```

`dist/docker/container-config.js:92` bind-mounts that single directory into every container:

```js
`${h(warmModulesDir)}:/home/runner/_work/${repoName}/${repoName}/node_modules`,
```

`dist/cli.js:1168` serializes only the first job of the first wave per workflow when the cache is cold; cross-workflow concurrency (and warm-cache concurrency) is unprotected.

`dist/output/cleanup.js`'s `repairWarmCache` only checks for `.modules.yaml` / `.package-lock.json` / `.yarn-integrity` / `.cache`. A partially deleted `node_modules` with the sentinel intact is incorrectly flagged "warm" forever.
## Reproduce

- Repo with multiple workflow files that each run `pnpm install --frozen-lockfile` and then a script that exec's a hoisted bin (e.g. `oxfmt --check`).
- Run `npx agent-ci run --all -q -p`.
- Intermittently, one of the jobs fails with `Cannot find module .../node_modules/<bin-pkg>/bin/<bin>` even though the host warm-modules directory contains an intact symlink at the top level.
- Subsequent re-runs keep failing identically until the warm-modules dir is manually deleted; `repairWarmCache` will not detect the corruption.
## Suggested fixes

- Per-job warm-modules dirs (key by `repoSlug + lockfileHash + jobId/runnerId`), seeding each from a shared read-only template; this eliminates the race entirely.
- Or take a process-level lock on the warm-modules dir during `pnpm install` so concurrent containers serialize their installs.
- Strengthen `repairWarmCache` to detect partial corruption, e.g. resolve the top-level symlinks/bin shims under `node_modules/.bin/` and confirm they point at real files. The current sentinel-only check is too weak.
- At minimum, document that `--all` requires single-workflow runs when sharing a lockfile, and warn when multiple jobs would mount the same warm-modules path.
## Workarounds

- Run workflows one at a time (`agent-ci run --workflow <file>`).
- After any failure, manually `rm -rf` the warm-modules dir for the repo. Note there are two possible cache roots (`/tmp/agent-ci/cli/cache/warm-modules/<slug>` and `/tmp/agent-ci/agent-ci/cache/warm-modules/<slug>`) depending on which agent-ci binary is invoked; both must be cleared.
- Remove `confirm-modules-purge=false` from `.npmrc` so pnpm fails loudly instead of silently mutating shared state.
## Versions

- `@redwoodjs/agent-ci` 0.12.4
- pnpm 10.30.3
- Node 22.22.2 (in container)