
Warm-modules cache races between concurrent jobs corrupt node_modules #321

@peterp

Problem

When running multiple workflows concurrently (e.g. agent-ci run --all), parallel jobs that share a lockfile end up bind-mounting the same host directory at /home/runner/_work/<repo>/<repo>/node_modules. Each container then runs its own pnpm install --frozen-lockfile against that shared mount, racing each other. The result is a corrupted node_modules. repairWarmCache then keeps treating that corrupted state as "warm" on every subsequent run, so the failure is sticky — it does not self-heal.

In our case the failure surfaces as:

> machinen@1.0.0 format:check /home/runner/_work/machinen/machinen
> oxfmt --check
node:internal/modules/cjs/loader:1386
throw err;
^
Error: Cannot find module '/home/runner/_work/machinen/machinen/node_modules/oxfmt/bin/oxfmt'

Inspecting the host warm-modules cache after the failure:

  • node_modules/oxfmt → symlink to .pnpm/oxfmt@0.33.0/node_modules/oxfmt (intact).
  • .pnpm/oxfmt@0.33.0/node_modules/oxfmt/bin/ → exists as an empty directory. The oxfmt bin file inside it had been deleted by a concurrent pnpm install mid-flight, but the parent dir and the top-level symlink survived.
  • .modules.yaml sentinel is still present, so repairWarmCache returns "warm" and reuses the broken cache on every subsequent run.
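
The state listed above (sentinel present, symlink intact, bin target gone) can be detected mechanically. The following is a minimal sketch, not agent-ci code: findMissingBins and listTopLevelPackages are hypothetical helpers, and the approach of reading each package's bin field from package.json is an assumption about what is simplest to check, not a description of how repairWarmCache works.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper (not part of agent-ci): returns every declared bin
// target that is missing from a node_modules tree.
function findMissingBins(nodeModulesDir) {
  const missing = [];
  for (const pkgDir of listTopLevelPackages(nodeModulesDir)) {
    let manifest;
    try {
      // Reading through pkgDir follows the pnpm symlink into .pnpm/...
      manifest = JSON.parse(fs.readFileSync(path.join(pkgDir, "package.json"), "utf8"));
    } catch {
      // A package without a readable manifest is treated as corrupted.
      missing.push(path.join(pkgDir, "package.json"));
      continue;
    }
    // "bin" may be a single string or a { name: relativePath } map.
    const bins =
      typeof manifest.bin === "string" ? [manifest.bin] : Object.values(manifest.bin ?? {});
    for (const rel of bins) {
      const target = path.join(pkgDir, rel);
      // statSync follows symlinks, so an intact link to a deleted file fails here.
      const stat = fs.statSync(target, { throwIfNoEntry: false });
      if (!stat || !stat.isFile()) missing.push(target);
    }
  }
  return missing;
}

// Lists package directories directly under node_modules, expanding @scope dirs.
function listTopLevelPackages(nodeModulesDir) {
  const dirs = [];
  for (const name of fs.readdirSync(nodeModulesDir)) {
    if (name.startsWith(".")) continue; // skip .bin, .pnpm, .modules.yaml
    const full = path.join(nodeModulesDir, name);
    if (name.startsWith("@")) {
      for (const scoped of fs.readdirSync(full)) dirs.push(path.join(full, scoped));
    } else {
      dirs.push(full);
    }
  }
  return dirs;
}
```

A non-empty result would be a reason to treat the cache as cold even when .modules.yaml exists. In the case above it would flag node_modules/oxfmt/bin/oxfmt, provided the package declares that path in its bin field.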

Where

  • dist/runner/directory-setup.js:52 keys warmModulesDir by repo + lockfile hash only — no per-job/per-runner suffix:
    const warmModulesDir = path.resolve(workDir, "cache", "warm-modules", repoSlug, lockfileHash);
  • dist/docker/container-config.js:92 bind-mounts that single dir into every container:
    `${h(warmModulesDir)}:/home/runner/_work/${repoName}/${repoName}/node_modules`,
  • dist/cli.js:1168 serializes only the first job of the first wave per workflow when the cache is cold; cross-workflow concurrency (and warm-cache concurrency) is unprotected.
  • repairWarmCache in dist/output/cleanup.js checks only for sentinel entries (.modules.yaml / .package-lock.json / .yarn-integrity / .cache). A partially-deleted node_modules with the sentinel intact is incorrectly flagged "warm" forever.
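
To make the collision concrete, here is a small sketch of the path computation. sharedWarmModulesDir reproduces the expression quoted from directory-setup.js; perJobWarmModulesDir, the job ids, the slug, and the hash value are all hypothetical and only illustrate one way a per-job suffix could look.

```javascript
const path = require("node:path");

// Mirrors the key quoted from directory-setup.js: repo + lockfile hash only.
function sharedWarmModulesDir(workDir, repoSlug, lockfileHash) {
  return path.resolve(workDir, "cache", "warm-modules", repoSlug, lockfileHash);
}

// Hypothetical variant: append a per-job component so mounts never overlap.
function perJobWarmModulesDir(workDir, repoSlug, lockfileHash, jobId) {
  return path.resolve(workDir, "cache", "warm-modules", repoSlug, lockfileHash, jobId);
}

const workDir = "/tmp/agent-ci/cli";
const jobs = ["lint", "test"]; // hypothetical job ids that share one lockfile
const shared = jobs.map(() => sharedWarmModulesDir(workDir, "machinen", "abc123"));
const perJob = jobs.map((id) => perJobWarmModulesDir(workDir, "machinen", "abc123", id));

console.log(new Set(shared).size); // 1: both containers mount the same host dir
console.log(new Set(perJob).size); // 2: each container gets its own dir
```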

Reproduce

  1. Set up a repo with multiple workflow files that each run pnpm install --frozen-lockfile and then a script that execs a hoisted bin (e.g. oxfmt --check).
  2. Run npx agent-ci run --all -q -p.
  3. Intermittently, one of the jobs fails with Cannot find module .../node_modules/<bin-pkg>/bin/<bin> even though the host warm-modules directory contains an intact symlink at the top level.
  4. Subsequent re-runs keep failing identically until the warm-modules dir is manually deleted — repairWarmCache will not detect the corruption.

Suggested fixes

  • Per-job warm-modules dirs (key by repoSlug + lockfileHash + jobId/runnerId), and seed each from a shared read-only template — eliminates the race entirely.
  • Or take a process-level lock on the warm-modules dir during pnpm install so concurrent containers serialize their installs.
  • Strengthen repairWarmCache to detect partial corruption — e.g. resolve the top-level symlinks/bin shims under node_modules/.bin/ and confirm they point at real files. The current sentinel-only check is too weak.
  • At minimum, document that --all is unsafe when workflows share a lockfile (such workflows need to be run one at a time), and warn when multiple jobs would mount the same warm-modules path.
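
As an illustration of the locking option, the sketch below serializes callers with a lock directory, relying on the fact that a non-recursive mkdir either creates the directory or fails with EEXIST. withDirLock, its options, and the lock path are hypothetical names, not agent-ci APIs. It assumes the host process takes the lock around each container's install step, and it does not recover stale locks left behind by a crashed process.

```javascript
const fs = require("node:fs");
const { setTimeout: sleep } = require("node:timers/promises");

// Hypothetical helper: runs fn while holding an exclusive lock directory.
async function withDirLock(lockDir, fn, { pollMs = 100, timeoutMs = 10 * 60 * 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      fs.mkdirSync(lockDir); // atomic: exactly one caller succeeds
      break;
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
      if (Date.now() > deadline) throw new Error(`timed out waiting for ${lockDir}`);
      await sleep(pollMs); // another install holds the lock; retry later
    }
  }
  try {
    return await fn();
  } finally {
    fs.rmdirSync(lockDir); // release even if fn throws
  }
}
```

Usage would look something like withDirLock(warmModulesDir + ".lock", () => runInstallInContainer(job)), where runInstallInContainer stands in for whatever starts pnpm install. Placing the lock directory next to the warm-modules dir, rather than inside it, keeps it out of the bind mount.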

Workarounds

  • Run workflows one at a time (agent-ci run --workflow <file>).
  • After any failure, manually rm -rf the warm-modules dir for the repo. Note there are two possible cache roots (/tmp/agent-ci/cli/cache/warm-modules/<slug> and /tmp/agent-ci/agent-ci/cache/warm-modules/<slug>) depending on which agent-ci binary is invoked — both must be cleared.
  • Remove confirm-modules-purge=false from .npmrc so pnpm fails loudly instead of silently mutating shared state.
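
The manual cleanup can be scripted so that neither cache root is forgotten. This is a minimal sketch: clearWarmModules is a hypothetical helper, and the two roots are copied from the workaround above and may differ on other setups.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// The two cache roots named in the workaround above.
const CACHE_ROOTS = [
  "/tmp/agent-ci/cli/cache/warm-modules",
  "/tmp/agent-ci/agent-ci/cache/warm-modules",
];

// Hypothetical helper: removes the warm-modules dir for one repo slug under
// every root and returns the paths that were actually deleted.
function clearWarmModules(slug, roots = CACHE_ROOTS) {
  if (!slug) throw new Error("slug is required"); // never delete a whole root
  const removed = [];
  for (const root of roots) {
    const dir = path.join(root, slug);
    if (!fs.existsSync(dir)) continue;
    fs.rmSync(dir, { recursive: true, force: true });
    removed.push(dir);
  }
  return removed;
}
```

Calling clearWarmModules("<slug>") with the repo's slug removes every lockfile-hash subdirectory for that repo, so the next run starts cold.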

Versions

  • @redwoodjs/agent-ci 0.12.4
  • pnpm 10.30.3
  • Node 22.22.2 (in container)
