Skip to content

fix(cli/daemon): suppress libp2p JobRecipient.onProgress recursion (#286)#301

Open
Bojan131 wants to merge 3 commits intomainfrom
fix/libp2p-onprogress-recursion-286
Open

fix(cli/daemon): suppress libp2p JobRecipient.onProgress recursion (#286)#301
Bojan131 wants to merge 3 commits intomainfrom
fix/libp2p-onprogress-recursion-286

Conversation

@Bojan131
Copy link
Copy Markdown
Contributor

Summary

Fixes #286. The DKG daemon was crashing with RangeError: Maximum call stack size exceeded originating inside @libp2p/utils's job-queue dispatcher (JobRecipient.onProgress). The recursive emit happens in a microtask outside our await dial() try block, so we cannot catch it at the call site — we have to swallow it at the top-level handler.

The DKG-side trigger is the reconnect-on-gossip dial in dkg-agent.ts::maybeDialGossipSender, which is best-effort by design — losing a single dial attempt is fine; crashing the whole daemon for it is not.

Changes

  • Extract a pure shouldSuppressFatal(err) classifier from the existing inline logic in runDaemonInner's process.on('uncaughtException') / process.on('unhandledRejection') handlers. Same behaviour for the existing GossipSub stream-state errors — just refactored for testability.

  • Add a second suppression clause for issue Daemon crashes repeatedly with 'Maximum call stack size exceeded' on reconnect-on-gossip dial failure #286, matching:

    • err instanceof RangeError, AND
    • message includes "Maximum call stack size exceeded", AND
    • stack contains @libp2p/utils OR JobRecipient.onProgress.

    All three checks together prevent us from accidentally muting a legitimate stack overflow in our own code.

  • Suppressed events log as [warn] Suppressed libp2p JobRecipient.onProgress re-entrancy (issue #286): ... so they remain visible to log readers and on-callers.

Why suppress and not patch upstream?

The recursion is an upstream bug in @libp2p/utils's job queue. Properly fixing it requires changes to libp2p; until that ships, we either crash the daemon or absorb it. This PR is a defensive mitigation, not an upstream fix. Once libp2p ships a real fix and we upgrade, the second clause can be removed.

Test plan

Followed strict TDD:

  1. Reproduce first. Wrote packages/cli/test/daemon-lifecycle-suppress-fatal.test.ts with a test simulating the exact issue Daemon crashes repeatedly with 'Maximum call stack size exceeded' on reconnect-on-gossip dial failure #286 stack — it failed against the old handler, confirming the daemon would crash.

  2. Apply fix, re-run — test passed.

  3. Hardened with edge-case suite to lock the contract in both directions:

    Should-suppress cases (✅):

    • The original bug shape (full libp2p stack)
    • Stack contains only the symbol JobRecipient.onProgress (path source-mapped away)
    • Stack contains only the path @libp2p/utils/... (class renamed in a future libp2p version)
    • libp2p vendored under another package's node_modules (transitive bundling)
    • Real V8-produced RangeError with a libp2p frame grafted in — guards against future Node stack-format changes
    • Error message wrapped with surrounding context (e.g. "agent dial failed: Maximum call stack size exceeded — see traces")
    • Existing GossipSub stream-state errors (regression check)

    Should-NOT-suppress cases (❌):

    • RangeError with the magic message but no .stack at all (can't confirm it's the libp2p bug → let it crash)
    • Plain Error (not RangeError) carrying the magic message + libp2p stack (strict instance check)
    • RangeError: Invalid array length from @libp2p/utils (different message → different bug)
    • RangeError stack only mentions @libp2p/pubsub (only @libp2p/utils is the known recursion site)
    • RangeError with the right message but a stack from our own application code (must surface our own infinite recursion bugs)
    • An outer Error with .cause set to the libp2p RangeError (we don't walk .cause chains — pinned explicitly)
    • Non-Error rejections (string, undefined, plain values)
  4. Regression check. Ran the full daemon-lifecycle-* test suite (22 tests across 2 files) — all pass.

  5. Type check. tsc --noEmit clean.

```
Test Files 2 passed (2)
Tests 22 passed (22)
```

Known residual risks

Documented for transparency:

  • Resource accumulation. Each suppression leaves the libp2p job queue in some indeterminate state — orphaned job state could accumulate over very long uptimes. Acceptable trade-off vs. the daemon crashing every few hours, but worth following up upstream.
  • Log volume. At current frequency (multiple times per hour per the issue), one warn line per event is fine. If frequency spikes, consider rate-limiting.
  • Visibility shift. Anyone grepping for [fatal] will no longer find these — they're now [warn] Suppressed .... Worth noting in any runbook so on-callers know the bug is being absorbed.

Out of scope

  • No upstream @libp2p/utils patch in this PR.
  • No changes to dkg-agent.ts::maybeDialGossipSender — that path is already best-effort and behaves correctly.
  • No changes to logging infrastructure or rate-limiting.

Made with Cursor

)

The DKG daemon was crashing with `RangeError: Maximum call stack size
exceeded` originating inside `@libp2p/utils`'s job-queue dispatcher
(`JobRecipient.onProgress`). The recursive emit happens in a microtask
outside our `await dial()` try block, so we cannot catch it at the call
site — we have to swallow it at the top-level handler.

Extract the existing GossipSub stream-state suppression into a pure
`shouldSuppressFatal()` classifier and add a second clause matching
`RangeError` + the magic message + a stack frame from `@libp2p/utils`
(or the `JobRecipient.onProgress` symbol). Both checks together
prevent us from accidentally swallowing a legitimate stack overflow
in our own code.

Tests pin both directions (matches and non-matches), including a
real-V8-produced RangeError to guard against future Node stack-format
changes, and a `.cause`-chain case to keep the unwrap policy explicit.

Made-with: Cursor
if (
err instanceof RangeError &&
msg.includes('Maximum call stack size exceeded') &&
(stack.includes('@libp2p/utils') ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: stack.includes('@libp2p/utils') is broader than the issue this PR describes. It will suppress any Maximum call stack size exceeded coming from anywhere in @libp2p/utils, not just the known queue/job.js / JobRecipient.onProgress recursion, which means a new upstream stack-overflow bug could be silently swallowed while the daemon keeps running. Tighten the predicate to the specific call site (dist/src/queue/job.js and/or JobRecipient.onProgress) and add a negative test for a different @libp2p/utils stack-overflow frame.

Adds a three-part proof harness for the libp2p onProgress recursion
already suppressed in 8930b3f, so future contributors can verify the
matcher still recognises what V8 actually emits — not just hand-built
stack strings.

Why this exists: the original suppression test pins the matcher
against synthetic stacks. After a co-worker shared sanitized log
evidence (79 fatal crashes with `JobRecipient.onProgress` frames) we
confirmed the bug surfaces in `@libp2p/utils@7.1.x` (our `^7.0.10`
range resolves there on fresh installs). This file proves the bug is
real and the fix actually catches it.

PROOF 1 — builds two `Queue`s wired into a progress-dispatch cycle
using the real upstream package (installed under the alias
`@libp2p-utils-issue-286-repro` so the workspace's locked 7.0.10
install is left alone). Triggers the recursion and asserts a real V8
`RangeError` whose stack mentions both `@libp2p/utils` and
`JobRecipient.onProgress` — the exact frames in the issue's log.

PROOF 2 — routes that real error through a faithful copy of
`runDaemonInner`'s `uncaughtException` handler block (with
`process.exit` mocked) and asserts the daemon would NOT exit and
would log the warn line with the right category. Imports the
production `shouldSuppressFatal` directly, so commenting out the
libp2p clause in `lifecycle.ts` makes this test fail immediately —
verified by toggling.

PROOF 3 — spawns a real `node` child process that installs
`process.on('uncaughtException')` with the matcher inlined, builds
the recursion, kicks it via `setImmediate` so it surfaces as a
genuine uncaught exception, and exits 0 only if the handler kept the
process alive. Captures stdout/stderr and asserts no `RangeError`
escaped.

The aliased dep is dev-only and only loaded by this test file, so
production bundles are unaffected.

Made-with: Cursor
@Bojan131
Copy link
Copy Markdown
Contributor Author

Update — real reproduction + E2E proof added (commit 70ff99ca)

Following the sanitized log evidence from @TomazOT (79 fatal crashes with
JobRecipient.onProgress frames), I built a three-part proof harness so
we can be sure the bug is reproducible and the matcher actually catches
what V8 emits — not just hand-built stack strings.

Why the original suppression test wasn't enough

daemon-lifecycle-suppress-fatal.test.ts pins shouldSuppressFatal
against synthetic stack strings. That covers the matcher's logic but
doesn't prove (a) the bug is reproducible at all, (b) the matcher
recognises what V8 actually produces, or (c) a real Node process
survives end-to-end.

Why the bug surfaces upstream

@libp2p/utils@7.1.x adds a synthesised per-job onProgress
dispatcher in Job.run():

onProgress: (evt) => {
  this.recipients.forEach(recipient => {
    recipient.onProgress?.(evt);
  });
}

When two such dispatchers end up wired into each other's recipient
slots — directly or indirectly through DialQueue.dial dedup +
nested progress propagation — a single progress event recurses
unboundedly. Our libp2p dep's ^7.0.10 semver range resolves to
7.1.x on fresh installs, which is why TomazOT's daemon hits this and
ours might too over time.

The three new proofs

packages/cli/test/issue-286-real-repro.test.ts:

  1. Reproduces the bug. Builds two Queues wired into a
    progress-dispatch cycle using the real upstream package
    (installed under the alias @libp2p-utils-issue-286-repro so the
    workspace's locked 7.0.10 install is left alone). Triggers the
    recursion; asserts a real V8 RangeError whose stack mentions
    both @libp2p/utils and JobRecipient.onProgress — the exact
    frames in the issue's log.
  2. Routes the real error through the daemon's actual handler
    logic.
    Imports the production shouldSuppressFatal, mirrors
    the runDaemonInner uncaughtException block (with process.exit
    mocked), asserts exited === false and the warn-line is logged
    with the right category. This is the load-bearing test —
    commenting out the libp2p clause in lifecycle.ts makes it fail
    immediately (verified by toggling).
  3. Spawns a real Node child process that installs
    process.on('uncaughtException') with the matcher inlined, builds
    the recursion, kicks it via setImmediate so it surfaces as a
    genuine uncaught exception, and exits 0 only if the handler kept
    the process alive. Captures stdout/stderr and asserts no
    RangeError escaped.

Stack comparison

OUR REPRO (real V8 output)                   TOMAZOT'S LOG
──────────────────────────                   ─────────────
RangeError: Maximum call stack size...       RangeError: Maximum call stack size...
  at Array.forEach (<anonymous>)               at JobRecipient.onProgress (job.js:59:29)
  at onProgress (job.js:60:37)                 at job.js:61:47
  at JobRecipient.onProgress (test:92:39)      at Array.forEach (<anonymous>)
  at job.js:61:47                              at JobRecipient.onProgress (job.js:60:37)
  at Array.forEach (<anonymous>)               at job.js:61:47
  ... (repeats) ...                            ... (repeats) ...

Same package, same file, same lines (60–61), same JobRecipient.onProgress
frame name, same Array.forEach interleave, same recursive shape.

Test run

✓ test/issue-286-real-repro.test.ts (3 tests) 311ms
   ✓ PROOF 1/3: reproduces a real RangeError with real @libp2p/utils stack frames
   ✓ PROOF 2/3: routed through the daemon's uncaughtException handler logic, the fix keeps the daemon alive
   ✓ PROOF 3/3: spawns a real Node child process that triggers the bug ... and survives
✓ test/daemon-lifecycle-suppress-fatal.test.ts (19 tests) 4ms
Tests  22 passed (22)

Footprint

  • @libp2p-utils-issue-286-repro is a dev-only npm alias for
    @libp2p/utils@7.1.0, only loaded by this one test file —
    production bundles are unaffected.
  • No production code changed in this commit; it's purely additional
    test coverage on top of the original fix.

Comment thread packages/cli/package.json
"typescript": "^5.7"
},
"devDependencies": {
"@libp2p-utils-issue-286-repro": "npm:@libp2p/utils@7.1.0",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: @libp2p-utils-issue-286-repro is not a valid bare package name. Node rejects import '@libp2p-utils-issue-286-repro' with ERR_INVALID_MODULE_SPECIFIER, so the new repro test will fail before it even runs. Use a valid alias name here (for example libp2p-utils-issue-286-repro or a real scoped name like @scope/libp2p-utils-issue-286-repro) and update the test imports/paths to match.

err instanceof RangeError &&
msg.includes('Maximum call stack size exceeded') &&
(stack.includes('@libp2p/utils') ||
stack.includes('@libp2p-utils-issue-286-repro') ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this child-process harness is broader than the production matcher in lifecycle.ts because it suppresses stacks containing @libp2p-utils-issue-286-repro, but the real daemon never checks for that alias. That means PROOF 3/3 can go green even if production would still exit on the same error. Keep the predicate byte-for-byte identical to shouldSuppressFatal, or import the real helper into the harness instead of widening it here.

import { fileURLToPath } from 'node:url';
import { dirname } from 'node:path';

import { Queue } from '@libp2p-utils-issue-286-repro';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: @libp2p-utils-issue-286-repro is not a valid Node package specifier, so this import fails before the test body runs (node -e "import('@libp2p-utils-issue-286-repro')" throws ERR_INVALID_MODULE_SPECIFIER). Rename the alias to a valid package name and update the dependency/path references accordingly.

err instanceof RangeError &&
msg.includes('Maximum call stack size exceeded') &&
(stack.includes('@libp2p/utils') ||
stack.includes('@libp2p-utils-issue-286-repro') ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: the child-process harness widens the matcher with @libp2p-utils-issue-286-repro, but lifecycle.ts::shouldSuppressFatal does not have that branch. This means the supposed end-to-end proof can pass even when the real daemon would still exit on the same stack shape. Keep the harness on the exact production predicate, or import the real helper into the child process instead of adding alias-only matching here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Daemon crashes repeatedly with 'Maximum call stack size exceeded' on reconnect-on-gossip dial failure

1 participant