fix(cli/daemon): suppress libp2p JobRecipient.onProgress recursion (#286)#301
fix(cli/daemon): suppress libp2p JobRecipient.onProgress recursion (#286)#301
Conversation
) The DKG daemon was crashing with `RangeError: Maximum call stack size exceeded` originating inside `@libp2p/utils`'s job-queue dispatcher (`JobRecipient.onProgress`). The recursive emit happens in a microtask outside our `await dial()` try block, so we cannot catch it at the call site — we have to swallow it at the top-level handler. Extract the existing GossipSub stream-state suppression into a pure `shouldSuppressFatal()` classifier and add a second clause matching `RangeError` + the magic message + a stack frame from `@libp2p/utils` (or the `JobRecipient.onProgress` symbol). Both checks together prevent us from accidentally swallowing a legitimate stack overflow in our own code. Tests pin both directions (matches and non-matches), including a real-V8-produced RangeError to guard against future Node stack-format changes, and a `.cause`-chain case to keep the unwrap policy explicit. Made-with: Cursor
| if ( | ||
| err instanceof RangeError && | ||
| msg.includes('Maximum call stack size exceeded') && | ||
| (stack.includes('@libp2p/utils') || |
There was a problem hiding this comment.
🔴 Bug: stack.includes('@libp2p/utils') is broader than the issue this PR describes. It will suppress any Maximum call stack size exceeded coming from anywhere in @libp2p/utils, not just the known queue/job.js / JobRecipient.onProgress recursion, which means a new upstream stack-overflow bug could be silently swallowed while the daemon keeps running. Tighten the predicate to the specific call site (dist/src/queue/job.js and/or JobRecipient.onProgress) and add a negative test for a different @libp2p/utils stack-overflow frame.
Adds a three-part proof harness for the libp2p onProgress recursion already suppressed in 8930b3f, so future contributors can verify the matcher still recognises what V8 actually emits — not just hand-built stack strings. Why this exists: the original suppression test pins the matcher against synthetic stacks. After a co-worker shared sanitized log evidence (79 fatal crashes with `JobRecipient.onProgress` frames) we confirmed the bug surfaces in `@libp2p/utils@7.1.x` (our `^7.0.10` range resolves there on fresh installs). This file proves the bug is real and the fix actually catches it. PROOF 1 — builds two `Queue`s wired into a progress-dispatch cycle using the real upstream package (installed under the alias `@libp2p-utils-issue-286-repro` so the workspace's locked 7.0.10 install is left alone). Triggers the recursion and asserts a real V8 `RangeError` whose stack mentions both `@libp2p/utils` and `JobRecipient.onProgress` — the exact frames in the issue's log. PROOF 2 — routes that real error through a faithful copy of `runDaemonInner`'s `uncaughtException` handler block (with `process.exit` mocked) and asserts the daemon would NOT exit and would log the warn line with the right category. Imports the production `shouldSuppressFatal` directly, so commenting out the libp2p clause in `lifecycle.ts` makes this test fail immediately — verified by toggling. PROOF 3 — spawns a real `node` child process that installs `process.on('uncaughtException')` with the matcher inlined, builds the recursion, kicks it via `setImmediate` so it surfaces as a genuine uncaught exception, and exits 0 only if the handler kept the process alive. Captures stdout/stderr and asserts no `RangeError` escaped. The aliased dep is dev-only and only loaded by this test file, so production bundles are unaffected. Made-with: Cursor
Update — real reproduction + E2E proof added (commit
|
| "typescript": "^5.7" | ||
| }, | ||
| "devDependencies": { | ||
| "@libp2p-utils-issue-286-repro": "npm:@libp2p/utils@7.1.0", |
There was a problem hiding this comment.
🔴 Bug: @libp2p-utils-issue-286-repro is not a valid bare package name. Node rejects import '@libp2p-utils-issue-286-repro' with ERR_INVALID_MODULE_SPECIFIER, so the new repro test will fail before it even runs. Use a valid alias name here (for example libp2p-utils-issue-286-repro or a real scoped name like @scope/libp2p-utils-issue-286-repro) and update the test imports/paths to match.
| err instanceof RangeError && | ||
| msg.includes('Maximum call stack size exceeded') && | ||
| (stack.includes('@libp2p/utils') || | ||
| stack.includes('@libp2p-utils-issue-286-repro') || |
There was a problem hiding this comment.
🔴 Bug: this child-process harness is broader than the production matcher in lifecycle.ts because it suppresses stacks containing @libp2p-utils-issue-286-repro, but the real daemon never checks for that alias. That means PROOF 3/3 can go green even if production would still exit on the same error. Keep the predicate byte-for-byte identical to shouldSuppressFatal, or import the real helper into the harness instead of widening it here.
| import { fileURLToPath } from 'node:url'; | ||
| import { dirname } from 'node:path'; | ||
|
|
||
| import { Queue } from '@libp2p-utils-issue-286-repro'; |
There was a problem hiding this comment.
🔴 Bug: @libp2p-utils-issue-286-repro is not a valid Node package specifier, so this import fails before the test body runs (node -e "import('@libp2p-utils-issue-286-repro')" throws ERR_INVALID_MODULE_SPECIFIER). Rename the alias to a valid package name and update the dependency/path references accordingly.
| err instanceof RangeError && | ||
| msg.includes('Maximum call stack size exceeded') && | ||
| (stack.includes('@libp2p/utils') || | ||
| stack.includes('@libp2p-utils-issue-286-repro') || |
There was a problem hiding this comment.
🟡 Issue: the child-process harness widens the matcher with @libp2p-utils-issue-286-repro, but lifecycle.ts::shouldSuppressFatal does not have that branch. This means the supposed end-to-end proof can pass even when the real daemon would still exit on the same stack shape. Keep the harness on the exact production predicate, or import the real helper into the child process instead of adding alias-only matching here.
Summary
Fixes #286. The DKG daemon was crashing with
RangeError: Maximum call stack size exceededoriginating inside@libp2p/utils's job-queue dispatcher (JobRecipient.onProgress). The recursive emit happens in a microtask outside ourawait dial()try block, so we cannot catch it at the call site — we have to swallow it at the top-level handler.The DKG-side trigger is the reconnect-on-gossip dial in
dkg-agent.ts::maybeDialGossipSender, which is best-effort by design — losing a single dial attempt is fine; crashing the whole daemon for it is not.Changes
Extract a pure
shouldSuppressFatal(err)classifier from the existing inline logic inrunDaemonInner'sprocess.on('uncaughtException')/process.on('unhandledRejection')handlers. Same behaviour for the existing GossipSub stream-state errors — just refactored for testability.Add a second suppression clause for issue Daemon crashes repeatedly with 'Maximum call stack size exceeded' on reconnect-on-gossip dial failure #286, matching:
err instanceof RangeError, AND"Maximum call stack size exceeded", AND@libp2p/utilsORJobRecipient.onProgress.All three checks together prevent us from accidentally muting a legitimate stack overflow in our own code.
Suppressed events log as
[warn] Suppressed libp2p JobRecipient.onProgress re-entrancy (issue #286): ...so they remain visible to log readers and on-callers.Why suppress and not patch upstream?
The recursion is an upstream bug in
@libp2p/utils's job queue. Properly fixing it requires changes to libp2p; until that ships, we either crash the daemon or absorb it. This PR is a defensive mitigation, not an upstream fix. Once libp2p ships a real fix and we upgrade, the second clause can be removed.Test plan
Followed strict TDD:
Reproduce first. Wrote
packages/cli/test/daemon-lifecycle-suppress-fatal.test.tswith a test simulating the exact issue Daemon crashes repeatedly with 'Maximum call stack size exceeded' on reconnect-on-gossip dial failure #286 stack — it failed against the old handler, confirming the daemon would crash.Apply fix, re-run — test passed.
Hardened with edge-case suite to lock the contract in both directions:
Should-suppress cases (✅):
JobRecipient.onProgress(path source-mapped away)@libp2p/utils/...(class renamed in a future libp2p version)node_modules(transitive bundling)"agent dial failed: Maximum call stack size exceeded — see traces")Should-NOT-suppress cases (❌):
.stackat all (can't confirm it's the libp2p bug → let it crash)Error(notRangeError) carrying the magic message + libp2p stack (strict instance check)RangeError: Invalid array lengthfrom@libp2p/utils(different message → different bug)@libp2p/pubsub(only@libp2p/utilsis the known recursion site)Errorwith.causeset to the libp2p RangeError (we don't walk.causechains — pinned explicitly)Regression check. Ran the full
daemon-lifecycle-*test suite (22 tests across 2 files) — all pass.Type check.
tsc --noEmitclean.```
Test Files 2 passed (2)
Tests 22 passed (22)
```
Known residual risks
Documented for transparency:
[fatal]will no longer find these — they're now[warn] Suppressed .... Worth noting in any runbook so on-callers know the bug is being absorbed.Out of scope
@libp2p/utilspatch in this PR.dkg-agent.ts::maybeDialGossipSender— that path is already best-effort and behaves correctly.Made with Cursor