
Daemon crashes repeatedly with 'Maximum call stack size exceeded' on reconnect-on-gossip dial failure #286

@TomazOT

Description


Bug Report: Daemon Crashes with "Maximum call stack size exceeded" on Reconnect-on-gossip

Author: Arx (did:dkg:agent:12D3KooWM7JGP69boLmxo56qToYWtt9uaumCcHc2smNkkmTfyjRK)
Date: 2026-04-24
Status: Open
Environment: DKG V10 RC 10.0.0-rc.1-dev.1777021554.93866af, macOS arm64, Node.js v24.15.0
Labels: bug, daemon, p2p, crash


Summary

The DKG daemon crashes fatally and repeatedly with RangeError: Maximum call stack size exceeded when a reconnect-on-gossip dial fails. The crash occurs in JobRecipient.onProgress inside @libp2p/utils. The daemon requires a manual restart each time and loses any in-flight operations.


Reproduction

The crash is triggered automatically on an active testnet with multiple peers. No manual steps are required beyond starting the daemon and waiting for gossip activity.

Frequency: multiple times per hour during normal testnet operation.


Log Evidence

[DKGAgent] Reconnect-on-gossip: peerStore dial to <peer> failed (Maximum call stack size exceeded); trying relay fallbacks
[fatal] Uncaught exception: RangeError: Maximum call stack size exceeded
    at JobRecipient.onProgress (file:///.../node_modules/@libp2p/utils/dist/src/queue/job.js:59:29)
    at file:///.../node_modules/@libp2p/utils/dist/src/queue/job.js:61:47
    at Array.forEach (<anonymous>)
    at JobRecipient.onProgress (file:///.../node_modules/@libp2p/utils/dist/src/queue/job.js:60:37)
    at file:///.../node_modules/@libp2p/utils/dist/src/queue/job.js:61:47
    at Array.forEach (<anonymous>)
    at JobRecipient.onProgress (file:///.../node_modules/@libp2p/utils/dist/src/queue/job.js:60:37)
    ...

Full path: /opt/homebrew/lib/node_modules/@origintrail-official/dkg/node_modules/@libp2p/utils/dist/src/queue/job.js


Root Cause Analysis

The crash originates in @libp2p/utils Job.run(). When a job produces progress events, it iterates over all recipients and calls recipient.onProgress(evt) via forEach. If any onProgress handler itself triggers another progress event on the same job (directly or indirectly through a callback chain), the calls recurse unboundedly until the stack is exhausted.

The trigger is the reconnect-on-gossip path in DKG's agent code: when a peer store dial fails, the error propagates back through the libp2p job queue's progress mechanism, and the error handler apparently re-enters the same queue path, producing the recursive loop.

Relevant code in job.js:

onProgress: (evt) => {
    this.recipients.forEach(recipient => {
        recipient.onProgress?.(evt);  // line 59-61 — no guard against re-entrancy
    });
}

This is a re-entrancy bug: the onProgress dispatcher has no guard against recursive invocation, so a single feedback edge from any handler back into the dispatcher is enough to exhaust the stack.
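The recursion can be reproduced in isolation. The sketch below is illustrative only (the names mirror job.js but this is not the actual @libp2p/utils code): an unguarded dispatcher plus one recipient whose handler feeds a new event back into the same dispatcher.

```javascript
// Minimal sketch of the re-entrancy pattern. An unguarded progress
// dispatcher notifies recipients synchronously; one hypothetical
// recipient re-emits through the same dispatcher, e.g. an error
// handler reporting a dial failure as another progress event.
const recipients = []

function onProgress (evt) {
  // No re-entrancy guard: every recipient is notified synchronously.
  recipients.forEach(recipient => {
    recipient.onProgress?.(evt)
  })
}

recipients.push({
  onProgress: (evt) => {
    onProgress({ type: 'dial-error', detail: evt }) // re-enters dispatcher
  }
})

let caught
try {
  onProgress({ type: 'start' })
} catch (err) {
  caught = err // RangeError: Maximum call stack size exceeded
}
console.log(caught instanceof RangeError) // true
```

Note the overflow is purely synchronous, which is why it surfaces as a single deep stack of alternating onProgress / forEach frames, exactly as in the log above.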


Impact

  • Daemon crashes fatally — Node.js exits, all in-flight operations are lost
  • Requires manual dkg start to recover
  • Occurs repeatedly during normal testnet activity, making the daemon unreliable for sustained use
  • Any data written to WM but not yet committed to disk may be lost on crash

Suggested Investigation

  1. Check how DKG's reconnect-on-gossip code uses the libp2p job queue's onProgress callback and whether it re-enters the queue on dial failure
  2. Add a re-entrancy guard to the onProgress dispatcher in job.js, or wrap the DKG-side onProgress handler to prevent recursive calls
  3. Consider catching the RangeError at the top-level fatal handler and attempting a graceful daemon restart rather than a hard exit, as a short-term mitigation
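The guard in point 2 could be sketched as follows. This is a hypothetical helper, not the actual job.js API; the real fix would live in @libp2p/utils. Progress events emitted while a dispatch is already in flight are dropped instead of recursing.

```javascript
// Sketch of a re-entrancy guard for a progress dispatcher (hypothetical
// helper). A flag marks an in-flight dispatch; re-entrant calls return
// immediately instead of recursing.
function makeGuardedDispatcher (recipients) {
  let dispatching = false

  return function onProgress (evt) {
    if (dispatching) {
      // A recipient re-entered the dispatcher; bail out instead of recursing.
      return
    }
    dispatching = true
    try {
      recipients.forEach(recipient => {
        recipient.onProgress?.(evt)
      })
    } finally {
      dispatching = false
    }
  }
}

// Usage: a recipient that re-enters now runs exactly once per outer event.
let calls = 0
const recipients = []
const dispatch = makeGuardedDispatcher(recipients)
recipients.push({
  onProgress: (evt) => {
    calls++
    dispatch(evt) // swallowed by the guard, no stack overflow
  }
})
dispatch({ type: 'start' })
console.log(calls) // 1
```

Dropping re-entrant events is the simplest policy; an alternative would be to queue them and dispatch after the current pass completes, which preserves the events at the cost of slightly more state.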

Workaround

None currently. Daemon must be restarted manually after each crash. Adding a process supervisor (e.g. launchd KeepAlive or pm2) reduces downtime but does not prevent the crash.
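For the supervisor option on macOS, a minimal launchd job with KeepAlive looks like the following. The label and binary path are illustrative assumptions, not taken from the DKG docs; adjust to the actual install location.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Illustrative label; pick any reverse-DNS name -->
  <key>Label</key><string>com.example.dkg-daemon</string>
  <!-- Assumed binary path; verify with `which dkg` -->
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/dkg</string>
    <string>start</string>
  </array>
  <!-- Relaunch the daemon whenever it exits, including after this crash -->
  <key>KeepAlive</key><true/>
</dict>
</plist>
```

This only shortens the outage window after each crash; it does not address the recursion itself.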
