False "same panId or extendedPanId already exists nearby" startup failure after recovery/firmware events in multi-coordinator Home Assistant deployments #1768

@335if30-debug

Summary

We are seeing repeatable Zigbee2MQTT startup failures with:

network commissioning timed out - most likely network with the same panId or extendedPanId already exists nearby
(Error: AREQ - ZDO - stateChangeInd after 60000ms)

This has occurred in two independent Home Assistant households with multiple Zigbee2MQTT instances/coordinators. The symptom does not look like a second coordinator intentionally running with the same PAN/ExtPAN. Instead, zigbee-herdsman/Z-Stack appears to enter a commissioning path in which the coordinator treats either its own existing mesh routers or stale adapter/NVRAM/backup state as a foreign PAN/ExtPAN collision.

The practical impact is severe: a previously working mesh refuses to start, and the error points users toward changing PAN/network identity, which can require mass re-pairing. In our recoveries, preserving pan_id, ext_pan_id, and network_key and temporarily changing only the channel restored the original mesh.

Why this may be structural

Multiple Zigbee2MQTT instances/coordinators are common in larger Home Assistant deployments. The current startup/restore diagnostics appear fragile when:

  • multiple Zigbee2MQTT add-ons/coordinators exist in the same HA environment,
  • adapters are SLZB/TCP based,
  • firmware/core/radio was updated or rolled back,
  • coordinator_backup.json, adapter NVRAM, and configuration.yaml temporarily disagree,
  • routers from the previous/own mesh are still powered and beaconing.

In this state, the failure is reported as if an external same-PAN network exists. But the successful recoveries kept the same PAN/ExtPAN/key and restored the existing devices, suggesting the network identity itself was valid.

Live sanitized evidence from one environment

Read-only inspection after recovery shows that config and backup currently agree on the critical identity:

/config/zigbee2mqtttt (active/recovered instance)

  • base topic: zigbee2mqtttt
  • serial adapter: zstack, port: <redacted private TCP SLZB address>
  • channel: 15
  • PAN config↔backup match: True
  • ExtPAN config↔backup match: True
  • channel config↔backup match: True
  • network key fingerprint config↔backup match: True
  • historical relevant log hits: 7
  • example log: [2026-05-30 12:51:55] error: z2m: Error: network commissioning timed out - most likely network with the same panId or extendedPanId already exists nearby (Error: AREQ - ZDO - stateChangeInd after 60000ms)

/config/zigbee2mqttttt (active/recovered instance)

  • base topic: zigbee2mqttttt
  • serial adapter: zstack, port: <redacted private TCP SLZB address>
  • channel: 20
  • PAN config↔backup match: True
  • ExtPAN config↔backup match: True
  • channel config↔backup match: True
  • network key fingerprint config↔backup match: True
  • historical relevant log hits: 5
  • example log: [2026-05-30 11:48:36] error: z2m: Error: network commissioning timed out - most likely network with the same panId or extendedPanId already exists nearby (Error: AREQ - ZDO - stateChangeInd after 60000ms)

/config/zigbeeEG2 (stale/manual instance, included because it exists in the same multi-instance environment)

  • base topic: ZigbeeEG2
  • serial adapter: zstack, port: <redacted private TCP SLZB address>
  • channel: 11
  • PAN config↔backup match: True
  • ExtPAN config↔backup match: True
  • channel config↔backup match: True
  • network key fingerprint config↔backup match: True
  • historical relevant log hits: 0
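
The config↔backup checks listed above can be reproduced with a small read-only comparison. The following is a minimal sketch, not the script used to produce the evidence: it assumes the values from configuration.yaml and coordinator_backup.json have already been loaded and normalized into plain dicts with the hypothetical keys `pan_id`, `ext_pan_id`, `channel`, and `network_key` (the real files use different layouts and encodings). The network key is compared through a truncated SHA-256 fingerprint so it never has to be printed.

```python
import hashlib

# Hypothetical normalized field names; the real files need their own parsing.
IDENTITY_FIELDS = ("pan_id", "ext_pan_id", "channel")


def key_fingerprint(network_key: bytes) -> str:
    """Return a short SHA-256 fingerprint so the key itself is never logged."""
    return hashlib.sha256(network_key).hexdigest()[:12]


def compare_identity(config: dict, backup: dict) -> dict:
    """Report, per field, whether config and backup agree."""
    result = {field: config[field] == backup[field] for field in IDENTITY_FIELDS}
    result["network_key_fingerprint"] = (
        key_fingerprint(config["network_key"])
        == key_fingerprint(backup["network_key"])
    )
    return result
```

Calling `compare_identity(config, backup)` returns one boolean per field, which maps directly onto the "match: True" lines in the evidence above.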

Observed recovery patterns

Case A: Household 1, one Z2M instance

  • Original channel: 20
  • Error: same as above (network commissioning timed out... same panId or extendedPanId...)
  • Recovery that worked:
    1. stop Z2M cleanly
    2. back up configuration.yaml, coordinator_backup.json, database.db
    3. temporarily change only channel 20 -> 15
    4. start Z2M and let it generate/repair coordinator backup
    5. stop Z2M
    6. set config and backup back to channel 20
    7. verify pan_id, ext_pan_id, network_key unchanged
    8. restore pre-wiggle database.db
    9. start Z2M
  • Result: expected mesh/device inventory returned; no mass re-pairing.
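
The identity-preserving part of this procedure (steps 3, 6, and 7) can be sketched as below. This is an illustration under assumptions, not a recovery tool: the network settings are modeled as a plain dict with hypothetical keys, and stopping Z2M, backing up files, and restoring database.db are left out. The range check reflects the 2.4 GHz Zigbee channels 11 to 26.

```python
import copy

VALID_CHANNELS = range(11, 27)  # 2.4 GHz Zigbee channels 11..26
# Hypothetical normalized field names for the values that must never change.
IDENTITY_FIELDS = ("pan_id", "ext_pan_id", "network_key")


def with_channel(network: dict, channel: int) -> dict:
    """Return a copy of the settings in which only the channel differs."""
    if channel not in VALID_CHANNELS:
        raise ValueError(f"channel {channel} is outside 11..26")
    updated = copy.deepcopy(network)
    updated["channel"] = channel
    return updated


def identity_unchanged(before: dict, after: dict) -> bool:
    """Step 7: confirm PAN, ExtPAN, and network key survived the wiggle."""
    return all(before[field] == after[field] for field in IDENTITY_FIELDS)
```

For Case A, `with_channel(original, 15)` models step 3 and `with_channel(temporary, 20)` models step 6; `identity_unchanged(original, restored)` should be true before Z2M is started again.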

Case B: Household 1, second Z2M instance

  • Original channel: 15
  • Error: same as above (network commissioning timed out... same panId or extendedPanId...)
  • First attempt, a temporary change from channel 15 -> 20, still failed.
  • Second attempt, a channel wiggle 15 -> 25 -> 15, succeeded.
  • PAN/ExtPAN/network key were preserved; database.db restored after clean stop.
  • Result: existing mesh returned.

Case C: Household 2, separate HA environment

  • Original channel: 11
  • Error: same as above (network commissioning timed out... same panId or extendedPanId...)
  • Recovery: channel wiggle 11 -> 15 -> 11, preserve PAN/ExtPAN/key, restore DB.
  • Result: existing mesh returned.
  • Sanitized live logs for this second household can be provided separately.

Related upstream symptoms

Code path / diagnostic concern

In zigbee-herdsman src/adapter/z-stack/adapter/manager.ts, the failure appears during beginCommissioning():

  • bdbStartCommissioning is called with mode 0x04
  • code waits for ZDO stateChangeInd for 60s
  • on timeout it throws the same-PAN/ExtPAN message

There is also a later, explicit "panId collision detected" check. This suggests the timeout message is heuristic and may hide other root causes.
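
To illustrate why a timeout-derived message is only a heuristic, here is a small Python model of the pattern described above. It is not zigbee-herdsman code (which is TypeScript), and `simulate` is a hypothetical name: the point is only that anything preventing the awaited state change from arriving produces the same error text, whatever the actual cause.

```python
import asyncio

MESSAGE = (
    "network commissioning timed out - most likely network with the same "
    "panId or extendedPanId already exists nearby"
)


class CommissioningTimeout(Exception):
    """Raised when the awaited state change never arrives."""


async def simulate(signal_arrives: bool, timeout_s: float = 0.05) -> str:
    """Model 'start commissioning, then wait for a state-change indication'."""
    state_change = asyncio.Event()
    if signal_arrives:
        state_change.set()
    try:
        await asyncio.wait_for(state_change.wait(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The timeout carries no information about *why* nothing arrived:
        # a PAN collision, a stale NVRAM state, or a dead TCP link look alike.
        raise CommissioningTimeout(MESSAGE)
    return "started"
```

In this model, `asyncio.run(simulate(False))` raises the same-PAN message even though no collision check of any kind took place.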

Expected behavior

When configuration.yaml + coordinator_backup.json contain the same PAN/ExtPAN/network key as the existing mesh, and the same coordinator is being restarted/restored:

  1. Zigbee2MQTT should reliably restore/start the existing network without falling into a fragile commissioning collision path.
  2. If it cannot, the error should distinguish:
    • true external duplicate PAN/ExtPAN detected,
    • adapter NVRAM/config/backup mismatch,
    • own mesh routers beaconing while coordinator is re-forming,
    • firmware/baudrate/TCP adapter issue,
    • backup parse/restore failure.
  3. The suggested recovery should avoid PAN/ExtPAN/network key changes unless the user explicitly accepts re-pairing.
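
The distinction requested in point 2 could be modeled roughly as follows. This is a sketch of the desired reporting, with hypothetical names throughout; whether Z-Stack or zigbee-herdsman can actually observe each of these inputs (in particular, who sent a beacon) is exactly what the questions below ask.

```python
from dataclasses import dataclass
from enum import Enum


class StartupFailure(Enum):
    BACKUP_UNREADABLE = "backup parse/restore failure"
    ADAPTER_UNREACHABLE = "firmware/baudrate/TCP adapter issue"
    IDENTITY_MISMATCH = "adapter NVRAM/config/backup mismatch"
    OWN_MESH_BEACONING = "own mesh routers beaconing while coordinator is re-forming"
    FOREIGN_DUPLICATE = "true external duplicate PAN/ExtPAN detected"
    UNKNOWN = "commissioning timed out for an undetermined reason"


@dataclass
class Observations:
    """Hypothetical inputs; each would need real support in the stack."""
    backup_parsed: bool
    adapter_responding: bool
    identity_consistent: bool        # config, backup, and NVRAM agree
    beacon_with_same_pan: bool       # a beacon with our PAN/ExtPAN was heard
    beacon_from_known_device: bool   # its sender is in our own device database


def classify(obs: Observations) -> StartupFailure:
    """Check the cheapest, most certain causes first; guess last."""
    if not obs.backup_parsed:
        return StartupFailure.BACKUP_UNREADABLE
    if not obs.adapter_responding:
        return StartupFailure.ADAPTER_UNREACHABLE
    if not obs.identity_consistent:
        return StartupFailure.IDENTITY_MISMATCH
    if obs.beacon_with_same_pan:
        if obs.beacon_from_known_device:
            return StartupFailure.OWN_MESH_BEACONING
        return StartupFailure.FOREIGN_DUPLICATE
    return StartupFailure.UNKNOWN
```

The ordering is a design choice: a collision verdict is only reached after local explanations have been ruled out, and a plain timeout falls through to `UNKNOWN` instead of being reported as a collision.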

Questions for maintainers

  1. Under what exact conditions does startup strategy choose startCommissioning instead of startup or restoreBackup when a coordinator_backup.json exists?
  2. Can the code detect and report when config/backup/adapter NVRAM mismatch is the real cause before calling BDB commissioning?
  3. During commissioning, can the coordinator distinguish "my own mesh routers are beaconing with the expected ExtPAN" from a true foreign PAN collision?
  4. Would maintainers accept diagnostic improvements that print:
    • startup strategy,
    • config vs backup vs adapter channel/PAN/ExtPAN comparison,
    • whether coordinator_backup.json was accepted/rejected,
    • exact reason for entering startCommissioning?
  5. Is a safe channel-wiggle preserving PAN/ExtPAN/key an acceptable documented recovery path compared with PAN-ID wiggle workarounds?
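
As a concrete picture of the diagnostics proposed in question 4, the sketch below formats a three-way comparison. Everything here is hypothetical: the function name, the output wording, and the assumption that values from config, backup, and adapter are available as simple comparable values before commissioning starts.

```python
def format_startup_diagnostics(
    strategy: str, reason: str, backup_accepted: bool, sources: dict
) -> list:
    """Build log lines comparing each identity field across all sources."""
    lines = [
        f"startup strategy: {strategy} ({reason})",
        "coordinator_backup.json: "
        + ("accepted" if backup_accepted else "rejected"),
    ]
    for field in ("channel", "pan_id", "ext_pan_id"):
        values = {name: source.get(field) for name, source in sources.items()}
        status = "match" if len(set(values.values())) == 1 else "MISMATCH"
        detail = ", ".join(f"{name}={value}" for name, value in values.items())
        lines.append(f"{field}: {status} ({detail})")
    return lines
```

With `sources` holding `config`, `backup`, and `adapter` entries, a disagreement on a single field (for example the adapter still being on a different channel) becomes visible as one `MISMATCH` line instead of a generic collision message.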

Safety note

Changing pan_id, ext_pan_id, or network_key is not acceptable as a general workaround for large production meshes, because it can strand devices or require mass re-pairing. In the successful recoveries above, those values were preserved.
