[BUG]: NVMe LED state stuck in FAILURE during hot-plug due to Multipath/Virtual syspath mismatch #274

@czsczsczs2

Description

Hi, I have encountered an issue where the LED state becomes abnormal during NVMe hot-plugging.

During NVMe hot-plug operations on systems with native multipathing enabled (nvme_core.multipath=Y), ledmon fails to track device state changes correctly. Specifically, when a drive is removed and re-inserted, the LED state becomes stuck in FAILURE. This happens because ledmon stores the physical PCI sysfs path, while udev reports events using the virtual subsystem path (e.g., /sys/devices/virtual/nvme-subsystem/...). Because the two paths never compare equal, ledmon cannot find the matching block device in its internal list, so the state is never reset to NORMAL.
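
To illustrate the mismatch, here is a minimal Python sketch of one possible way to match the two paths: compare the namespace name instead of the full sysfs path. It assumes the naming seen in the logs in this report, where the stored path ends in a per-controller name (nvme1c1n1) and the udev path ends in the multipath head name (nvme1n1). The functions head_name and same_namespace are hypothetical and are not part of ledmon, which is written in C.

```python
import re
from pathlib import PurePosixPath

# Assumption: per-path namespace names look like nvme<subsys>c<ctrl>n<ns>,
# and the multipath head node drops the controller part: nvme<subsys>n<ns>.
_PATH_NS = re.compile(r"^nvme(\d+)c\d+n(\d+)$")


def head_name(syspath: str) -> str:
    """Return the head-node name for the last component of a sysfs path."""
    name = PurePosixPath(syspath).name
    m = _PATH_NS.match(name)
    # Names that are not per-path names (e.g. nvme1n1) are returned unchanged.
    return f"nvme{m.group(1)}n{m.group(2)}" if m else name


def same_namespace(stored_syspath: str, udev_syspath: str) -> bool:
    """Hypothetical matcher: compare by head-node name, not by full path."""
    return head_name(stored_syspath) == head_name(udev_syspath)
```

With the two paths from this report, head_name maps both to nvme1n1, so same_namespace would treat them as the same device even though the full paths differ.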

Steps to reproduce bug

  1. Use an NVMe SSD that supports Multi-controller Capabilities (CMIC).

  2. Ensure the kernel is running with nvme_core.multipath=Y (default in many modern distributions).

  3. Start ledmon in debug mode to monitor state transitions.

  4. Perform a hot-remove of the NVMe drive.

  5. Re-insert (hot-plug) the same NVMe drive.

  6. Check the LED state using ledctl --list-slots --controller-type VMD
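
Before running these steps, it may help to confirm that native multipathing is actually active on the running kernel (step 2). The following is a small sketch, assuming the parameter is exposed at /sys/module/nvme_core/parameters/multipath and reads "Y" when enabled; the function name native_multipath_enabled is made up for this example.

```python
from pathlib import Path

# Assumed location of the module parameter; overridable so it can be tested.
DEFAULT_PARAM = "/sys/module/nvme_core/parameters/multipath"


def native_multipath_enabled(param_path: str = DEFAULT_PARAM) -> bool:
    """Return True if the parameter file exists and contains 'Y'."""
    try:
        return Path(param_path).read_text().strip() == "Y"
    except OSError:
        # File missing or unreadable: treat multipath as not enabled.
        return False


if __name__ == "__main__":
    print("native multipath enabled:", native_multipath_enabled())
```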

Expected behavior

When the drive is removed, the LED state should update accordingly (e.g., to an empty or neutral state). Upon re-insertion, the state should return to NORMAL.

Actual behavior

The LED state transitions to FAILURE after removal and remains in FAILURE even after the drive is plugged back in. The logs show "No matching block device found" for both the remove and the add event, because the stored physical path never matches the virtual path reported by udev.

Environment

OS: Linux 6.8.0-83-generic
Disks: NVMe SSD with Multi-controller support (CMIC enabled)
Kernel Parameters: nvme_core.multipath=Y

Ledmon version

Intel(R) Enclosure LED Monitor Service 1.1.0

Ledmon logs

Initial State (Drive Attached)

ledmon[4902]: NEW /sys/devices/pci0000:d5/0000:d5:00.5/pci10003:00/10003:00:08.0/10003:0d:00.0/nvme/nvme1/nvme1c1n1: state 'ONESHOT_NORMAL'.
ledmon[4902]: CHANGE /sys/devices/pci0000:d5/0000:d5:00.5/pci10003:00/10003:00:08.0/10003:0d:00.0/nvme/nvme1/nvme1c1n1: from 'ONESHOT_NORMAL' to 'UNKNOWN'


root@SKY-9236-efi-N1:~/ledmon/src/ledctl# ./ledctl --list-slots --controller-type VMD
slot: 19              led state: NORMAL          device: /dev/nvme1n1

Drive Removal (Path Mismatch Occurs)

ledmon[4902]: === UDEV Event: action=remove, syspath=/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1 ===
ledmon[4902]: Searching for matching block device in list...
ledmon[4902]:   Comparing: stored sysfs_path=/sys/devices/pci0000:d5/0000:d5:00.5/pci10003:00/10003:00:08.0/10003:0d:00.0/nvme/nvme1/nvme1c1n1 with udev syspath=/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1
ledmon[4902]: No matching block device found for syspath: /sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1
ledmon[4902]: CHANGE /sys/devices/pci0000:d5/0000:d5:00.5/pci10003:00/10003:00:08.0/10003:0d:00.0/nvme/nvme1/nvme1c1n1: from 'UNKNOWN' to 'FAILURE'.


root@SKY-9236-efi-N1:~/ledmon/src/ledctl# ./ledctl --list-slots --controller-type VMD
slot: 19              led state: FAILURE         device: /dev/nvme1n1

Drive Re-insertion (State Remains FAILURE)

ledmon[4902]: CHANGE /sys/devices/pci0000:d5/0000:d5:00.5/pci10003:00/10003:00:08.0/10003:0d:00.0/nvme/nvme1/nvme1c1n1: from 'FAILURE' to 'UNKNOWN'
ledmon[4902]: === UDEV Event: action=add, syspath=/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1 ===
ledmon[4902]: Searching for matching block device in list...
ledmon[4902]:   Comparing: stored sysfs_path=/sys/devices/pci0000:d5/0000:d5:00.5/pci10003:00/10003:00:08.0/10003:0d:00.0/nvme/nvme1/nvme1c1n1 with udev syspath=/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1
ledmon[4902]: No matching block device found for syspath: /sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1
ledmon[4902]: === UDEV Event: action=change, syspath=/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1 ===


root@SKY-9236-efi-N1:~/ledmon/src/ledctl# ./ledctl --list-slots --controller-type VMD
slot: 19              led state: FAILURE         device: /dev/nvme1n1
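
For anyone automating this reproduction, the ledctl snapshots can be compared with a short script. This is only a sketch: it assumes the --list-slots output format shown in this report (slot, led state, device on one line), and parse_slots and stuck_in_failure are hypothetical helpers, not part of ledmon or ledctl.

```python
import re

# Assumption: each slot line looks like the ledctl output quoted in this report.
_SLOT_LINE = re.compile(r"slot:\s*(\S+)\s+led state:\s*(\S+)\s+device:\s*(\S+)")


def parse_slots(output: str) -> list:
    """Parse ledctl --list-slots output into a list of dicts."""
    slots = []
    for line in output.splitlines():
        m = _SLOT_LINE.search(line)
        if m:
            slots.append({"slot": m.group(1), "state": m.group(2), "device": m.group(3)})
    return slots


def stuck_in_failure(after_remove: str, after_reinsert: str) -> list:
    """Return slots that report FAILURE in both snapshots."""
    failed = {s["slot"] for s in parse_slots(after_remove) if s["state"] == "FAILURE"}
    return [
        s["slot"]
        for s in parse_slots(after_reinsert)
        if s["state"] == "FAILURE" and s["slot"] in failed
    ]
```

Passing the output captured after removal and the output captured after re-insertion returns the slots that never left FAILURE.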

Ledctl logs

No response

Ledmon supported controllers

VMD (Intel Volume Management Device)

Additional information

No response
