Retrigger udev on failure to get device serial #152

@ryanpgoogle

What happened:
We implicitly rely on udev to gather the serial ID of devices as they are added to the guest. If udev fails, for example due to transient networking issues on the underlying data path, it does not retry, so we are left with mounts that keep failing.

What you expected to happen:
NodeStageVolume retriggers udev when it cannot find the serial ID, so that staging eventually succeeds once the underlying networking issues are resolved.

How to reproduce it (as minimally and precisely as possible):
You can reproduce this in a slightly contrived way by adding a udev rule that forces udev to time out, simulating a command that hangs because of a networking issue. I did this by adding the following line to /usr/lib/udev/rules.d/60-persistent-storage.rules:

KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", IMPORT{program}="/usr/bin/sleep 600"

After creating a pod and PVC on this tenant node, you can see mount errors in the pod events:

Warning  FailedMount             10s (x6 over 27s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-0966c060-98fc-4037-93f9-3833b5874e98" : rpc error: code = Unknown desc = couldn't find device by serial id

and udev logs show that it quits:

Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' is taking a long time
Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' is taking a long time
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda'
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd'

Checking back periodically, I see that udev does not retry. That is expected, since udev is event driven and no new event arrives for the device.

Environment:
Looking at the code, I think this should happen in most environments and versions.

Labels: kind/bug, lifecycle/stale