Retrigger udev on failure to get device serial

**What happened**:
We implicitly rely on udev to gather the serial ID of devices as they are added to the guest. If udev fails, for example due to transient networking issues on the underlying data path, it does not retry. We are therefore stuck with failed mounts.

**What you expected to happen**:
NodeStageVolume should retrigger udev if it cannot find the serial ID, so that the mount eventually succeeds once the underlying networking issues are resolved.
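To make the expected behavior concrete, here is a minimal sketch of the retry loop, written in Python for brevity rather than in the driver's own code. The names `find_device_with_retrigger`, `retrigger_udev`, and `lookup` are hypothetical and do not come from the driver. The sketch assumes that replaying `change` events for the `block` subsystem with `udevadm trigger` is enough to make udev re-run the rules that populate the serial ID.

```python
import subprocess
import time
from typing import Callable, Optional


def retrigger_udev(settle_timeout_s: int = 30) -> None:
    """Replay 'change' events for block devices, then wait for udev's queue to drain."""
    subprocess.run(
        ["udevadm", "trigger", "--action=change", "--subsystem-match=block"],
        check=True,
    )
    # 'udevadm settle' exits non-zero if events are still queued at the timeout.
    # That is not treated as fatal because the caller retries the lookup anyway.
    subprocess.run(
        ["udevadm", "settle", f"--timeout={settle_timeout_s}"],
        check=False,
    )


def find_device_with_retrigger(
    lookup: Callable[[], Optional[str]],
    retrigger: Callable[[], None] = retrigger_udev,
    attempts: int = 3,
    delay_s: float = 1.0,
) -> str:
    """Call lookup() up to `attempts` times, retriggering udev between failures.

    `lookup` is a hypothetical callable that returns the device path for the
    wanted serial ID, or None if udev has not recorded it yet.
    """
    for attempt in range(attempts):
        device = lookup()
        if device is not None:
            return device
        if attempt < attempts - 1:
            # The serial is missing: ask udev to process the device again.
            retrigger()
            time.sleep(delay_s)
    raise LookupError("couldn't find device by serial id")
```

Because kubelet already retries the failed mount, a small `attempts` value per call is probably sufficient; the important part is that each failed lookup causes udev to process the device again instead of waiting for an event that never arrives.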

**How to reproduce it (as minimally and precisely as possible)**:
You can reproduce this in a slightly contrived way by adding a udev rule that forces udev to time out, simulating a command that hangs due to a networking issue. I did this by adding the following line to `/usr/lib/udev/rules.d/60-persistent-storage.rules`:

```
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", IMPORT{program}="/usr/bin/sleep 600"
```

After creating a pod and PVC on this tenant node, you can see mount errors in the pod events:
```
Warning  FailedMount             10s (x6 over 27s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-0966c060-98fc-4037-93f9-3833b5874e98" : rpc error: code = Unknown desc = couldn't find device by serial id
```

The udev logs show that the workers handling these devices are killed after the timeout:
```
Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' is taking a long time
Oct 03 17:01:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' is taking a long time
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7638 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: seq 7640 '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd' killed
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943916] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:0/block/sda'
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] terminated by signal 9 (KILL)
Oct 03 17:03:20 vm-018c7ff6 systemd-udevd[2674348]: worker [2943915] failed while handling '/devices/pci0000:00/0000:00:02.4/0000:05:00.0/virtio1/host0/target0:0:0/0:0:0:3/block/sdd'
```

Checking back periodically, I see that udev does not retry. This is expected, since udev is event driven and no new event arrives for the device.

**Environment**:
Looking at the code, I think this should happen in most environments and versions.

