Bottlerocket 1.48.0 EKS 1.34 image will not mount EBS volumes over nvme15 #4678

@johnjeffers

Description

Image I'm using:
EKS 1.34 Bottlerocket AMI 1.48.0 (arm64)

What I expected to happen:
All volumes attached to the EC2 instance are properly mounted and available as Kubernetes persistent volumes

What actually happened:

  • Volumes with an index greater than 15 are not mounted
  • Kubernetes pods that require the associated PV will not start
  • This issue wasn't present in EKS 1.33 AMIs, and only started after upgrading to EKS 1.34 AMIs.
  • Upgrading to Bottlerocket 1.49.0 did not resolve the issue

The EC2 console shows the EBS volumes as attached, but SSHing into the instance and running lsblk shows that volumes with indexes greater than 15 are not properly mounted.
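One quick way to spot the mismatch is to count the NVMe namespaces the kernel exposes and compare that against the number of EBS attachments the EC2 console reports. The helper below is a hypothetical sketch, not part of the original report; the function name and the `lsblk` invocation are mine.

```shell
#!/bin/sh
# Count NVMe namespace block devices (nvme0n1, nvme1n1, ...) from a list
# of device names, one per line, as produced by `lsblk -dn -o NAME`.
count_nvme_namespaces() {
  grep -c '^nvme[0-9][0-9]*n1$' || true
}

# On an affected node you would run:
#   lsblk -dn -o NAME | count_nvme_namespaces
# and compare the result with the attachment count in the EC2 console;
# a lower number on the node indicates the symptom described here.
```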

The following errors occur when trying to mount /dev/xvdao and subsequent devices.

Oct 23 06:32:49 ip-10-128-14-122.ec2.internal kubelet[2201]: E1023 06:32:49.753731 2201 csi_attacher.go:401] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = NotFound desc = Failed to find device path /dev/xvdap. no device path for device "/dev/xvdap" volume "vol-08078f8696e14b62c" found

Kernel logs showed NVMe I/O timeouts for devices after nvme15 (nvme16, nvme17, etc.):

[ 1000.337730] nvme nvme16: I/O tag 25 (1019) QID 1 timeout, completion polled
[ 1030.397337] nvme nvme16: I/O tag 26 (101a) QID 1 timeout, completion polled
[ 1032.307305] nvme nvme17: I/O tag 0 (1000) QID 0 timeout, completion polled

How to reproduce the problem:
In an EKS 1.34 cluster, attempt to run more than 15 pods with persistent volumes on a single node
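The reproduction step above can be sketched as a loop that creates enough EBS-backed claims to push a single node past 15 attachments. This is an illustrative sketch only: the claim names, the `gp3` storage class, and the pod scheduling detail are my assumptions, not from the report.

```shell
#!/bin/sh
# Hypothetical reproduction sketch: emit PVC manifests that, once consumed
# by pods scheduled onto the same node, force more than 15 EBS volume
# attachments on that node.
make_pvc_manifest() {
  i="$1"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-pvc-$i
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 1Gi
EOF
}

# Usage (against a live EKS 1.34 cluster; each PVC also needs a pod that
# mounts it, all pinned to one node):
#   for i in $(seq 1 20); do make_pvc_manifest "$i" | kubectl apply -f -; done
```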

I have opened a case with AWS Support about this issue, and the assigned agent was able to reproduce it, but the case has seen no traction, so I'm raising it here for additional awareness.

I have worked around the problem by running smaller nodes: fewer pods schedule on each node, so we stay under the 15-volume limit. However, this is costing us a lot of money, because our observability vendor charges per node and we have had to double our node count.

    Labels

    status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)
