DO NOT MERGE/WIP: Retry all VFIO operations#2665
DO NOT MERGE/WIP: Retry all VFIO operations#2665mattkur wants to merge 1 commit intomicrosoft:mainfrom
Conversation
|
This PR modifies files containing For more on why we check whole files, instead of just diffs, check out the Rustonomicon |
There was a problem hiding this comment.
Pull request overview
This WIP pull request refactors VFIO error handling by moving retry logic from the low-level vfio_sys crate to the higher-level user_driver crate. The changes introduce a VfioErrata enum to classify errors and a VfioDelayWorkaroundConfig helper to centralize retry logic for working around a kernel race condition.
Changes:
- Introduced VfioErrata enum to identify ENODEV errors requiring the delay workaround
- Removed driver parameter and built-in retry logic from vfio_sys::Group::open_device and set_keep_alive methods
- Added VfioDelayWorkaroundConfig helper with retry_on_enodev and retry_unconditional methods
- Applied retry logic to open_noiommu operation in VfioDevice::restore
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| vm/devices/user_driver/vfio_sys/src/lib.rs | Adds VfioErrata enum for error classification; removes driver parameter and retry logic from open_device and set_keep_alive; removes unused async driver imports |
| vm/devices/user_driver/src/vfio.rs | Adds VfioDelayWorkaroundConfig helper struct with retry methods; applies retry_unconditional to open_noiommu; updates method signatures to match vfio_sys changes |
| // There is a small race window in the 6.1 kernel between when the | ||
| // vfio device is visible to userspace, and when it is added to its | ||
| // internal list. Try one more time on ENODEV failure after a brief | ||
| // sleep. | ||
| pub async fn retry_on_enodev<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T> | ||
| where | ||
| F: AsyncFnMut(bool) -> anyhow::Result<T>, | ||
| { | ||
| match f(false).await { | ||
| Err(e) if VfioErrata::from_error(&e) == VfioErrata::DelayWorkaround => { | ||
| tracing::warn!("vfio operation {op_name} got ENODEV, retrying after delay"); | ||
| self.sleep().await; | ||
| f(true).await | ||
| } | ||
| r => r, | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
The retry_on_enodev method is defined but never used in the code. Either use it to wrap open_device and set_keep_alive calls (which previously had ENODEV retry logic), or remove it if retry_unconditional is the intended approach for all operations.
| // There is a small race window in the 6.1 kernel between when the | |
| // vfio device is visible to userspace, and when it is added to its | |
| // internal list. Try one more time on ENODEV failure after a brief | |
| // sleep. | |
| pub async fn retry_on_enodev<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T> | |
| where | |
| F: AsyncFnMut(bool) -> anyhow::Result<T>, | |
| { | |
| match f(false).await { | |
| Err(e) if VfioErrata::from_error(&e) == VfioErrata::DelayWorkaround => { | |
| tracing::warn!("vfio operation {op_name} got ENODEV, retrying after delay"); | |
| self.sleep().await; | |
| f(true).await | |
| } | |
| r => r, | |
| } | |
| } |
| if observed.downcast_ref::<nix::errno::Errno>() == Some(&nix::errno::Errno::ENODEV) { | ||
| return VfioErrata::DelayWorkaround; | ||
| } | ||
| return VfioErrata::None; |
There was a problem hiding this comment.
Unnecessary return statement. In Rust, the last expression in a function is automatically returned, so this explicit return is redundant and goes against idiomatic Rust style.
| return VfioErrata::None; | |
| VfioErrata::None |
| return VfioErrata::DelayWorkaround; | ||
| } | ||
| return VfioErrata::None; |
There was a problem hiding this comment.
Unnecessary return statement. In Rust, the last expression in a function is automatically returned, so this explicit return is redundant and goes against idiomatic Rust style.
| return VfioErrata::DelayWorkaround; | |
| } | |
| return VfioErrata::None; | |
| VfioErrata::DelayWorkaround | |
| } else { | |
| VfioErrata::None | |
| } |
| pub async fn retry_unconditional<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T> | ||
| where | ||
| F: AsyncFnMut(bool) -> anyhow::Result<T>, |
There was a problem hiding this comment.
The AsyncFnMut closure accepts a boolean parameter 'is_retry' but it's not used by the calling code. In the usage at line 131-133, the closure ignores this parameter with '_'. Consider either removing this unused parameter from the trait bound, or documenting its intended purpose if it's meant for future use.
| group.set_keep_alive(pci_id).await?; | ||
| } | ||
| tracing::debug!(pci_id, "about to open device"); | ||
| let device = group.open_device(pci_id, &driver).await?; | ||
| let device = group.open_device(pci_id).await?; |
There was a problem hiding this comment.
The retry logic has been removed but open_device and set_keep_alive are not being retried in the calling code. According to the PR title "Retry all VFIO operations", these operations should be wrapped with retry logic. Currently, only open_noiommu is being retried (line 130-134), but open_device (line 146) and set_keep_alive (line 143) are called without retry wrappers, even though they originally had ENODEV retry logic that was removed from vfio_sys.
| let group_id = vfio_sys::Group::find_group_for_device(&path)?; | ||
| let group = vfio_sys::Group::open_noiommu(group_id)?; | ||
| let group = workaround | ||
| .retry_unconditional("open_noiommu", async |_| { |
There was a problem hiding this comment.
The retry_unconditional method retries on any error, which is overly broad for the stated purpose of working around a specific kernel race condition (ENODEV). This could mask real errors and retry operations that should fail fast. Consider using retry_on_enodev instead, or document why unconditional retry is necessary for this specific operation.
| .retry_unconditional("open_noiommu", async |_| { | |
| .retry_on_enodev("open_noiommu", async |_| { |
No description provided.