feat: NVMe-oF multipath support (head-device discovery + multi-endpoint connect)#12
Conversation
Under native NVMe multipath (CONFIG_NVME_MULTIPATH=Y, the kernel default) each namespace appears in /sys/block both as the multipath head (nvme<C>n<N>, which has a /dev node) and as one per-path device (nvme<C>c<P>n<N>, which does not). The previous /sys/class/nvme walk built the device name from a path controller's number, which only equals the head when there is a single path; with multiple paths it produced a name with no /dev node and discovery failed. Rewrite the resolver to enumerate /sys/block, consider only head namespaces, match by subsystem NQN and nsid, and return the head's /dev node. Factor the sysfs/dev roots and the block-device test into overridable hooks so the logic is unit-testable, and add t/find_nvme_device.t (head selection vs per-path, multi-path collapse, non-multipath, older-kernel subsysnqn fallback, and the nsid/NQN/no-dev-node negatives). Verified on PVE 9.2.2 (multipath=Y): attach resolves to the head /dev/nvme0n1, not the per-path nvme0c0n1. Multi-path positive/negative (multiple live paths, failover) still needs a multi-node cluster to validate.
activate_volume now connects to every endpoint in lb_nvme_host (a comma-separated host:port list) instead of just one. On a multi-node LightOS (ANA) cluster a single connection can land on a non-optimized path, so the volume's namespace never becomes accessible and attach fails with 'block device did not appear'; connecting to all data nodes guarantees the optimized path is present, and gives transparent failover to another replica when a node goes down. Factor endpoint parsing into _nvme_endpoints (comma split, whitespace trim, default port 4420, rightmost-colon so bracketed IPv6 works) and add t/nvme_endpoints.t. Pairs with the multipath head-device resolution fix. Validated end-to-end on a 3-node cluster: VM on Proxmox over 3 paths, head device resolved, and a live reboot of the optimized node failed over to another replica with I/O resuming after a brief pause.
…nnect, IPv6, PII From the security review of the multipath branch: - deactivate_volume: disconnect the subsystem only when no volume of ANY storage on the host still maps it, decided from local symlinks instead of the per-storeid API listing. The old check could (a) fire on a transient API error and (b) ignore a second storage sharing the same cluster — and since 'nvme disconnect -n' is subsystem-wide (drops every path), that tore down in-flight I/O for other volumes/storages. Adds _nqn_still_in_use. - activate_volume: gate the multi-endpoint connect per endpoint (via _connected_endpoints) instead of per subsystem. The old _is_connected gate meant that if some endpoints failed on first activation the volume stayed permanently single-path, so a later node failure (ctrl-loss-tmo=-1) would hang VMs instead of failing over. - _nvme_endpoints: strip IPv6 brackets ([v6]:port / [v6]) and untaint all branches; update the test to expect the unbracketed host. - t/nvme_endpoints.t: replace real lab data-NIC IPs with 10.0.0.x. Re-validated on the 3-node cluster: VM start connects all 3 paths, head device resolves, QEMU uses it; VM stop disconnects cleanly.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds multipath NVMe-oF support: parse comma-separated lb_nvme_host endpoints, ensure connection to all endpoints during activation, resolve multipath head namespace devices via /sys/block enumeration, and prevent unsafe subsystem disconnects using local symlink checks. ChangesNVMe-oF Multipath and Safe Deactivation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Line 18: Replace the inconsistent "Multi-path" phrasing with the standardized
lowercase, no-hyphen term "multipath" across the changelog; specifically update
the header text "Multi-path NVMe-oF: `lb_nvme_host` accepts..." to read
"multipath NVMe-oF" and verify other occurrences (e.g., "multipath head" and
"native NVMe multipath") match this same lowercase, no-hyphen style for
consistency.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3e27d4be-fc7f-4efc-80fc-48a43aeda7ac
📒 Files selected for processing (4)
CHANGELOG.mdLightbitsPlugin.pmt/find_nvme_device.tt/nvme_endpoints.t
Summary
Makes the plugin work correctly with native NVMe-oF multipath against a multi-node LightOS (ANA) cluster. Three related changes:
_find_nvme_device): resolve the namespace head device (/dev/nvme<C>n<N>) from/sys/block, skipping per-pathnvme<C>c<P>n<N>entries (which have no/devnode). The previous/sys/class/nvmewalk built a device name from a path controller, which only equals the head with a single path and returned a non-existent node under multipath.activate_volume):lb_nvme_hostaccepts a comma-separated list ofhost:portdata endpoints; the plugin connects to each (per-endpoint gating via_connected_endpoints, so a path that failed once is retried later instead of leaving the volume permanently single-path). On a multi-node ANA cluster this guarantees the volume's optimized path is present — a single connection can land on a non-optimized path and the namespace then never becomes accessible — and gives transparent failover when a node goes down. Reconnect tuning:--keep-alive-tmo=30 --reconnect-delay=10 --ctrl-loss-tmo=-1.deactivate_volume): disconnect the subsystem only when no volume of any storage on the host still maps it, decided from local symlinks rather than the REST API (nvme disconnect -nis subsystem-wide, so the old per-storeid/API-derived check could drop in-use paths for a second storage sharing the cluster, or fire on a transient API error).Testing
Unit: 114 tests pass (taint-clean,
perl -T -c), including newt/find_nvme_device.t(head vs per-path selection, multi-controller→one head, non-multipath, older-kernelsubsysnqnfallback, nsid/NQN/no-/dev-node negatives) andt/nvme_endpoints.t(comma split, whitespace, default port, bracketed IPv6).Extensive end-to-end testing was performed on a real 3-node LightOS cluster (PVE 9.2.2 guest, 3 data paths), covering positive, negative, and simulated-failover scenarios:
cN) devices with no/devnode are never returned; lone per-path entries, nsid/NQN mismatches, and missing-/dev-node heads all resolve to "not found" rather than a wrong device.ctrl-loss-tmo=-1), the cluster re-optimized ANA to another replica, the head device survived, and I/O resumed at full throughput; the node rejoined as a healthy path afterward. Replica-count and single-vs-multi-replica path behavior were also exercised.Security review
Reviewed by the security agent with a specific focus on data-loss when multiple clusters/storages share one Proxmox host; all findings addressed (safe subsystem teardown, per-endpoint connect, IPv6 parsing, no PII in tests). Verified clean to merge.
Notes
lb_nvme_hostwith IP endpoints._find_nvme_deviceto close a narrow nsid-reuse race.Summary by CodeRabbit
New Features
Bug Fixes