fix: align drive serial source in find_disks() to prevent SCSI NAA prefix mismatch (OP-290)#2466
Dependency Review: ✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found. Scanned files: None
Claude finished @caspx's task in 2m 29s
Code Review
Summary
The PR fixes a serial ID mismatch between
The root cause analysis is correct and the Python fix is sound. A few issues worth addressing below.
Issues
Fallback path in
if not serial_id:
    logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
    serial_id = (device.get("serial") or "").strip() or None
If
Go suffix match is non-deterministic when multiple raw serials match
for rawSerial, cap := range rawDriveCapacity {
    if strings.HasSuffix(drive.SerialId, rawSerial) {
        capacity = cap
        ok = true
        break
    }
}
Map iteration order in Go is random. If two different raw serials are both suffixes of
Go suffix match only handles one direction
The code assumes the NAA prefix is always on the
Minor
Overall
The Python fix is the right approach and should resolve the SCSI NAA mismatch for the primary code path. The two concerns above (dead code + fallback reintroducing the bug in degraded environments) are worth addressing. The Go fallback is good defensive coding for in-progress rollouts.
Pull request overview
This PR aims to eliminate drive-serial mismatches between Drives and RawDrives during drive discovery (notably SCSI NAA prefix differences) so the controller can reliably correlate capacity and other metadata.
Changes:
- Update find_disks() to use get_device_serial_id() (matching find_weka_drives()’s serial resolution intent) and adjust fallback behavior.
- Add a suffix-based serial match fallback in the Go controller when correlating Drives to RawDrives for capacity lookup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/controllers/wekacontainer/funcs_oneoff.go | Adds suffix-match fallback when drive.SerialId isn’t found in RawDrives map. |
| charts/weka-operator/resources/weka_runtime.py | Switches find_disks() serial source to get_device_serial_id() to align RawDrives/Drives serial formats. |
logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
serial_id = (device.get("serial") or "").strip() or None
find_disks() now relies on get_device_serial_id(device_path), but get_device_serial_id’s SCSI/SATA branch derives device_base_name from readlink -f /sys/class/block/<disk> via split('/')[-2], which for disks like /dev/sda resolves to .../block/sda and yields device_base_name == "block". That makes the udev lookup fail and forces the fallback to lsblk serial (reintroducing the RawDrives/Drives mismatch this PR is trying to fix). Consider fixing get_device_serial_id to use the actual block device name (e.g., device_name / split('/')[-1]) for /sys/block/<name>/dev and udev lookup, and consider falling back to get_serial_id_fallback() before using lsblk’s serial column.
Suggested change:
- logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
- serial_id = (device.get("serial") or "").strip() or None
+ device_name = os.path.basename(device_path)
+ logging.warning(
+     f"get_device_serial_id returned empty for {device_path}, trying get_serial_id_fallback before lsblk serial"
+ )
+ serial_id = get_serial_id_fallback(device_name)
+ if not serial_id:
+     logging.warning(f"get_serial_id_fallback returned empty for {device_path}, falling back to lsblk serial")
+     serial_id = (device.get("serial") or "").strip() or None
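The path-parsing problem described in the comment above can be reproduced with plain string handling. The following is a minimal sketch, not the code in weka_runtime.py: the two sysfs paths are hypothetical examples of what `readlink -f /sys/class/block/<disk>` might return, and the helper names `parent_component` and `block_device_name` are invented for illustration.

```python
import posixpath

# Hypothetical resolved sysfs paths (assumed shapes, not taken from the PR).
SCSI_PATH = "/sys/devices/pci0000:00/0000:00:04.0/virtio1/host2/target2:0:0/2:0:0:1/block/sda"
NVME_PATH = "/sys/devices/pci0000:00/0000:00:05.0/nvme/nvme0/nvme0n1"


def parent_component(resolved_path: str) -> str:
    """Return what split('/')[-2] yields: the parent directory name."""
    return resolved_path.rstrip("/").split("/")[-2]


def block_device_name(resolved_path: str) -> str:
    """Return the last path component, i.e. the block device name itself."""
    return posixpath.basename(resolved_path.rstrip("/"))


if __name__ == "__main__":
    for path in (SCSI_PATH, NVME_PATH):
        print(path)
        print("  split('/')[-2] ->", parent_component(path))
        print("  basename       ->", block_device_name(path))
```

With paths of this shape, the parent component is the controller name on NVMe but the literal directory name `block` on SCSI, which is why a lookup keyed on it would miss for disks such as /dev/sda.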
// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
for rawSerial, cap := range rawDriveCapacity {
    if strings.HasSuffix(drive.SerialId, rawSerial) {
        capacity = cap
        ok = true
        break
    }
}
if !ok {
    return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
}
The suffix-match fallback over rawDriveCapacity iterates a Go map, so if more than one rawSerial is a suffix of drive.SerialId the chosen capacity is non-deterministic (map iteration order is randomized). It would be safer to collect all suffix matches and (a) require exactly one match, or (b) prefer the longest matching suffix, and error if the match is ambiguous.
Suggested change:
- // Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
- for rawSerial, cap := range rawDriveCapacity {
-     if strings.HasSuffix(drive.SerialId, rawSerial) {
-         capacity = cap
-         ok = true
-         break
-     }
- }
- if !ok {
-     return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
- }
+ // Drives may include a SCSI NAA prefix that RawDrives lacks — prefer the longest matching suffix.
+ var matchedSerial string
+ var matchedCapacity int64
+ longestMatchLen := -1
+ ambiguous := false
+ for rawSerial, cap := range rawDriveCapacity {
+     if !strings.HasSuffix(drive.SerialId, rawSerial) {
+         continue
+     }
+     matchLen := len(rawSerial)
+     if matchLen > longestMatchLen {
+         matchedSerial = rawSerial
+         matchedCapacity = cap
+         longestMatchLen = matchLen
+         ambiguous = false
+         continue
+     }
+     if matchLen == longestMatchLen && rawSerial != matchedSerial {
+         ambiguous = true
+     }
+ }
+ if longestMatchLen == -1 {
+     return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+ }
+ if ambiguous {
+     return fmt.Errorf("drive %s matches multiple RawDrives suffixes with the same specificity", drive.SerialId)
+ }
+ capacity = matchedCapacity
+ ok = true
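The longest-suffix idea in this suggestion can be sketched independently of the controller. The version below is an illustration in Python, not a port of funcs_oneoff.go: the function name `match_raw_serial`, the dictionary shape, and the guard against empty keys are assumptions made for the example.

```python
from typing import Dict, Tuple


def match_raw_serial(drive_serial: str, raw_capacity: Dict[str, int]) -> Tuple[str, int]:
    """Return (raw_serial, capacity) for a drive serial.

    Tries an exact match first, then the longest raw serial that is a
    suffix of drive_serial. Raises KeyError when nothing matches.
    """
    if drive_serial in raw_capacity:
        return drive_serial, raw_capacity[drive_serial]

    # Skip empty keys: every string ends with "", so an empty raw serial
    # would otherwise match every drive.
    candidates = [raw for raw in raw_capacity if raw and drive_serial.endswith(raw)]
    if not candidates:
        raise KeyError(f"drive {drive_serial} present in Drives but missing from RawDrives")

    # The result does not depend on iteration order: max() picks by length.
    best = max(candidates, key=len)
    return best, raw_capacity[best]
```

One property worth noting: two distinct keys of the same length cannot both be suffixes of the same string, so the longest match is always unique and the choice is deterministic. Whether a shorter, non-longest match should instead be treated as an error is a policy decision this sketch does not make.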
  capacity, ok := rawDriveCapacity[drive.SerialId]
  if !ok {
-     return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+     // Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
+     for rawSerial, cap := range rawDriveCapacity {
+         if strings.HasSuffix(drive.SerialId, rawSerial) {
+             capacity = cap
+             ok = true
+             break
+         }
+     }
+     if !ok {
+         return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+     }
  }
  if _, ok := seenDrives[drive.SerialId]; !ok {
This capacity lookup now tolerates Drives[i].SerialId differing from RawDrives[j].SerialId (e.g., NAA prefix), but later in this function blockedDrives is built by iterating RawDrives and checking membership in seenDrives by serial string. If RawDrives uses the unprefixed serial while seenDrives keys are prefixed (from Drives), mounted drives that were previously used by Weka won’t be blocked. Consider normalizing serial IDs (or maintaining a raw->drive serial mapping) so all subsequent logic uses a consistent canonical serial.
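One way to picture the raw-to-drive mapping proposed here is the sketch below. It is written in Python for illustration only and makes several assumptions: serials are plain strings, `seen_drives` is a set keyed by the Drives-side serial, and the helper names `build_raw_to_drive` and `raw_serials_seen` are hypothetical. The actual blockedDrives logic in the controller may apply further conditions that are not modelled here.

```python
from typing import Dict, Iterable, List, Set


def build_raw_to_drive(drive_serials: Iterable[str], raw_serials: List[str]) -> Dict[str, str]:
    """Map each raw serial to the Drives-side serial it corresponds to."""
    mapping: Dict[str, str] = {}
    for drive in drive_serials:
        if drive in raw_serials:
            mapping[drive] = drive  # formats already agree
            continue
        # Otherwise pick the longest non-empty raw serial that is a suffix.
        candidates = [raw for raw in raw_serials if raw and drive.endswith(raw)]
        if candidates:
            mapping[max(candidates, key=len)] = drive
    return mapping


def raw_serials_seen(raw_serials: List[str], seen_drives: Set[str],
                     raw_to_drive: Dict[str, str]) -> List[str]:
    """Return raw serials whose canonical (Drives-side) serial is in seen_drives."""
    return [raw for raw in raw_serials if raw_to_drive.get(raw, raw) in seen_drives]
```

Translating each raw serial through the mapping before the membership check means a prefixed and an unprefixed form of the same drive compare equal, whereas a direct `raw in seen_drives` test would miss the prefixed one.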
Graphite Automations: "Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR (04/17/26). 3 reviewers were added to this PR based on Anton Bykov's automation.
c93e0ec to 1b3fff8
Claude finished @caspx's task in 1m 53s
Code Review (updated, PR synchronize)
What changed since last review
All three issues from the previous Claude review and Copilot's
Fixed:
One remaining concern
This only matters in the scenario where the Go suffix-match fallback was needed (i.e., Python's
The fix would be to build a
Overall
The Python changes are solid: the three-tier fallback chain is correct and the sysfs path parsing fix (
if !strings.HasSuffix(drive.SerialId, rawSerial) {
    continue
}
if ok {
    return fmt.Errorf("drive %s matches multiple RawDrives serials ambiguously", drive.SerialId)
}
capacity = capVal
ok = true
do we still need this suffix-check if the root cause is fixed in weka_runtime?
# agree. For NVMe this reads sysfs /sys/.../nvme0/serial; for SCSI/SATA it
# reads udev ID_SERIAL (which includes the NAA prefix on SCSI).
# Fall back through udev-only helper, then to lsblk serial as last resort.
serial_id = await get_device_serial_id(device_path)
do we need this change or is it enough to have a separate flow for "SCSI/SATA device" under else (line 702)?


TL;DR
Drive discovery fails on SCSI-attached storage (OCI iSCSI and similar) because RawDrives and Drives are built from different serial sources. Align the sources at the root cause (Python) and add a defensive fallback (Go).

What changed?
- charts/weka-operator/resources/weka_runtime.py: find_disks() now calls the existing get_device_serial_id() helper, the same logic find_weka_drives() uses. RawDrives and Drives now pull the serial from the same place (sysfs on NVMe, udev ID_SERIAL on SCSI). lsblk kept as last-resort fallback.
- internal/controllers/wekacontainer/funcs_oneoff.go: on exact-match miss, try suffix match against RawDrives keys before erroring. Protects against any future serial-format drift, not just today's NAA prefix.

How to test?
kubectl annotate nodes --all weka.io/sign-drives-hash- weka.io/weka-full-drives- weka.io/drivers-loaded-
weka.io/weka-full-drives is populated with non-zero capacity.

Why make this change?
Before OP-280 (commit b03ec7fc), a serial mismatch silently set drive capacity to 0. OP-280 tightened this into a hard error, which surfaced the bug on OCI iSCSI where lsblk SERIAL returns the short kernel serial while udev ID_SERIAL includes the full SCSI NAA prefix (3 + vendor OUI + kernel serial). The cross-reference in funcs_oneoff.go uses exact match, so these never matched and clusters stalled in Init.
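The ordered fallback described under "What changed?" (primary helper first, lsblk as last resort) can be sketched as a small chain of resolvers. This is a simplified, synchronous illustration rather than the code in weka_runtime.py, where the primary helper is awaited; the function name `first_serial` and the resolver labels are assumptions made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# A resolver is a (label, zero-argument callable) pair; the callable
# returns a serial string, or None/empty when it cannot determine one.
Resolver = Tuple[str, Callable[[], Optional[str]]]


def first_serial(resolvers: Iterable[Resolver]) -> Tuple[Optional[str], Optional[str]]:
    """Return (serial, label) from the first resolver that yields a non-blank value."""
    for label, resolve in resolvers:
        value = (resolve() or "").strip()
        if value:
            return value, label
    return None, None


if __name__ == "__main__":
    serial, source = first_serial([
        ("get_device_serial_id", lambda: None),  # stand-in: primary source empty
        ("udev-only helper", lambda: "   "),     # stand-in: blank result
        ("lsblk serial", lambda: " S1 "),        # stand-in: last resort
    ])
    print(serial, source)
```

Returning the label alongside the serial makes it possible to log which source was used, which helps when diagnosing a mismatch that only appears once the chain has fallen through to the lsblk value.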