Skip to content

fix: align drive serial source in find_disks() to prevent SCSI NAA prefix mismatch (OP-290)#2466

Open
caspx wants to merge 1 commit intomainfrom
fix-drive-serial-naa-prefix
Open

fix: align drive serial source in find_disks() to prevent SCSI NAA prefix mismatch (OP-290)#2466
caspx wants to merge 1 commit intomainfrom
fix-drive-serial-naa-prefix

Conversation

@caspx
Copy link
Copy Markdown
Contributor

@caspx caspx commented Apr 17, 2026

TL;DR

Drive discovery fails on SCSI-attached storage (OCI iSCSI and similar) because RawDrives and Drives are built from different serial sources. Align the sources at the root cause (Python) and add a defensive fallback (Go).

What changed?

  • charts/weka-operator/resources/weka_runtime.pyfind_disks() now calls the existing get_device_serial_id() helper, the same logic find_weka_drives() uses. RawDrives and
    Drives now pull the serial from the same place (sysfs on NVMe, udev ID_SERIAL on SCSI). lsblk kept as last-resort fallback.
  • internal/controllers/wekacontainer/funcs_oneoff.go — on exact-match miss, try suffix match against RawDrives keys before erroring. Protects against any future serial-format
    drift, not just today's NAA prefix.

How to test?

  1. Clear node annotations: kubectl annotate nodes --all weka.io/sign-drives-hash- weka.io/weka-full-drives- weka.io/drivers-loaded-
  2. Reinstall operator with this chart, apply a WekaCluster on an OCI bliss cluster.
  3. Verify the sign-drives policy succeeds and weka.io/weka-full-drives is populated with non-zero capacity.
  4. Regression check on an NVMe-backed cluster — sign-drives still succeeds.

Why make this change?

Before OP-280 (commit b03ec7fc), a serial mismatch silently set drive capacity to 0. OP-280 tightened this into a hard error, which surfaced the bug on OCI iSCSI where lsblk SERIAL returns the short kernel serial while udev ID_SERIAL includes the full SCSI NAA prefix (3 + vendor OUI + kernel serial). The cross-reference in funcs_oneoff.go uses
exact match, so these never matched and clusters stalled in Init.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 17, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copy link
Copy Markdown
Contributor Author

caspx commented Apr 17, 2026


How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 17, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@caspx caspx marked this pull request as ready for review April 17, 2026 13:00
@caspx caspx requested a review from a team as a code owner April 17, 2026 13:00
Copilot AI review requested due to automatic review settings April 17, 2026 13:00
@claude
Copy link
Copy Markdown

claude bot commented Apr 17, 2026

Claude finished @caspx's task in 2m 29s —— View job


Code Review

  • Gather context and understand the PR
  • Read changed files and analyze diff
  • Provide code review feedback

Summary

The PR fixes a serial ID mismatch between find_disks() and find_weka_drives() / the Weka API: find_disks() was reading lsblk SERIAL (no NAA prefix for SCSI), while the other path reads udev ID_SERIAL (which includes the NAA prefix). The fix aligns both to use the same source. A Go-side fallback is added for defense-in-depth.

The root cause analysis is correct and the Python fix is sound. A few issues worth addressing below.


Issues

get_serial_id_fallback() is now dead code

charts/weka-operator/resources/weka_runtime.py, line 386

get_serial_id_fallback() was previously the fallback inside find_disks(). The new code replaces it with a direct fall-through to lsblk's serial field. The function is now unreferenced and should be removed to avoid confusion. Fix this →


Fallback path in find_disks() reverts to the old (mismatched) serial source

charts/weka-operator/resources/weka_runtime.py, line ~497

if not serial_id:
    logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
    serial_id = (device.get("serial") or "").strip() or None

If get_device_serial_id() fails (e.g. sysfs/udev unavailable), the fallback is the lsblk SERIAL field — the same source that caused the bug. The drive could end up registered without the NAA prefix, reproducing the mismatch in degraded environments. The old get_serial_id_fallback() used the udev path (same as get_device_serial_id()) and would be more consistent. Consider keeping it as the fallback instead of lsblk serial, and only using lsblk as the last resort. Fix this →


Go suffix match is non-deterministic when multiple raw serials match

internal/controllers/wekacontainer/funcs_oneoff.go, lines 188–195

for rawSerial, cap := range rawDriveCapacity {
    if strings.HasSuffix(drive.SerialId, rawSerial) {
        capacity = cap
        ok = true
        break
    }
}

Map iteration order in Go is random. If two different raw serials are both suffixes of drive.SerialId (e.g. "abc" and "bc" both match "naa.0000abc"), the matched capacity is non-deterministic. In practice this shouldn't happen since real serials are 20+ hex characters, but a short comment asserting the uniqueness assumption would help reviewers. Consider also adding a check that only one match is found and logging a warning if multiple are found.


Go suffix match only handles one direction

internal/controllers/wekacontainer/funcs_oneoff.go, lines 188–195

The code assumes the NAA prefix is always on the Drives side (from the Weka API) and the raw serial is always shorter. If the NAA prefix ever appears on the RawDrives side instead, the fallback won't help. This is fine if the data flow is well-understood, but worth noting explicitly in the comment.


Minor

  • The NVMe path in find_weka_drives() navigates 3 directories up (pci_device_path.split("/")[:-2]) while get_device_serial_id() goes only 1 up (split("/")[:-1]). This asymmetry suggests get_device_serial_id() and find_weka_drives() may resolve the NVMe controller directory differently. Worth verifying these produce the same result on NVMe hardware.

Overall

The Python fix is the right approach and should resolve the SCSI NAA mismatch for the primary code path. The two concerns above (dead code + fallback reintroducing the bug in degraded environments) are worth addressing. The Go fallback is good defensive coding for in-progress rollouts.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate drive-serial mismatches between Drives and RawDrives during drive discovery (notably SCSI NAA prefix differences) so the controller can reliably correlate capacity and other metadata.

Changes:

  • Update find_disks() to use get_device_serial_id() (matching find_weka_drives()’s serial resolution intent) and adjust fallback behavior.
  • Add a suffix-based serial match fallback in the Go controller when correlating Drives to RawDrives for capacity lookup.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
internal/controllers/wekacontainer/funcs_oneoff.go Adds suffix-match fallback when drive.SerialId isn’t found in RawDrives map.
charts/weka-operator/resources/weka_runtime.py Switches find_disks() serial source to get_device_serial_id() to align RawDrives/Drives serial formats.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +491 to +492
logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
serial_id = (device.get("serial") or "").strip() or None
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_disks() now relies on get_device_serial_id(device_path), but get_device_serial_id’s SCSI/SATA branch derives device_base_name from readlink -f /sys/class/block/<disk> via split('/')[-2], which for disks like /dev/sda resolves to .../block/sda and yields device_base_name == "block". That makes the udev lookup fail and forces the fallback to lsblk serial (reintroducing the RawDrives/Drives mismatch this PR is trying to fix). Consider fixing get_device_serial_id to use the actual block device name (e.g., device_name / split('/')[-1]) for /sys/block/<name>/dev and udev lookup, and consider falling back to get_serial_id_fallback() before using lsblk’s serial column.

Suggested change
logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
serial_id = (device.get("serial") or "").strip() or None
device_name = os.path.basename(device_path)
logging.warning(
f"get_device_serial_id returned empty for {device_path}, trying get_serial_id_fallback before lsblk serial"
)
serial_id = get_serial_id_fallback(device_name)
if not serial_id:
logging.warning(f"get_serial_id_fallback returned empty for {device_path}, falling back to lsblk serial")
serial_id = (device.get("serial") or "").strip() or None

Copilot uses AI. Check for mistakes.
Comment on lines +187 to +197
// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
for rawSerial, cap := range rawDriveCapacity {
if strings.HasSuffix(drive.SerialId, rawSerial) {
capacity = cap
ok = true
break
}
}
if !ok {
return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suffix-match fallback over rawDriveCapacity iterates a Go map, so if more than one rawSerial is a suffix of drive.SerialId the chosen capacity is non-deterministic (map iteration order is randomized). It would be safer to collect all suffix matches and (a) require exactly one match, or (b) prefer the longest matching suffix, and error if the match is ambiguous.

Suggested change
// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
for rawSerial, cap := range rawDriveCapacity {
if strings.HasSuffix(drive.SerialId, rawSerial) {
capacity = cap
ok = true
break
}
}
if !ok {
return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
}
// Drives may include a SCSI NAA prefix that RawDrives lacks — prefer the longest matching suffix.
var matchedSerial string
var matchedCapacity int64
longestMatchLen := -1
ambiguous := false
for rawSerial, cap := range rawDriveCapacity {
if !strings.HasSuffix(drive.SerialId, rawSerial) {
continue
}
matchLen := len(rawSerial)
if matchLen > longestMatchLen {
matchedSerial = rawSerial
matchedCapacity = cap
longestMatchLen = matchLen
ambiguous = false
continue
}
if matchLen == longestMatchLen && rawSerial != matchedSerial {
ambiguous = true
}
}
if longestMatchLen == -1 {
return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
}
if ambiguous {
return fmt.Errorf("drive %s matches multiple RawDrives suffixes with the same specificity", drive.SerialId)
}
capacity = matchedCapacity
ok = true

Copilot uses AI. Check for mistakes.
Comment on lines 185 to 199
capacity, ok := rawDriveCapacity[drive.SerialId]
if !ok {
return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
for rawSerial, cap := range rawDriveCapacity {
if strings.HasSuffix(drive.SerialId, rawSerial) {
capacity = cap
ok = true
break
}
}
if !ok {
return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
}
}
if _, ok := seenDrives[drive.SerialId]; !ok {
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This capacity lookup now tolerates Drives[i].SerialId differing from RawDrives[j].SerialId (e.g., NAA prefix), but later in this function blockedDrives is built by iterating RawDrives and checking membership in seenDrives by serial string. If RawDrives uses the unprefixed serial while seenDrives keys are prefixed (from Drives), mounted drives that were previously used by Weka won’t be blocked. Consider normalizing serial IDs (or maintaining a raw->drive serial mapping) so all subsequent logic uses a consistent canonical serial.

Copilot uses AI. Check for mistakes.
@graphite-app
Copy link
Copy Markdown

graphite-app bot commented Apr 17, 2026

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (04/17/26)

3 reviewers were added to this PR based on Anton Bykov's automation.

@caspx caspx force-pushed the fix-drive-serial-naa-prefix branch from c93e0ec to 1b3fff8 Compare April 19, 2026 07:53
@claude
Copy link
Copy Markdown

claude bot commented Apr 19, 2026

Claude finished @caspx's task in 1m 53s —— View job


Claude finished @caspx's task —— View job


Code Review (updated — PR synchronize)

  • Gather context and understand the PR
  • Read changed files and analyze diff
  • Compare with previous review to identify new changes
  • Provide updated code review feedback

What changed since last review

All three issues from the previous Claude review and Copilot's get_device_serial_id SCSI path bug are now resolved. Good iteration.

Fixed:

  • get_serial_id_fallback() is no longer dead code — now used as the intermediate fallback in find_disks(): get_device_serial_id()get_serial_id_fallback() → lsblk serial (correct three-tier chain)
  • get_device_serial_id() SCSI/SATA path now correctly extracts the disk name using parts.index("block") instead of [-2]
  • Go ambiguity check: if two raw serials both suffix-match, the code now fails loudly with an error instead of picking one at random

One remaining concern

blockedDrives serial lookup can still mismatch after suffix-match (lines 239–249)

internal/controllers/wekacontainer/funcs_oneoff.go, lines 239–249

seenDrives is keyed by the Drives serial (NAA-prefixed, e.g. 3naa0000abcdef123). The blocked-drives loop iterates RawDrives and does seenDrives[drive.SerialId] where drive.SerialId is the raw unprefixed serial (abcdef123). The lookup always misses on OCI iSCSI, so a mounted drive that was previously used by Weka won't be added to blockedDrives.

This only matters in the scenario where the Go suffix-match fallback was needed (i.e., Python's get_device_serial_id failed and returned lsblk serial for RawDrives). In the primary path the Python fix aligns serials before they reach Go, making this a degraded-environment-only edge case. But if it fires, the drive would not be blocked and could be reallocated.

The fix would be to build a rawToFull mapping during the suffix-match loop and use it here, or to store RawDrives serials as keys in seenDrives alongside the full serial.

Fix this →


Overall

The Python changes are solid — the three-tier fallback chain is correct and the sysfs path parsing fix (parts.index("block")) is the right approach. The Go ambiguity guard is clean. The remaining concern above is an edge case in the degraded path, not the primary bug being fixed. The PR is in good shape.

Comment on lines +190 to +197
if !strings.HasSuffix(drive.SerialId, rawSerial) {
continue
}
if ok {
return fmt.Errorf("drive %s matches multiple RawDrives serials ambiguously", drive.SerialId)
}
capacity = capVal
ok = true
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this suffix-check if the root cause is fixed in weka_runtime?

# agree. For NVMe this reads sysfs /sys/.../nvme0/serial; for SCSI/SATA it
# reads udev ID_SERIAL (which includes the NAA prefix on SCSI).
# Fall back through udev-only helper, then to lsblk serial as last resort.
serial_id = await get_device_serial_id(device_path)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this change or is it enough to have a separate flow for "SCSI/SATA device" under else (line 702)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants