fix: align drive serial source in find_disks() to prevent SCSI NAA prefix mismatch (OP-290) by caspx · Pull Request #2466 · weka/weka-operator

caspx · 2026-04-17T13:00:07Z

TL;DR

Drive discovery fails on SCSI-attached storage (OCI iSCSI and similar) because RawDrives and Drives are built from different serial sources. Align the sources at the root cause (Python) and add a defensive fallback (Go).

What changed?

charts/weka-operator/resources/weka_runtime.py — find_disks() now calls the existing get_device_serial_id() helper, the same logic find_weka_drives() uses. RawDrives and
Drives now pull the serial from the same place (sysfs on NVMe, udev ID_SERIAL on SCSI). lsblk kept as last-resort fallback.
internal/controllers/wekacontainer/funcs_oneoff.go — on exact-match miss, try suffix match against RawDrives keys before erroring. Protects against any future serial-format
drift, not just today's NAA prefix.

How to test?

Clear node annotations: kubectl annotate nodes --all weka.io/sign-drives-hash- weka.io/weka-full-drives- weka.io/drivers-loaded-
Reinstall operator with this chart, apply a WekaCluster on an OCI bliss cluster.
Verify the sign-drives policy succeeds and weka.io/weka-full-drives is populated with non-zero capacity.
Regression check on an NVMe-backed cluster — sign-drives still succeeds.

Why make this change?

Before OP-280 (commit b03ec7fc), a serial mismatch silently set drive capacity to 0. OP-280 tightened this into a hard error, which surfaced the bug on OCI iSCSI where lsblk SERIAL returns the short kernel serial while udev ID_SERIAL includes the full SCSI NAA prefix (3 + vendor OUI + kernel serial). The cross-reference in funcs_oneoff.go uses
exact match, so these never matched and clusters stalled in Init.

github-actions · 2026-04-17T13:00:16Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

caspx · 2026-04-17T13:00:21Z

How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has required the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-04-17T13:00:31Z

Claude Code is working…

I'll analyze this and get back to you.

View job run

claude · 2026-04-17T13:01:11Z

Claude finished @caspx's task in 2m 29s —— View job

Code Review

Gather context and understand the PR
Read changed files and analyze diff
Provide code review feedback

Summary

The PR fixes a serial ID mismatch between find_disks() and find_weka_drives() / the Weka API: find_disks() was reading lsblk SERIAL (no NAA prefix for SCSI), while the other path reads udev ID_SERIAL (which includes the NAA prefix). The fix aligns both to use the same source. A Go-side fallback is added for defense-in-depth.

The root cause analysis is correct and the Python fix is sound. A few issues worth addressing below.

Issues

get_serial_id_fallback() is now dead code

charts/weka-operator/resources/weka_runtime.py, line 386

get_serial_id_fallback() was previously the fallback inside find_disks(). The new code replaces it with a direct fall-through to lsblk's serial field. The function is now unreferenced and should be removed to avoid confusion. Fix this →

Fallback path in find_disks() reverts to the old (mismatched) serial source

charts/weka-operator/resources/weka_runtime.py, line ~497

if not serial_id:
    logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
    serial_id = (device.get("serial") or "").strip() or None

If get_device_serial_id() fails (e.g. sysfs/udev unavailable), the fallback is the lsblk SERIAL field — the same source that caused the bug. The drive could end up registered without the NAA prefix, reproducing the mismatch in degraded environments. The old get_serial_id_fallback() used the udev path (same as get_device_serial_id()) and would be more consistent. Consider keeping it as the fallback instead of lsblk serial, and only using lsblk as the last resort. Fix this →

Go suffix match is non-deterministic when multiple raw serials match

internal/controllers/wekacontainer/funcs_oneoff.go, lines 188–195

for rawSerial, cap := range rawDriveCapacity {
    if strings.HasSuffix(drive.SerialId, rawSerial) {
        capacity = cap
        ok = true
        break
    }
}

Map iteration order in Go is random. If two different raw serials are both suffixes of drive.SerialId (e.g. "abc" and "bc" both match "naa.0000abc"), the matched capacity is non-deterministic. In practice this shouldn't happen since real serials are 20+ hex characters, but a short comment asserting the uniqueness assumption would help reviewers. Consider also adding a check that only one match is found and logging a warning if multiple are found.

Go suffix match only handles one direction

internal/controllers/wekacontainer/funcs_oneoff.go, lines 188–195

The code assumes the NAA prefix is always on the Drives side (from the Weka API) and the raw serial is always shorter. If the NAA prefix ever appears on the RawDrives side instead, the fallback won't help. This is fine if the data flow is well-understood, but worth noting explicitly in the comment.

Minor

The NVMe path in find_weka_drives() navigates 3 directories up (pci_device_path.split("/")[:-2]) while get_device_serial_id() goes only 1 up (split("/")[:-1]). This asymmetry suggests get_device_serial_id() and find_weka_drives() may resolve the NVMe controller directory differently. Worth verifying these produce the same result on NVMe hardware.

Overall

The Python fix is the right approach and should resolve the SCSI NAA mismatch for the primary code path. The two concerns above (dead code + fallback reintroducing the bug in degraded environments) are worth addressing. The Go fallback is good defensive coding for in-progress rollouts.

Copilot

Pull request overview

This PR aims to eliminate drive-serial mismatches between Drives and RawDrives during drive discovery (notably SCSI NAA prefix differences) so the controller can reliably correlate capacity and other metadata.

Changes:

Update find_disks() to use get_device_serial_id() (matching find_weka_drives()’s serial resolution intent) and adjust fallback behavior.
Add a suffix-based serial match fallback in the Go controller when correlating Drives to RawDrives for capacity lookup.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
internal/controllers/wekacontainer/funcs_oneoff.go	Adds suffix-match fallback when `drive.SerialId` isn’t found in `RawDrives` map.
charts/weka-operator/resources/weka_runtime.py	Switches `find_disks()` serial source to `get_device_serial_id()` to align RawDrives/Drives serial formats.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-17T13:04:40Z

+                logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
+                serial_id = (device.get("serial") or "").strip() or None


find_disks() now relies on get_device_serial_id(device_path), but get_device_serial_id’s SCSI/SATA branch derives device_base_name from readlink -f /sys/class/block/<disk> via split('/')[-2], which for disks like /dev/sda resolves to .../block/sda and yields device_base_name == "block". That makes the udev lookup fail and forces the fallback to lsblk serial (reintroducing the RawDrives/Drives mismatch this PR is trying to fix). Consider fixing get_device_serial_id to use the actual block device name (e.g., device_name / split('/')[-1]) for /sys/block/<name>/dev and udev lookup, and consider falling back to get_serial_id_fallback() before using lsblk’s serial column.

Suggested change

logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")

serial_id = (device.get("serial") or "").strip() or None

device_name = os.path.basename(device_path)

logging.warning(

f"get_device_serial_id returned empty for {device_path}, trying get_serial_id_fallback before lsblk serial"

)

serial_id = get_serial_id_fallback(device_name)

if not serial_id:

logging.warning(f"get_serial_id_fallback returned empty for {device_path}, falling back to lsblk serial")

serial_id = (device.get("serial") or "").strip() or None

Copilot · 2026-04-17T13:04:40Z

+			// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
+			for rawSerial, cap := range rawDriveCapacity {
+				if strings.HasSuffix(drive.SerialId, rawSerial) {
+					capacity = cap
+					ok = true
+					break
+				}
+			}
+			if !ok {
+				return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+			}


The suffix-match fallback over rawDriveCapacity iterates a Go map, so if more than one rawSerial is a suffix of drive.SerialId the chosen capacity is non-deterministic (map iteration order is randomized). It would be safer to collect all suffix matches and (a) require exactly one match, or (b) prefer the longest matching suffix, and error if the match is ambiguous.

Suggested change

// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match

for rawSerial, cap := range rawDriveCapacity {

if strings.HasSuffix(drive.SerialId, rawSerial) {

capacity = cap

ok = true

break

}

}

if !ok {

return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)

}

// Drives may include a SCSI NAA prefix that RawDrives lacks — prefer the longest matching suffix.

var matchedSerial string

var matchedCapacity int64

longestMatchLen := -1

ambiguous := false

for rawSerial, cap := range rawDriveCapacity {

if !strings.HasSuffix(drive.SerialId, rawSerial) {

continue

}

matchLen := len(rawSerial)

if matchLen > longestMatchLen {

matchedSerial = rawSerial

matchedCapacity = cap

longestMatchLen = matchLen

ambiguous = false

continue

}

if matchLen == longestMatchLen && rawSerial != matchedSerial {

ambiguous = true

}

}

if longestMatchLen == -1 {

return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)

}

if ambiguous {

return fmt.Errorf("drive %s matches multiple RawDrives suffixes with the same specificity", drive.SerialId)

}

capacity = matchedCapacity

ok = true

Copilot · 2026-04-17T13:04:41Z

 		capacity, ok := rawDriveCapacity[drive.SerialId]
 		if !ok {
-			return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+			// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
+			for rawSerial, cap := range rawDriveCapacity {
+				if strings.HasSuffix(drive.SerialId, rawSerial) {
+					capacity = cap
+					ok = true
+					break
+				}
+			}
+			if !ok {
+				return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+			}
 		}
 		if _, ok := seenDrives[drive.SerialId]; !ok {


This capacity lookup now tolerates Drives[i].SerialId differing from RawDrives[j].SerialId (e.g., NAA prefix), but later in this function blockedDrives is built by iterating RawDrives and checking membership in seenDrives by serial string. If RawDrives uses the unprefixed serial while seenDrives keys are prefixed (from Drives), mounted drives that were previously used by Weka won’t be blocked. Consider normalizing serial IDs (or maintaining a raw->drive serial mapping) so all subsequent logic uses a consistent canonical serial.

graphite-app · 2026-04-17T13:05:10Z

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (04/17/26)

3 reviewers were added to this PR based on Anton Bykov's automation.

…efix mismatch (OP-290)

claude · 2026-04-19T07:53:49Z

Claude finished @caspx's task in 1m 53s —— View job

Claude finished @caspx's task —— View job

Code Review (updated — PR synchronize)

Gather context and understand the PR
Read changed files and analyze diff
Compare with previous review to identify new changes
Provide updated code review feedback

What changed since last review

All three issues from the previous Claude review and Copilot's get_device_serial_id SCSI path bug are now resolved. Good iteration.

Fixed:

get_serial_id_fallback() is no longer dead code — now used as the intermediate fallback in find_disks(): get_device_serial_id() → get_serial_id_fallback() → lsblk serial (correct three-tier chain)
get_device_serial_id() SCSI/SATA path now correctly extracts the disk name using parts.index("block") instead of [-2]
Go ambiguity check: if two raw serials both suffix-match, the code now fails loudly with an error instead of picking one at random

One remaining concern

blockedDrives serial lookup can still mismatch after suffix-match (lines 239–249)

internal/controllers/wekacontainer/funcs_oneoff.go, lines 239–249

seenDrives is keyed by the Drives serial (NAA-prefixed, e.g. 3naa0000abcdef123). The blocked-drives loop iterates RawDrives and does seenDrives[drive.SerialId] where drive.SerialId is the raw unprefixed serial (abcdef123). The lookup always misses on OCI iSCSI, so a mounted drive that was previously used by Weka won't be added to blockedDrives.

This only matters in the scenario where the Go suffix-match fallback was needed (i.e., Python's get_device_serial_id failed and returned lsblk serial for RawDrives). In the primary path the Python fix aligns serials before they reach Go, making this a degraded-environment-only edge case. But if it fires, the drive would not be blocked and could be reallocated.

The fix would be to build a rawToFull mapping during the suffix-match loop and use it here, or to store RawDrives serials as keys in seenDrives alongside the full serial.

Fix this →

Overall

The Python changes are solid — the three-tier fallback chain is correct and the sysfs path parsing fix (parts.index("block")) is the right approach. The Go ambiguity guard is clean. The remaining concern above is an edge case in the degraded path, not the primary bug being fixed. The PR is in good shape.

kristina-solovyova · 2026-04-20T07:49:40Z

+				if !strings.HasSuffix(drive.SerialId, rawSerial) {
+					continue
+				}
+				if ok {
+					return fmt.Errorf("drive %s matches multiple RawDrives serials ambiguously", drive.SerialId)
+				}
+				capacity = capVal
+				ok = true


do we still need this suffix-check if the root cause is fixed in weka_runtime?

kristina-solovyova · 2026-04-20T07:51:34Z

+            # agree. For NVMe this reads sysfs /sys/.../nvme0/serial; for SCSI/SATA it
+            # reads udev ID_SERIAL (which includes the NAA prefix on SCSI).
+            # Fall back through udev-only helper, then to lsblk serial as last resort.
+            serial_id = await get_device_serial_id(device_path)


do we need this change or is it enough to have a separate flow for "SCSI/SATA device" under else (line 702)?

caspx marked this pull request as ready for review April 17, 2026 13:00

caspx requested a review from a team as a code owner April 17, 2026 13:00

Copilot AI review requested due to automatic review settings April 17, 2026 13:00

Copilot started reviewing on behalf of caspx April 17, 2026 13:00 View session

graphite-app bot requested review from assafgi, kristina-solovyova and tigrawap April 17, 2026 13:01

Copilot AI reviewed Apr 17, 2026

View reviewed changes

fix: align drive serial source in find_disks() to prevent SCSI NAA pr…

1b3fff8

…efix mismatch (OP-290)

caspx force-pushed the fix-drive-serial-naa-prefix branch from c93e0ec to 1b3fff8 Compare April 19, 2026 07:53

kristina-solovyova reviewed Apr 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: align drive serial source in find_disks() to prevent SCSI NAA prefix mismatch (OP-290)#2466

fix: align drive serial source in find_disks() to prevent SCSI NAA prefix mismatch (OP-290)#2466
caspx wants to merge 1 commit intomainfrom
fix-drive-serial-naa-prefix

caspx commented Apr 17, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

caspx commented Apr 17, 2026

Uh oh!

claude bot commented Apr 17, 2026

Uh oh!

claude bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

graphite-app bot commented Apr 17, 2026

Uh oh!

claude bot commented Apr 19, 2026 •

edited

Loading

Uh oh!

kristina-solovyova Apr 20, 2026

Uh oh!

kristina-solovyova Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		logging.warning(f"get_device_serial_id returned empty for {device_path}, falling back to lsblk serial")
		serial_id = (device.get("serial") or "").strip() or None

-			// Drives may include a SCSI NAA prefix that RawDrives lacks — try suffix match
-			for rawSerial, cap := range rawDriveCapacity {
-				if strings.HasSuffix(drive.SerialId, rawSerial) {
-					capacity = cap
-					ok = true
-					break
-				}
-			}
-			if !ok {
-				return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
-			}
+			// Drives may include a SCSI NAA prefix that RawDrives lacks — prefer the longest matching suffix.
+			var matchedSerial string
+			var matchedCapacity int64
+			longestMatchLen := -1
+			ambiguous := false
+			for rawSerial, cap := range rawDriveCapacity {
+				if !strings.HasSuffix(drive.SerialId, rawSerial) {
+					continue
+				}
+				matchLen := len(rawSerial)
+				if matchLen > longestMatchLen {
+					matchedSerial = rawSerial
+					matchedCapacity = cap
+					longestMatchLen = matchLen
+					ambiguous = false
+					continue
+				}
+				if matchLen == longestMatchLen && rawSerial != matchedSerial {
+					ambiguous = true
+				}
+			}
+			if longestMatchLen == -1 {
+				return fmt.Errorf("drive %s present in Drives but missing from RawDrives", drive.SerialId)
+			}
+			if ambiguous {
+				return fmt.Errorf("drive %s matches multiple RawDrives suffixes with the same specificity", drive.SerialId)
+			}
+			capacity = matchedCapacity
+			ok = true

Conversation

caspx commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

What changed?

How to test?

Why make this change?

Uh oh!

github-actions bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Review

Scanned Files

Uh oh!

caspx commented Apr 17, 2026

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Apr 17, 2026

Uh oh!

claude bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Summary

Issues

Minor

Overall

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

graphite-app bot commented Apr 17, 2026

Graphite Automations

Uh oh!

claude bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review (updated — PR synchronize)

What changed since last review

One remaining concern

Overall

Uh oh!

kristina-solovyova Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

kristina-solovyova Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

caspx commented Apr 17, 2026 •

edited

Loading

github-actions bot commented Apr 17, 2026 •

edited

Loading

claude bot commented Apr 17, 2026 •

edited

Loading

claude bot commented Apr 19, 2026 •

edited

Loading