Skip to content

fix: include ssdproxy unaccounted DPDK memory on ssd_proxy_includes_dpdk_memory feature flag (OP-272)#2405

Open
kristina-solovyova wants to merge 1 commit intomainfrom
03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_
Open

fix: include ssdproxy unaccounted DPDK memory on ssd_proxy_includes_dpdk_memory feature flag (OP-272)#2405
kristina-solovyova wants to merge 1 commit intomainfrom
03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_

Conversation

@kristina-solovyova
Copy link
Copy Markdown
Collaborator

@kristina-solovyova kristina-solovyova commented Mar 27, 2026

This change introduces a new feature flag SsdProxyIncludesDpdkMemory that modifies how hugepages memory is calculated for containers. When enabled, the full hugepages amount is passed to the --memory parameter instead of subtracting the DPDK offset, as the SSD proxy now accounts for DPDK memory internally.

The PodFactory constructor now accepts an optional FeatureFlags parameter, which is passed through to GetHugePagesDetails() to determine the appropriate memory calculation behavior. For ad-hoc containers used to fetch feature flags, nil is passed to maintain backward compatibility.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 27, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@kristina-solovyova kristina-solovyova marked this pull request as ready for review March 27, 2026 12:17
@kristina-solovyova kristina-solovyova requested a review from a team as a code owner March 27, 2026 12:17
Copilot AI review requested due to automatic review settings March 27, 2026 12:17
Copy link
Copy Markdown
Collaborator Author

kristina-solovyova commented Mar 27, 2026


How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@graphite-app
Copy link
Copy Markdown

graphite-app bot commented Mar 27, 2026

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (03/27/26)

2 reviewers were added to this PR based on Anton Bykov's automation.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new image feature flag to control how SSDProxy hugepages memory is translated into the --memory/MEMORY value, so that (when enabled) SSDProxy includes previously unaccounted DPDK memory.

Changes:

  • Add ssd_proxy_includes_dpdk_memory to the image feature flags model.
  • Plumb feature flags into resources.PodFactory and ensurePod, and use them when computing hugepages memory details.
  • Update NewPodFactory call sites in tests to match the new signature.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/pkg/domain/feature_flags.go Adds the new feature flag field to the serialized flags struct.
internal/controllers/wekacontainer/funcs_pod_ensure.go Fetches feature flags (except for the feature-flags ad-hoc mode) and passes them into PodFactory.
internal/controllers/resources/pod.go Extends PodFactory to accept optional feature flags and adjusts hugepages memory calculation accordingly.
internal/controllers/wekacontainer/funcs_pod_ensure_test.go Updates tests for the new NewPodFactory signature.
internal/controllers/operations/load_drivers_test.go Updates tests for the new NewPodFactory signature.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controllers/wekacontainer/funcs_pod_ensure.go
Comment thread internal/pkg/domain/feature_flags.go Outdated
Comment thread internal/controllers/resources/pod.go Outdated
@kristina-solovyova kristina-solovyova force-pushed the 03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_ branch from 939c9c8 to d562610 Compare March 27, 2026 12:32
@weka weka deleted a comment from claude bot Mar 27, 2026
@claude
Copy link
Copy Markdown

claude bot commented Mar 27, 2026

Claude finished @kristina-solovyova's task in 2m 9s —— View job


Code Review

  • Gather context from PR
  • Read changed files
  • Analyze code quality and correctness
  • Post review feedback

The implementation is clean and correct. A few notes below.


Correctness of the Copilot comments

Index numbering comment (feature_flags.go): Copilot suggested renaming // 9 to // 8. This is incorrect — the comment index is the bit position in the bitmap, not a sequential counter. ssd_proxy_includes_dpdk_memory lives at bit 9 in the spec, and the Python FeaturesFlags class confirms this. The current numbering is right; the "flag 8 is not yet added" comment correctly documents the gap.

Mode check (pod.go): Copilot claimed the SSDProxy mode guard was missing in GetHugePagesDetails. It's actually there:

if ff != nil && ff.SsdProxyIncludesDpdkMemory && container.Spec.Mode == weka.WekaContainerModeSSDProxy {

This comment is stale/wrong.

Feature flag fetching for all modes (funcs_pod_ensure.go): Copilot said flags are fetched for all non-adhoc modes. The actual implementation correctly scopes the fetch to SSDProxy only:

if container.Spec.Mode == weka.WekaContainerModeSSDProxy {
    ff, err = r.GetFeatureFlags(ctx)
    ...
}

All three Copilot comments are addressed or incorrect.


Issues Found

1. Inconsistent comment wording between Go and Python (minor)

feature_flags.go:15:

// flag 8 is not yet added

weka_runtime.py:292:

# flag 8 is not used by the operator

"Not yet added" implies a future addition, while "not used by the operator" is a factual statement about the current state. The Python wording is more accurate since flag 8 exists in the spec but the operator simply doesn't need it. Suggest aligning the Go comment to match: // flag 8 is not used by the operator. Fix this →

2. No unit tests for the new memory calculation path (moderate)

There are no test files in internal/controllers/resources/ covering GetHugePagesDetails. The new conditional branch (SsdProxyIncludesDpdkMemory == true → pass full hugepages, false → subtract offset) is untested. Given that a wrong value here directly affects pod resource limits and DPDK memory usage, a table-driven test covering the four cases (flag nil, flag false, flag true+SSDProxy, flag true+non-SSDProxy) would provide good protection. Fix this →


What's Good

  • Computing hgDetails once at the top of Create() and threading it through to setResources() is a clean refactor — avoids repeated computation and makes the flow explicit.
  • Removing the getHugePagesDetails() and getHugePagesOffset() private wrapper methods reduces indirection.
  • Nil-check on ff before accessing the flag makes backward compatibility explicit and safe (nil → old behavior).
  • Limiting GetFeatureFlags calls to SSDProxy mode only is the right performance call — no extra API calls for other container modes.
  • The defensive mode check in GetHugePagesDetails (container.Spec.Mode == weka.WekaContainerModeSSDProxy) is a good guard even though ff will only be non-nil for SSDProxy at current call sites.

Summary: Two concerns — a trivial comment wording inconsistency and the absence of unit tests for the new memory calculation logic. The core implementation is correct and the Copilot review comments are either already addressed or wrong.

Copilot AI review requested due to automatic review settings March 31, 2026 06:05
@kristina-solovyova kristina-solovyova force-pushed the 03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_ branch from d562610 to c186afd Compare March 31, 2026 06:05
@claude
Copy link
Copy Markdown

claude bot commented Mar 31, 2026

Claude finished @kristina-solovyova's task in 39s —— View job


No code changes since the previous review on 2026-03-27. The branch tip is still c186afd6.

Previous review findings remain open:

  1. Comment wording (feature_flags.go:15): // flag 8 is not yet added should be // flag 8 is not used by the operator to match weka_runtime.py.
  2. Missing unit tests: GetHugePagesDetails has no test coverage for the four cases (ff=nil, ff.SsdProxyIncludesDpdkMemory=false, true+SSDProxy, true+non-SSDProxy).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controllers/wekacontainer/funcs_proxy.go Outdated
Comment thread internal/controllers/resources/pod.go Outdated
Comment thread internal/controllers/resources/pod.go
Comment thread internal/controllers/resources/pod.go Outdated
@kristina-solovyova kristina-solovyova force-pushed the 03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_ branch 2 times, most recently from 3eae9ca to 72d93be Compare April 15, 2026 13:13
Copilot AI review requested due to automatic review settings April 15, 2026 13:13
@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Claude finished @kristina-solovyova's task in 4m 41s —— View job


Code Review

  • Gather context from PR
  • Read changed files
  • Compare against previous review findings
  • Post review feedback

Changes Since Last Review

Both issues from the previous review are now addressed:

  • Comment wording (feature_flags.go:15): now reads // flag 8 is not used by the operator — matches Python
  • Unit tests (pod_hugepages_test.go): added with coverage for ff=nil, flag=false, flag=true+SSDProxy, flag=true+non-SSDProxy, legacy SSDProxy, and 1Gi hugepages

New Issues Found

1. Test reliability bug — GetHugePagesOffset SSDProxy fallback (moderate)

TestGetHugePagesOffset has a case "SSDProxy with no HugepagesOffset uses 200 default" (want: 200), and TestGetHugePagesDetails has the legacy SSDProxy case expecting wantMemory: "1952MiB" — both rely on the offset being 200 for an SSDProxy container with HugepagesOffset=0.

The code path is:

case weka.WekaContainerModeSSDProxy:
    offset = config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB  // 0 in test env

ConfigureEnv() is not called in tests, so SsdProxyHugepagesOffsetMiB is 0 (zero value). There is no fallback to 200 for the SSDProxy case — the default: offset = 200 branch is unreachable here. Both test cases will fail with got=0/got=2152 vs want=200/want=1952.

Two equivalent fixes:

  • Add a zero-guard in GetHugePagesOffset (pod.go:1063): if offset == 0 { offset = 200 } after the SSDProxy assignment, to match ConfigureEnv's default
  • Or add config setup in the test file: a TestMain that sets config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB = 200

Fix this →

2. Upgrade risk — legacy proxy containers not patched (moderate)

ensureProxyContainer() (funcs_proxy.go:63) returns early without update when a proxy CR already exists. Legacy CRs created before this PR have:

  • Hugepages = base + offset (no DPDK memory added)
  • HugepagesOffset = 0

After upgrade, GetHugePagesDetails computes: offset = 200 (fallback) + 2048 (DPDK, since ff=nil/false) = 2248, and --memory = (base + 200) - 2248 = base - 2048. The --memory for those containers drops by 2048 MiB below what's actually allocated in hugepages — since DPDK memory was never added to their spec. This could cause weka to fail on restart.

The test documents this behavior (legacy case, wantMemory: "1952MiB" = 4000 − 2048) but doesn't address it. Either:

  • Add a patch/update path in ensureProxyContainer() to refresh Hugepages and HugepagesOffset when they differ from desired values, or
  • Document this as a known upgrade caveat requiring manual proxy container recreation

What's Good

  • pod_hugepages_test.go is well-structured and covers all the meaningful cases including the legacy container scenario
  • GetHugePagesDetails doc comment now correctly references config.Consts.SsdProxyDpdkMemoryMiB (Copilot's doc mismatch concern from Mar 31 is fixed)
  • The featureFlags field comment (pod.go:71) is clarified — the Mar 31 Copilot suggestion was accepted

Summary: The two issues from the previous review are resolved. Two new moderate issues remain: a test initialization bug that would cause TestGetHugePagesOffset and the legacy TestGetHugePagesDetails case to fail, and the unaddressed upgrade risk for pre-existing proxy containers.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Default: 2048 (2 GiB)
hugepagesOffsetMiB: 2048
# Hugepages offset in MiB kept as a buffer from weka's --memory for SSD proxy containers.
# When null, defaults to 200 MiB (defined in operator code).
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hugepagesOffsetMiB changed semantics (from the previous default 2048 MiB to a “buffer kept out of --memory”, defaulting to 200 MiB). On Helm upgrades, many installs will retain the old stored value (2048) even if users never customized it, which can cause SSD proxy hugepages reservations to jump by an extra 2GiB per node (since DPDK 2048 MiB is now also added separately). It would be safer to add an explicit upgrade note here (or in chart docs/CHANGELOG) warning that existing releases should set this value to null/200 to avoid over-reserving hugepages after upgrade.

Suggested change
# When null, defaults to 200 MiB (defined in operator code).
# When null, defaults to 200 MiB (defined in operator code).
# UPGRADE NOTE: this field previously used different semantics and many Helm upgrades
# may retain an older stored value of 2048 even when it was never explicitly customized.
# That retained value can cause SSD proxy hugepages reservation to increase by an extra
# 2 GiB per node after upgrade. For existing releases, explicitly set this to null or 200
# to use the new default behavior and avoid over-reserving hugepages.

Copilot uses AI. Check for mistakes.
Comment on lines +77 to +92
// Legacy container (HugepagesOffset=0, no dpdk in Hugepages spec).
// GetHugePagesOffset falls back to 200; dpdk is still subtracted.
// memory = (hugepagesMiB+200) - (200+2048) = hugepagesMiB - 2048
name: "legacy SSDProxy (HugepagesOffset=0), ff=nil",
container: &weka.WekaContainer{
ObjectMeta: metav1.ObjectMeta{Name: "ssdproxy-legacy"},
Spec: weka.WekaContainerSpec{
Mode: weka.WekaContainerModeSSDProxy,
Hugepages: hugepagesMiB + offsetMiB, // 4200 (no dpdk)
HugepagesOffset: 0,
HugepagesSize: "2Mi",
},
},
ff: nil,
wantMemory: "1952MiB", // 4200 - (200+2048)
},
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests assume the SSDProxy default offset is 200 MiB when HugepagesOffset is unset, but GetHugePagesOffset() actually falls back to config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB, which is 0 in unit tests unless config.ConfigureEnv() is run. As written, the “legacy SSDProxy” and offset-fallback cases will fail (offset resolves to 0, changing expected --memory). Set config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB = 200 in the test setup (and restore the previous value with t.Cleanup) or otherwise initialize config for the test.

Copilot uses AI. Check for mistakes.
Comment on lines 118 to 139
@@ -132,7 +134,8 @@ func (r *containerReconcilerLoop) buildProxyContainerSpec(ctx context.Context, c
DriversLoaderImage: cluster.Spec.GetOverrides().DriversLoaderImage,
DriversBuildId: cluster.Spec.GetOverrides().DriversBuildId,
Tolerations: apiutil.ExpandTolerations([]v1.Toleration{}, cluster.Spec.Tolerations, cluster.Spec.RawTolerations),
Hugepages: hugepagesMiB + config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB,
Hugepages: hugepagesMiB + config.Consts.SsdProxyDpdkMemoryMiB + hugepagesOffset,
HugepagesOffset: hugepagesOffset,
HugepagesSize: "2Mi",
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new SSD proxy spec change (adding SsdProxyDpdkMemoryMiB and setting HugepagesOffset) will only apply when the proxy WekaContainer is first created. ensureProxyContainer() currently short-circuits when the proxy already exists, so clusters upgrading from older releases will keep legacy proxy specs (typically HugepagesOffset==0 and Hugepages not including the new DPDK+buffer split), which can lead to incorrect --memory calculations after this PR. Consider patching existing proxy containers to the new Hugepages / HugepagesOffset values (or explicitly detecting legacy specs and preserving the old accounting) so upgrades don’t silently change runtime memory behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will re-create ssdproxy containers in this case

@kristina-solovyova kristina-solovyova force-pushed the 03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_ branch from 72d93be to 829112d Compare April 16, 2026 10:13
Copilot AI review requested due to automatic review settings April 16, 2026 10:22
@kristina-solovyova kristina-solovyova force-pushed the 03-27-fix_include_ssdproxy_unaccounted_dpdk_memory_on_ssd_proxy_includes_dpdk_memory_feature_flag_op-272_ branch from 829112d to 0736bc1 Compare April 16, 2026 10:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +15 to +16
// flag 8 is not used by the operator
SsdProxyIncludesDpdkMemory bool `json:"ssd_proxy_includes_dpdk_memory"` // 9
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While adding the new flag, note that FeatureFlags JSON keys must match what weka_runtime.py writes to feature_flags.json. The Python emitter uses allow_per_container_driver_interfaces (plural), but the existing Go field AllowPerContainerDriverInterfaces is tagged as allow_per_container_driver_interface (singular), so it will never be set on unmarshal. Please align the Go json tag with the emitted key (or update the emitter) so the flag works as intended.

Copilot uses AI. Check for mistakes.
return nil, errors.Wrap(err, "failed to calculate hugepages for proxy container")
}

hugepagesOffset := config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildProxyContainerSpec derives hugepagesOffset directly from config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB, but the runtime --memory calculation uses GetHugePagesOffset() (which falls back to 200 when the config value is 0). If the env/config is ever 0 (e.g., tests or a user explicitly sets 0), the created SSD proxy spec and the memory calculation will disagree. Consider reusing the same offset derivation here (e.g., call resources.GetHugePagesOffset() or a shared helper) to keep the spec and accounting consistent.

Suggested change
hugepagesOffset := config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB
hugepagesOffset := resources.GetHugePagesOffset()

Copilot uses AI. Check for mistakes.
.with_secret_variable("GEMINI_API_KEY", gemini_api_key)
.with_secret_variable("GITHUB_TOKEN", gh_token)
.with_env_variable("KUBECONFIG", "/.kube/config")
.with_env_variable("LOG_FORMAT", "raw")
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is focused on SSD proxy hugepages/feature-flag behavior, but this hunk also changes the upgrade test container’s logging format (LOG_FORMAT=raw). If this is intentional, it would help to mention it in the PR description; otherwise consider dropping it to keep the PR scoped to the memory accounting change.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants