fix: include ssdproxy unaccounted DPDK memory on ssd_proxy_includes_dpdk_memory feature flag (OP-272) #2405
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Graphite Automations: "Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR (03/27/26). 2 reviewers were added to this PR based on Anton Bykov's automation.
Pull request overview
This PR adds a new image feature flag to control how SSDProxy hugepages memory is translated into the --memory/MEMORY value, so that (when enabled) SSDProxy includes previously unaccounted DPDK memory.
Changes:
- Add `ssd_proxy_includes_dpdk_memory` to the image feature flags model.
- Plumb feature flags into `resources.PodFactory` and `ensurePod`, and use them when computing hugepages memory details.
- Update `NewPodFactory` call sites in tests to match the new signature.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/pkg/domain/feature_flags.go | Adds the new feature flag field to the serialized flags struct. |
| internal/controllers/wekacontainer/funcs_pod_ensure.go | Fetches feature flags (except for the feature-flags ad-hoc mode) and passes them into PodFactory. |
| internal/controllers/resources/pod.go | Extends PodFactory to accept optional feature flags and adjusts hugepages memory calculation accordingly. |
| internal/controllers/wekacontainer/funcs_pod_ensure_test.go | Updates tests for the new NewPodFactory signature. |
| internal/controllers/operations/load_drivers_test.go | Updates tests for the new NewPodFactory signature. |
Force-pushed: 939c9c8 to d562610
Claude finished @kristina-solovyova's task in 2m 9s

Code Review

The implementation is clean and correct. A few notes below.

Correctness of the Copilot comments

Index numbering comment (

Mode check (

    if ff != nil && ff.SsdProxyIncludesDpdkMemory && container.Spec.Mode == weka.WekaContainerModeSSDProxy {

This comment is stale/wrong.

Feature flag fetching for all modes (

    if container.Spec.Mode == weka.WekaContainerModeSSDProxy {
        ff, err = r.GetFeatureFlags(ctx)
        ...
    }

All three Copilot comments are addressed or incorrect.

Issues Found

1. Inconsistent comment wording between Go and Python (minor)

    // flag 8 is not yet added
    # flag 8 is not used by the operator

"Not yet added" implies a future addition, while "not used by the operator" is a factual statement about the current state. The Python wording is more accurate since flag 8 exists in the spec but the operator simply doesn't need it. Suggest aligning the Go comment to match:

2. No unit tests for the new memory calculation path (moderate)

There are no test files in

What's Good

Summary: Two concerns: a trivial comment wording inconsistency and the absence of unit tests for the new memory calculation logic. The core implementation is correct and the Copilot review comments are either already addressed or wrong.
Force-pushed: d562610 to c186afd
Claude finished @kristina-solovyova's task in 39s

No code changes since the previous review on 2026-03-27. The branch tip is still

Previous review findings remain open:
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Force-pushed: 3eae9ca to 72d93be
Claude finished @kristina-solovyova's task in 4m 41s

Code Review

Changes Since Last Review

Both issues from the previous review are now addressed:

New Issues Found

1. Test reliability bug

The code path is:

    case weka.WekaContainerModeSSDProxy:
        offset = config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB // 0 in test env

Two equivalent fixes:

2. Upgrade risk: legacy proxy containers not patched (moderate)

After upgrade, The test documents this behavior (legacy case,

What's Good

Summary: The two issues from the previous review are resolved. Two new moderate issues remain: a test initialization bug that would cause
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
    # Default: 2048 (2 GiB)
    hugepagesOffsetMiB: 2048
    # Hugepages offset in MiB kept as a buffer from weka's --memory for SSD proxy containers.
    # When null, defaults to 200 MiB (defined in operator code).
hugepagesOffsetMiB changed semantics (from the previous default 2048 MiB to a “buffer kept out of --memory”, defaulting to 200 MiB). On Helm upgrades, many installs will retain the old stored value (2048) even if users never customized it, which can cause SSD proxy hugepages reservations to jump by an extra 2GiB per node (since DPDK 2048 MiB is now also added separately). It would be safer to add an explicit upgrade note here (or in chart docs/CHANGELOG) warning that existing releases should set this value to null/200 to avoid over-reserving hugepages after upgrade.
Suggested change:

    - # When null, defaults to 200 MiB (defined in operator code).
    + # When null, defaults to 200 MiB (defined in operator code).
    + # UPGRADE NOTE: this field previously used different semantics and many Helm upgrades
    + # may retain an older stored value of 2048 even when it was never explicitly customized.
    + # That retained value can cause SSD proxy hugepages reservation to increase by an extra
    + # 2 GiB per node after upgrade. For existing releases, explicitly set this to null or 200
    + # to use the new default behavior and avoid over-reserving hugepages.
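The over-reservation described in this comment can be checked with a small calculation. The following is only a sketch: the two formulas follow the `Hugepages` expressions visible in the `buildProxyContainerSpec` hunk in this thread, while the function names and the base value of 4000 MiB are illustrative assumptions, not operator code.

```go
package main

import "fmt"

// dpdkMemoryMiB is the DPDK amount quoted in the review (2048 MiB).
const dpdkMemoryMiB = 2048

// oldReservation models the previous spec: base hugepages plus the offset.
// Hypothetical helper, named for this example only.
func oldReservation(baseMiB, offsetMiB int) int {
	return baseMiB + offsetMiB
}

// newReservation models the spec after this PR: base plus DPDK memory plus
// the offset, which is now only a buffer. Hypothetical helper.
func newReservation(baseMiB, offsetMiB int) int {
	return baseMiB + dpdkMemoryMiB + offsetMiB
}

func main() {
	base := 4000 // illustrative base hugepages in MiB

	before := oldReservation(base, 2048)   // old default offset
	retained := newReservation(base, 2048) // Helm kept the old stored value
	fresh := newReservation(base, 200)     // new default buffer

	fmt.Println("extra vs. before upgrade (MiB):", retained-before)
	fmt.Println("extra vs. new default (MiB):", retained-fresh)
}
```

With a retained offset of 2048 the reservation grows by 2048 MiB relative to the pre-upgrade spec, which matches the "extra 2 GiB per node" in the comment.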
    // Legacy container (HugepagesOffset=0, no dpdk in Hugepages spec).
    // GetHugePagesOffset falls back to 200; dpdk is still subtracted.
    // memory = (hugepagesMiB+200) - (200+2048) = hugepagesMiB - 2048
    name: "legacy SSDProxy (HugepagesOffset=0), ff=nil",
    container: &weka.WekaContainer{
        ObjectMeta: metav1.ObjectMeta{Name: "ssdproxy-legacy"},
        Spec: weka.WekaContainerSpec{
            Mode:            weka.WekaContainerModeSSDProxy,
            Hugepages:       hugepagesMiB + offsetMiB, // 4200 (no dpdk)
            HugepagesOffset: 0,
            HugepagesSize:   "2Mi",
        },
    },
    ff:         nil,
    wantMemory: "1952MiB", // 4200 - (200+2048)
    },
These tests assume the SSDProxy default offset is 200 MiB when HugepagesOffset is unset, but GetHugePagesOffset() actually falls back to config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB, which is 0 in unit tests unless config.ConfigureEnv() is run. As written, the “legacy SSDProxy” and offset-fallback cases will fail (offset resolves to 0, changing expected --memory). Set config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB = 200 in the test setup (and restore the previous value with t.Cleanup) or otherwise initialize config for the test.
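One way to apply the suggested setup is a small override-and-restore helper. This is a minimal sketch under stated assumptions: `ssdProxyHugepagesOffsetMiB` is a local stand-in for `config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB`, and `overrideInt` is a hypothetical helper that does not exist in the repository.

```go
package main

import "fmt"

// ssdProxyHugepagesOffsetMiB stands in for the package-level config value,
// which is zero until the config is initialized.
var ssdProxyHugepagesOffsetMiB int

// overrideInt sets *target to value and returns a function that restores the
// previous value. In a real test the returned function would be handed to
// t.Cleanup so the global is reset even if the test fails.
func overrideInt(target *int, value int) (restore func()) {
	previous := *target
	*target = value
	return func() { *target = previous }
}

func main() {
	restore := overrideInt(&ssdProxyHugepagesOffsetMiB, 200)
	fmt.Println("during test:", ssdProxyHugepagesOffsetMiB)

	restore()
	fmt.Println("after cleanup:", ssdProxyHugepagesOffsetMiB)
}
```

Inside a test function this would read `t.Cleanup(overrideInt(&config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB, 200))`, assuming that field is an addressable `int`.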
    @@ -132,7 +134,8 @@ func (r *containerReconcilerLoop) buildProxyContainerSpec(ctx context.Context, c
         DriversLoaderImage: cluster.Spec.GetOverrides().DriversLoaderImage,
         DriversBuildId:     cluster.Spec.GetOverrides().DriversBuildId,
         Tolerations:        apiutil.ExpandTolerations([]v1.Toleration{}, cluster.Spec.Tolerations, cluster.Spec.RawTolerations),
    -    Hugepages:          hugepagesMiB + config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB,
    +    Hugepages:          hugepagesMiB + config.Consts.SsdProxyDpdkMemoryMiB + hugepagesOffset,
    +    HugepagesOffset:    hugepagesOffset,
         HugepagesSize:      "2Mi",
This new SSD proxy spec change (adding SsdProxyDpdkMemoryMiB and setting HugepagesOffset) will only apply when the proxy WekaContainer is first created. ensureProxyContainer() currently short-circuits when the proxy already exists, so clusters upgrading from older releases will keep legacy proxy specs (typically HugepagesOffset==0 and Hugepages not including the new DPDK+buffer split), which can lead to incorrect --memory calculations after this PR. Consider patching existing proxy containers to the new Hugepages / HugepagesOffset values (or explicitly detecting legacy specs and preserving the old accounting) so upgrades don’t silently change runtime memory behavior.
Reply: we will re-create the ssdproxy containers in this case.
Force-pushed: 72d93be to 829112d
Force-pushed: 829112d to 0736bc1
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
    // flag 8 is not used by the operator
    SsdProxyIncludesDpdkMemory bool `json:"ssd_proxy_includes_dpdk_memory"` // 9
While adding the new flag, note that FeatureFlags JSON keys must match what weka_runtime.py writes to feature_flags.json. The Python emitter uses allow_per_container_driver_interfaces (plural), but the existing Go field AllowPerContainerDriverInterfaces is tagged as allow_per_container_driver_interface (singular), so it will never be set on unmarshal. Please align the Go json tag with the emitted key (or update the emitter) so the flag works as intended.
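The effect of the tag mismatch can be reproduced with the standard `encoding/json` package. The sketch below uses two trimmed, hypothetical structs instead of the real `FeatureFlags` type; only the two JSON key spellings are taken from the comment above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flagsSingularTag mirrors the tag described as currently used on the Go side.
type flagsSingularTag struct {
	AllowPerContainerDriverInterfaces bool `json:"allow_per_container_driver_interface"`
}

// flagsPluralTag uses the key that the Python emitter is said to write.
type flagsPluralTag struct {
	AllowPerContainerDriverInterfaces bool `json:"allow_per_container_driver_interfaces"`
}

// decodeSingular unmarshals the payload with the singular tag.
func decodeSingular(payload []byte) (bool, error) {
	var f flagsSingularTag
	err := json.Unmarshal(payload, &f)
	return f.AllowPerContainerDriverInterfaces, err
}

// decodePlural unmarshals the payload with the plural tag.
func decodePlural(payload []byte) (bool, error) {
	var f flagsPluralTag
	err := json.Unmarshal(payload, &f)
	return f.AllowPerContainerDriverInterfaces, err
}

func main() {
	payload := []byte(`{"allow_per_container_driver_interfaces": true}`)

	singular, _ := decodeSingular(payload)
	plural, _ := decodePlural(payload)

	// The unknown key is silently ignored, so the singular tag stays false.
	fmt.Println("singular tag:", singular)
	fmt.Println("plural tag:", plural)
}
```

Because `json.Unmarshal` ignores unknown keys by default, the mismatch produces no error, which is why the flag can stay unset without anyone noticing.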
        return nil, errors.Wrap(err, "failed to calculate hugepages for proxy container")
    }

    hugepagesOffset := config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB
buildProxyContainerSpec derives hugepagesOffset directly from config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB, but the runtime --memory calculation uses GetHugePagesOffset() (which falls back to 200 when the config value is 0). If the env/config is ever 0 (e.g., tests or a user explicitly sets 0), the created SSD proxy spec and the memory calculation will disagree. Consider reusing the same offset derivation here (e.g., call resources.GetHugePagesOffset() or a shared helper) to keep the spec and accounting consistent.
Suggested change:

    - hugepagesOffset := config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB
    + hugepagesOffset := resources.GetHugePagesOffset()
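The disagreement this comment warns about can be shown with a few lines of arithmetic. This is a sketch only: `resolveOffset`, `specHugepages`, and `runtimeMemory` are hypothetical names, the 200 MiB fallback and 2048 MiB DPDK amount are the values quoted in this thread, and the base of 4000 MiB is illustrative.

```go
package main

import "fmt"

const (
	dpdkMemoryMiB    = 2048 // DPDK amount quoted in the review
	defaultOffsetMiB = 200  // fallback quoted in the review
)

// resolveOffset applies the fallback: a configured value of 0 means "use 200".
func resolveOffset(configuredMiB int) int {
	if configuredMiB == 0 {
		return defaultOffsetMiB
	}
	return configuredMiB
}

// specHugepages is what the proxy spec would request for a base and offset.
func specHugepages(baseMiB, offsetMiB int) int {
	return baseMiB + dpdkMemoryMiB + offsetMiB
}

// runtimeMemory is what the runtime side derives from the spec. It always
// goes through resolveOffset, as the memory calculation is described to do.
func runtimeMemory(hugepagesMiB, specOffsetMiB int) int {
	return hugepagesMiB - (resolveOffset(specOffsetMiB) + dpdkMemoryMiB)
}

func main() {
	base, configured := 4000, 0

	// Spec built from the raw config value: the two sides disagree.
	raw := specHugepages(base, configured)
	fmt.Println("raw config:", runtimeMemory(raw, configured)) // 3800

	// Spec built through the shared helper: memory equals the base again.
	shared := resolveOffset(configured)
	fmt.Println("shared helper:", runtimeMemory(specHugepages(base, shared), shared)) // 4000
}
```

With a configured value of 0, the spec built from the raw value loses 200 MiB of `--memory`, while routing both sides through one helper keeps them consistent.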
    .with_secret_variable("GEMINI_API_KEY", gemini_api_key)
    .with_secret_variable("GITHUB_TOKEN", gh_token)
    .with_env_variable("KUBECONFIG", "/.kube/config")
    .with_env_variable("LOG_FORMAT", "raw")
This PR is focused on SSD proxy hugepages/feature-flag behavior, but this hunk also changes the upgrade test container’s logging format (LOG_FORMAT=raw). If this is intentional, it would help to mention it in the PR description; otherwise consider dropping it to keep the PR scoped to the memory accounting change.

This change introduces a new feature flag `SsdProxyIncludesDpdkMemory` that modifies how hugepages memory is calculated for containers. When enabled, the full hugepages amount is passed to the `--memory` parameter instead of subtracting the DPDK offset, as the SSD proxy now accounts for DPDK memory internally.

The `PodFactory` constructor now accepts an optional `FeatureFlags` parameter, which is passed through to `GetHugePagesDetails()` to determine the appropriate memory calculation behavior. For ad-hoc containers used to fetch feature flags, `nil` is passed to maintain backward compatibility.
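A rough sketch of the calculation described here, not the operator's actual `GetHugePagesDetails()`: the `FeatureFlags` struct is trimmed to one field, `memoryArg` is a hypothetical name, and the constants are the 2048 MiB and 200 MiB values quoted in the review thread. The `nil` branch follows the legacy test case from the review (4200 MiB gives `1952MiB`). Subtracting only the buffer offset when the flag is on is an assumption drawn from the later review comments.

```go
package main

import "fmt"

// FeatureFlags is a trimmed stand-in for the operator's flags struct.
type FeatureFlags struct {
	SsdProxyIncludesDpdkMemory bool
}

const (
	dpdkMemoryMiB    = 2048 // DPDK amount quoted in the review
	defaultOffsetMiB = 200  // fallback offset quoted in the review
)

// memoryArg returns the --memory value for an SSD proxy container given its
// total hugepages and its spec offset (0 is treated as "unset").
func memoryArg(hugepagesMiB, offsetMiB int, ff *FeatureFlags) string {
	if offsetMiB == 0 {
		offsetMiB = defaultOffsetMiB
	}
	reserved := offsetMiB
	// With nil flags or the flag off, the proxy does not account for DPDK
	// memory itself, so it is subtracted here as well.
	if ff == nil || !ff.SsdProxyIncludesDpdkMemory {
		reserved += dpdkMemoryMiB
	}
	return fmt.Sprintf("%dMiB", hugepagesMiB-reserved)
}

func main() {
	// Legacy spec, no flags: 4200 - (200 + 2048).
	fmt.Println(memoryArg(4200, 0, nil)) // 1952MiB

	// New-style spec with the flag enabled: only the buffer is held back.
	fmt.Println(memoryArg(6248, 200, &FeatureFlags{SsdProxyIncludesDpdkMemory: true})) // 6048MiB
}
```

Taking the flags as a pointer keeps `nil` meaningful: callers that cannot fetch flags, such as the ad-hoc container mentioned above, fall through to the pre-flag behavior.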