fix: chart install failures on Filestore/NFS/FSx and --runner-max flag mismatch#70
Open
PoslavskySV wants to merge 2 commits into main
…Container
The pre-install Job (hook-chown-workspace.yaml) mounted the workspace
PVC to run chown, but the PVC is rendered as a regular manifest — not
a pre-install hook — so on fresh install the hook Job fires BEFORE the
PVC exists. The Job pod stays Pending forever, helm install times out,
deployment never happens.
This is a latent bug that affects every workspace mode that triggers the
hook (filestore, nfs, fsxLustre, and pvc with chownOnCreate). EFS users
on AWS never hit it because EFS Access Points handle UID/GID at the
file-system level so the hook is skipped. GCP Filestore and generic NFS
installs from a clean cluster currently cannot succeed without manual
intervention.
Fix: move the chown into a Deployment-level initContainer, gated on the
same conditional as the old hook. The initContainer runs after the PVC
is bound to the pod, so the timing is correct by construction. Trade-off
is a chown -R at every pod restart; this is a no-op after the first run
(ownership already matches) and has negligible cost compared to Platforma
startup.
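A rough sketch of what such an initContainer might look like in the Deployment template. This is not the chart's actual template: the `platforma.workspace.needsChown` helper, the busybox image, the `/workspace` mount path, the `1010:1010` owner, and the `workspace` volume name are all assumptions made for illustration.

```yaml
# templates/deployment.yaml (excerpt); illustrative sketch only.
# Hypothetical names: the "platforma.workspace.needsChown" helper (stands in
# for the conditional the old hook used), the busybox image, the /workspace
# mount path, the 1010:1010 owner, and the "workspace" volume name.
spec:
  template:
    spec:
      {{- if eq (include "platforma.workspace.needsChown" .) "true" }}
      initContainers:
        - name: chown-workspace
          image: busybox:1.36
          # Root is needed only for this init step so that chown is permitted.
          securityContext:
            runAsUser: 0
          command: ["sh", "-c", "chown -R 1010:1010 /workspace"]
          volumeMounts:
            # Same PVC-backed volume the main container mounts; the kubelet
            # will not start this container until the PVC is bound.
            - name: workspace
              mountPath: /workspace
      {{- end }}
```

Because the initContainer belongs to the pod that consumes the PVC, it does not depend on Helm hook ordering.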
Reproduction (before this fix):
  helm install platforma oci://ghcr.io/milaboratory/platforma-helm/platforma \
    --set environment=gcp \
    --set storage.workspace.filestore.enabled=true \
    --set storage.workspace.filestore.{instanceName,location,shareName,ip}=... \
    --set storage.main.type=gcs \
    ...
# -> hook Job "platforma-chown-workspace" stuck Pending indefinitely
# -> kubectl describe shows: "persistentvolumeclaim 'platforma-workspace'
# not found. not found"
# -> helm rollout timeout after 10min; release left in pending-install
Tested on:
- GCP GKE Standard, Filestore workspace: initContainer runs at pod start,
chown succeeds, Platforma becomes Ready.
- Review needed for FSx Lustre (same conditional path) — no AWS test cluster
available during this change.
…nary
PR #1700 renamed these flags in the chart but the Platforma binary at
the pinned appVersion 3.3.0 still expects the old names. Chart-rendered
manifests fail on startup with:
unknown flag `runner-max-cpu-request'
Invalid command line options
This commit reverts the chart to --k8s-max-{cpu,ram}-request so that
`helm install` against the 3.3.0 image succeeds.
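For illustration, the reverted container arguments might look roughly like the excerpt below. Only the two flag names come from this change; the `.Values.limits.*` keys and the `--flag=value` form are assumptions, not the chart's real values schema.

```yaml
# Container args (excerpt); illustrative sketch only.
# The .Values.limits.* keys are hypothetical placeholders.
args:
  # Old flag names, which the 3.3.0 binary still accepts.
  - "--k8s-max-cpu-request={{ .Values.limits.maxCpuRequest }}"
  - "--k8s-max-ram-request={{ .Values.limits.maxRamRequest }}"
```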
This is a coordination issue: chart and binary need to be versioned
together. Options for chart maintainers:
(a) keep this revert until a binary that supports --runner-max-*
lands, then bump appVersion AND rename flags in the chart in the
same commit.
(b) dual-support in the binary (accept both old and new names during
a migration window), then rename in the chart.
(c) release a new binary with --runner-max-* as a minor version and roll
forward.
Leaving the decision to maintainers; this patch unblocks 3.3.0 installs
in the meantime.
Summary
Two related chart-install bugs discovered while building the GCP installer, both affecting fresh installs:
1. Pre-install chown-workspace hook deadlocks on clean install (commit 1).
   The hook Job mounts the `platforma-workspace` PVC, but the PVC is rendered as a
   regular manifest, not a hook, so on a fresh install the hook runs BEFORE the
   PVC exists. The pod stays `Pending` forever, helm times out, and the release is
   left in `pending-install`. Affects every workspace mode that triggers the hook:
   `filestore`, `nfs`, `fsxLustre`, and `pvc` with `chownOnCreate=true`.
   EFS users on AWS don't hit it (EFS Access Points skip the hook).
   Fix: move chown into a Deployment initContainer gated on the same
   conditional. Timing is correct by construction: the PVC is already bound
   when the pod starts.
2. `--runner-max-*` flags break the 3.3.0 binary (commit 2).
   PR #1700 renamed `--k8s-max-{cpu,ram}-request` to `--runner-max-{cpu,ram}-request`
   in the chart, but the 3.3.0 binary (the pinned appVersion) still expects the
   old names. Chart-rendered manifests fail on startup with `unknown flag`.
   Fix: revert the chart to use `--k8s-max-*` so 3.3.0 installs work.
Why separate commits
Different concerns, independently reviewable. Commit 1 is a genuine refactor
(hook → initContainer); commit 2 is a coordination fix with chart/binary
versioning that maintainers may want to handle differently (see commit
message for options).
Test plan
- … workspace mode; confirmed install fails with PVC-not-found in hook.
- … pod start, chown completes, Platforma Ready.
- … flags and starts normally against 3.3.0 image.
- … as Filestore, review recommended.
Context
Discovered during the GCP Infrastructure Manager spike (separate PR TBD).
Without these fixes, the GCP installer (and any non-EFS AWS installer) cannot
succeed on a fresh cluster. Opening this separately so chart review can start
in parallel with the GCP infrastructure work.