fix: chart install failures on Filestore/NFS/FSx and --runner-max flag mismatch by PoslavskySV · Pull Request #70 · milaboratory/platforma-helm

PoslavskySV · 2026-04-24T18:43:18Z

Summary

Two related chart-install bugs discovered while building the GCP installer, both affecting fresh installs:

1. Pre-install chown-workspace hook deadlocks on clean install (commit 1).
   The hook Job mounts the `platforma-workspace` PVC, but the PVC is rendered as a
   regular manifest, not a hook, so on fresh install the hook runs BEFORE the
   PVC exists. The pod stays `Pending` forever, helm times out, and the release is
   left in `pending-install`. Affects every workspace mode that triggers the hook:
   `filestore`, `nfs`, `fsxLustre`, and `pvc` with `chownOnCreate=true`.
   EFS users on AWS don't hit it (EFS Access Points skip the hook).

   Fix: move chown into a Deployment initContainer gated on the same
   conditional. Timing is correct by construction: the PVC is already bound
   when the pod starts.

2. `--runner-max-*` flags break 3.3.0 binary (commit 2).
   PR #1700 renamed `--k8s-max-{cpu,ram}-request` to `--runner-max-{cpu,ram}-request`
   in the chart, but the 3.3.0 binary (the pinned appVersion) still expects the
   old names. Chart-rendered manifests fail on startup with `unknown flag`.

   Fix: revert chart to use `--k8s-max-*` so 3.3.0 installs work.

Why separate commits

Different concerns, independently reviewable. Commit 1 is a genuine refactor
(hook → initContainer); commit 2 is a coordination fix with chart/binary
versioning that maintainers may want to handle differently (see commit
message for options).

Test plan

- Reproduced the chown deadlock on GCP GKE Standard with Filestore
  workspace mode; confirmed install fails with PVC-not-found in hook.
- Applied commit 1; confirmed install succeeds, initContainer runs at
  pod start, chown completes, Platforma Ready.
- Applied commit 2; confirmed Platforma binary accepts `--k8s-max-*`
  flags and starts normally against 3.3.0 image.
- FSx Lustre path not tested (no AWS test cluster). It uses the same conditional
  as Filestore; review recommended.
- Generic NFS / PVC-with-chownOnCreate paths not tested. They use the same conditional.

Context

Discovered during the GCP Infrastructure Manager spike (separate PR TBD).
Without these fixes, the GCP installer (and any non-EFS AWS installer) cannot
succeed on a fresh cluster. Opening this separately so chart review can start
in parallel with the GCP infrastructure work.

…Container

The pre-install Job (hook-chown-workspace.yaml) mounted the workspace PVC to run chown, but the PVC is rendered as a regular manifest, not a pre-install hook, so on fresh install the hook Job fires BEFORE the PVC exists. The Job pod stays Pending forever, helm install times out, deployment never happens.

This is a latent bug that affects every workspace mode that triggers the hook (filestore, nfs, fsxLustre, and pvc with chownOnCreate). EFS users on AWS never hit it because EFS Access Points handle UID/GID at the file-system level so the hook is skipped. GCP Filestore and generic NFS installs from a clean cluster currently cannot succeed without manual intervention.

Fix: move the chown into a Deployment-level initContainer, gated on the same conditional as the old hook. The initContainer runs after the PVC is bound to the pod, so the timing is correct by construction.

Trade-off is a chown -R at every pod restart; this is a no-op after the first run (ownership already matches) and has negligible cost compared to Platforma startup.

Reproduction (before this fix):

    helm install platforma oci://ghcr.io/milaboratory/platforma-helm/platforma \
      --set environment=gcp \
      --set storage.workspace.filestore.enabled=true \
      --set storage.workspace.filestore.{instanceName,location,shareName,ip}=... \
      --set storage.main.type=gcs \
      ...
    # -> hook Job "platforma-chown-workspace" stuck Pending indefinitely
    # -> kubectl describe shows: "persistentvolumeclaim 'platforma-workspace'
    #    not found. not found"
    # -> helm rollout timeout after 10min; release left in pending-install

Tested on:
- GCP GKE Standard, Filestore workspace: initContainer runs at pod start, chown succeeds, Platforma becomes Ready.
- Review needed for FSx Lustre (same conditional path): no AWS test cluster available during this change.
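For reviewers who want a picture of the hook-to-initContainer move, here is a minimal sketch of what such a Deployment fragment could look like. It is not the chart's actual template. The image, the `1010:1010` owner, the `/data/workspace` mount path, and the container and volume names are hypothetical placeholders, and the `if` condition stands in for the chart's real conditional, which also covers the `nfs`, `fsxLustre`, and `pvc` with `chownOnCreate` modes.

```yaml
# Sketch only: fragment of a Deployment template, not the chart's real file.
# Image, UID/GID, mount path, and names below are hypothetical placeholders.
spec:
  template:
    spec:
      # Stand-in for the chart's real conditional (the same one the old hook used).
      {{- if .Values.storage.workspace.filestore.enabled }}
      initContainers:
        - name: chown-workspace
          image: busybox:1.36
          # Runs on every pod start; cheap once ownership already matches.
          command: ["sh", "-c", "chown -R 1010:1010 /data/workspace"]
          securityContext:
            runAsUser: 0          # changing ownership requires root
          volumeMounts:
            - name: workspace
              mountPath: /data/workspace
      {{- end }}
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: platforma-workspace
```

Because the kubelet does not start any container in a pod until its volumes are mounted, the initContainer cannot run before the PVC is bound, which is the ordering guarantee the pre-install hook lacked.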

…nary

PR #1700 renamed these flags in the chart but the Platforma binary at the pinned appVersion 3.3.0 still expects the old names. Chart-rendered manifests fail on startup with:

    unknown flag `runner-max-cpu-request'
    Invalid command line options

Reverting the chart to use --k8s-max-{cpu,ram}-request so `helm install` against the 3.3.0 image succeeds.

This is a coordination issue: chart and binary need to be versioned together. Options for chart maintainers:

(a) keep this revert until a binary that supports --runner-max-* lands, then bump appVersion AND rename flags in the chart in the same commit.
(b) dual-support in the binary (accept both old and new names during a migration window), then rename in the chart.
(c) release a new binary with --runner-max-* as a minor version and forward.

Leaving the decision to maintainers; this patch unblocks 3.3.0 installs in the meantime.
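As an illustration of the revert, the container arguments after this commit might look roughly like the fragment below. Only the two flag names come from this PR. The container name, the `.Values.runner.*` keys, and the `--flag=value` syntax are assumptions made for the sketch and should be checked against the chart and the 3.3.0 binary.

```yaml
# Sketch only: container args using the flag names the 3.3.0 binary accepts.
# The values keys and the "=value" form are hypothetical.
containers:
  - name: platforma
    args:
      - --k8s-max-cpu-request={{ .Values.runner.maxCpuRequest }}
      - --k8s-max-ram-request={{ .Values.runner.maxRamRequest }}
```

Under option (a), the switch to `--runner-max-cpu-request` and `--runner-max-ram-request` would land in the same commit that bumps `appVersion`, so the chart never renders flags the pinned binary rejects.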

PoslavskySV added 2 commits April 24, 2026 20:42

PoslavskySV wants to merge 2 commits into main from chart/install-fixes
