fix: chart install failures on Filestore/NFS/FSx and --runner-max flag mismatch #70

Open
PoslavskySV wants to merge 2 commits into main from chart/install-fixes

@PoslavskySV
Member

Summary

Two related chart-install bugs discovered while building the GCP installer, both affecting fresh installs:

  1. Pre-install chown-workspace hook deadlocks on clean install (commit 1).
    The hook Job mounts `platforma-workspace` PVC, but the PVC is rendered as a
    regular manifest, not a hook — so on fresh install the hook runs BEFORE the
    PVC exists. Pod stays `Pending` forever, helm times out, release left in
    `pending-install`. Affects every workspace mode that triggers the hook:
    `filestore`, `nfs`, `fsxLustre`, and `pvc` with `chownOnCreate=true`.
    EFS users on AWS don't hit it (EFS Access Points skip the hook).

    Fix: move chown into a Deployment initContainer gated on the same
    conditional. Timing is correct by construction — PVC is already bound
    when the pod starts.

  2. `--runner-max-*` flags break 3.3.0 binary (commit 2).
    PR #1700 renamed `--k8s-max-{cpu,ram}-request` to `--runner-max-{cpu,ram}-request`
    in the chart, but the 3.3.0 binary (the pinned appVersion) still expects the
    old names. Chart-rendered manifests fail on startup with `unknown flag`.

    Fix: revert chart to use `--k8s-max-*` so 3.3.0 installs work.

Why separate commits

Different concerns, independently reviewable. Commit 1 is a genuine refactor
(hook → initContainer); commit 2 is a coordination fix with chart/binary
versioning that maintainers may want to handle differently (see commit
message for options).

Test plan

  • Reproduced the chown deadlock on GCP GKE Standard with Filestore
    workspace mode; confirmed install fails with PVC-not-found in hook.
  • Applied commit 1; confirmed install succeeds, initContainer runs at
    pod start, chown completes, Platforma Ready.
  • Applied commit 2; confirmed Platforma binary accepts `--k8s-max-*`
    flags and starts normally against 3.3.0 image.
  • FSx Lustre path not tested (no AWS test cluster) — same conditional
    as Filestore, review recommended.
  • Generic NFS / PVC-with-chownOnCreate paths not tested — same conditional.

Context

Discovered during the GCP Infrastructure Manager spike (separate PR TBD).
Without these fixes, the GCP installer (and any non-EFS AWS installer) cannot
succeed on a fresh cluster. Opening this separately so chart review can start
in parallel with the GCP infrastructure work.

…Container

The pre-install Job (hook-chown-workspace.yaml) mounted the workspace
PVC to run chown, but the PVC is rendered as a regular manifest — not
a pre-install hook — so on fresh install the hook Job fires BEFORE the
PVC exists. The Job pod stays Pending forever, helm install times out,
deployment never happens.

This is a latent bug that affects every workspace mode that triggers the
hook (filestore, nfs, fsxLustre, and pvc with chownOnCreate). EFS users
on AWS never hit it because EFS Access Points handle UID/GID at the
file-system level so the hook is skipped. GCP Filestore and generic NFS
installs from a clean cluster currently cannot succeed without manual
intervention.
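
The ordering problem can be sketched as a rendered manifest. This is an illustration, not the chart's actual hook-chown-workspace.yaml: the image, UID:GID, and mount path are assumptions, while the Job name and PVC name come from the reproduction output. Helm runs `pre-install` hooks before it creates any regular manifests, so the claim referenced at the bottom does not exist when the Job pod is scheduled.

```yaml
# Illustrative sketch only; not the chart's real template.
apiVersion: batch/v1
kind: Job
metadata:
  name: platforma-chown-workspace
  annotations:
    "helm.sh/hook": pre-install   # runs before regular manifests are created
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: chown
          image: busybox            # assumed image
          # assumed UID:GID and mount path
          command: ["chown", "-R", "1000:1000", "/workspace"]
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          persistentVolumeClaim:
            # Rendered as a regular manifest, so it is missing at
            # pre-install time and the pod stays Pending.
            claimName: platforma-workspace
```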

Fix: move the chown into a Deployment-level initContainer, gated on the
same conditional as the old hook. The initContainer runs after the PVC
is bound to the pod, so the timing is correct by construction. Trade-off
is a chown -R at every pod restart; this is a no-op after the first run
(ownership already matches) and has negligible cost compared to Platforma
startup.
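
A minimal sketch of the initContainer approach, assuming a Deployment named `platforma`, a `busybox` image, a `1000:1000` owner, and a `/workspace` mount path (all hypothetical; only the PVC name `platforma-workspace` comes from this change). The kubelet mounts the pod's volumes before it starts any init container, so the chown cannot run until the claim exists and is bound.

```yaml
# Illustrative sketch only; values marked "assumed" are not from the chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: platforma                   # assumed
spec:
  selector:
    matchLabels:
      app: platforma                # assumed label
  template:
    metadata:
      labels:
        app: platforma
    spec:
      initContainers:
        - name: chown-workspace
          image: busybox            # assumed image
          # assumed UID:GID and mount path
          command: ["chown", "-R", "1000:1000", "/workspace"]
          securityContext:
            runAsUser: 0            # chown of foreign files requires root
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      containers:
        - name: platforma
          image: platforma:3.3.0    # placeholder image reference
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: platforma-workspace
```

In the chart itself the `initContainers` entry would sit behind the same conditional that previously guarded the hook template.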

Reproduction (before this fix):
  helm install platforma oci://ghcr.io/milaboratory/platforma-helm/platforma \
    --set environment=gcp \
    --set storage.workspace.filestore.enabled=true \
    --set storage.workspace.filestore.{instanceName,location,shareName,ip}=... \
    --set storage.main.type=gcs \
    ...
  # -> hook Job "platforma-chown-workspace" stuck Pending indefinitely
  # -> kubectl describe shows: "persistentvolumeclaim 'platforma-workspace'
  #    not found"
  # -> helm rollout timeout after 10min; release left in pending-install

Tested on:
  - GCP GKE Standard, Filestore workspace: initContainer runs at pod start,
    chown succeeds, Platforma becomes Ready.
  - Review needed for FSx Lustre (same conditional path) — no AWS test cluster
    available during this change.
…nary

PR #1700 renamed these flags in the chart but the Platforma binary at
the pinned appVersion 3.3.0 still expects the old names. Chart-rendered
manifests fail on startup with:

    unknown flag `runner-max-cpu-request'
    Invalid command line options

Reverting the chart to use --k8s-max-{cpu,ram}-request so `helm install`
against the 3.3.0 image succeeds.
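
As a rough illustration, the reverted container arguments might render like the fragment below. The container name, the `flag=value` form, and the `<cpu>` / `<ram>` placeholders are assumptions; only the flag names come from this change.

```yaml
# Illustrative fragment of the rendered Deployment; values are placeholders.
containers:
  - name: platforma                       # assumed container name
    args:
      - "--k8s-max-cpu-request=<cpu>"     # accepted by the 3.3.0 binary
      - "--k8s-max-ram-request=<ram>"     # accepted by the 3.3.0 binary
      # "--runner-max-cpu-request" and "--runner-max-ram-request" are
      # rejected by 3.3.0 with "unknown flag".
```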

This is a coordination issue: chart and binary need to be versioned
together. Options for chart maintainers:
  (a) keep this revert until a binary that supports --runner-max-*
      lands, then bump appVersion AND rename flags in the chart in the
      same commit.
  (b) dual-support in the binary (accept both old and new names during
      a migration window), then rename in the chart.
  (c) release a new binary with --runner-max-* as a minor version and
      roll forward.
Leaving the decision to maintainers; this patch unblocks 3.3.0 installs
in the meantime.