Summary
Today the operator orchestrates package execution as raw Pods (createPodFromPackage in operator/internal/controller/skyhook_controller.go, watched by the pseudo-controller in pod_controller.go). Switch to batch/v1 Jobs, with a 1-completion 1-parallelism Job per (skyhook, package, stage, node). The Job controller absorbs lifecycle concerns we currently hand-roll (cleanup, completion tracking, retry semantics, pod history retention) and unlocks capabilities the raw-Pod model can't express cleanly (podFailurePolicy, suspend, kubectl logs job/<name>).
This is an architectural change but bounded: no CRD/schema change, control-plane swap only.
Why Jobs
UX wins
- kubectl logs job/<name> works across pod restarts and recreations. Today, debugging a failed stage requires identifying the right per-stage Pod name before it's GC'd; with Jobs, the Job name is stable and kubectl logs resolves to the most recent (or all) child Pods.
- kubectl describe job gives reviewers a built-in completion/failure summary including the active/succeeded/failed pod count and the conditions timeline. We currently surface the same info via custom annotations + per-package CR fields.
- Failed Pods linger up to backoffLimit instead of being cleaned up, so a developer or support engineer arriving 30 minutes after a failure still has the logs.
- Reduced custom code: ValidateRunningPackages, the manual delete logic at skyhook_controller.go:~2365, and parts of pod_controller.go get simpler or disappear.
Capabilities unlocked
| Capability | Jobs feature | Outcome |
| --- | --- | --- |
| Don't retry on ImagePullBackOff / ErrImagePull | podFailurePolicy (k8s 1.31+ stable); needs verification, see note below | Potentially fixes a known operator gap: when a package's container image is missing or unpullable, the package pod retries indefinitely instead of surfacing as state: erroring. Today this needs a custom event-watch on package pods; with podFailurePolicy the rule is declarative. |
| Auto-cleanup of completed Jobs | ttlSecondsAfterFinished | Stop carrying a "delete the package pod when stage is done" branch. |
| Bounded retries with operator override | backoffLimit: 0 | Operator continues to drive retry decisions; Jobs just record what happened. |
| OOM-handled retry semantics | restartPolicy: OnFailure inside the JobTemplate | Same primitive as today, but Jobs surface the retry count cleanly in .status.failed. Adjacent to a known prior failure mode where a package pod that OOMed was not always rescheduled. Caveat: podFailurePolicy requires restartPolicy: Never, so this row and the podFailurePolicy row cannot both apply to the same Job. |
| Per-stage hard timeout (new capability we don't have today) | activeDeadlineSeconds | Wall-clock guard. Today a hung package pod runs indefinitely; with Jobs we can declare "this stage must finish in N seconds or it's Failed and the operator surfaces it as state: erroring." |
| Avoid zombie-old + new-pod overlap on retry | podReplacementPolicy: Failed (alpha in k8s 1.28, beta in 1.29, stable in 1.34) | Replacement pod is only created after the failed pod has fully terminated. Particularly relevant for our hostPath mounts at /skyhook-package and the host root mount, where two pods racing for the same host paths is asking for trouble. |
| Cleaner OOMKilled status | .status.failed + OOM reason in Job conditions (needs verification: Job conditions usually carry reasons such as BackoffLimitExceeded, while OOMKilled stays on the child Pod's container status) | OOM detection today requires reading container lastState.terminated.reason. With Jobs, the failure surface is structured. |
| Ecosystem alignment | First-class in k9s, ArgoCD health, Kueue, Tekton; kubectl top pod can be filtered per Job via the job-name label | Run-to-completion work is the standard Jobs use case; tooling treats it as a first-class concept. Today our raw-Pod orchestration is second-class in most dashboards' "is this succeeding?" views. |
Note on podFailurePolicy for ImagePullBackOff: The policy matches on OnExitCodes (containers that ran) and OnPodConditions (pod-level conditions like DisruptionTarget). ImagePullBackOff manifests as a container status reason on a pod that never reached Running, so whether podFailurePolicy actually catches it cleanly needs verification before we rely on it as the fix. If it doesn't, the operator-side approach — watch package-pod events for Failed/BackOff reasons and mark the node state: erroring — is still the path, regardless of whether the underlying resource is a Pod or a Job.
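If the operator-side path turns out to be the one we need, the check itself is small. Below is a minimal sketch that inspects container waiting reasons rather than events (a swapped-in variant of the event-watch idea, not the existing operator code). ContainerStatus is a hypothetical stand-in for the few corev1.ContainerStatus fields involved, and imagePullFailure is an illustrative name.

```go
package main

import "fmt"

// ContainerStatus is a hypothetical stand-in for the fields of
// corev1.ContainerStatus this sketch needs; it is not the real API type.
type ContainerStatus struct {
	Name          string
	WaitingReason string // mirrors state.waiting.reason; empty when not waiting
}

// imagePullFailure returns the name of the first container that is stuck
// waiting on an image pull error, and whether one was found.
func imagePullFailure(statuses []ContainerStatus) (string, bool) {
	for _, s := range statuses {
		switch s.WaitingReason {
		case "ImagePullBackOff", "ErrImagePull":
			return s.Name, true
		}
	}
	return "", false
}

func main() {
	statuses := []ContainerStatus{
		{Name: "init"},
		{Name: "package", WaitingReason: "ImagePullBackOff"},
	}
	if name, stuck := imagePullFailure(statuses); stuck {
		fmt.Printf("container %q cannot pull its image; mark node state: erroring\n", name)
	}
}
```

The same function works whether the Pod was created directly or by a Job, which matches the point above that this fix is independent of the underlying resource.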
On spec.suspend: considered and rejected as a pause/resume primitive. Suspending a Job with a running pod terminates the pod (SIGTERM) and recreates from scratch on resume — no checkpointing. The existing skyhook.nvidia.com/pause annotation continues to be the right pause mechanism (block new stage scheduling, let in-flight stages finish naturally).
Proposed shape
- One Job per (skyhook, package, stage, node) with parallelism: 1, completions: 1, backoffLimit: 0, ttlSecondsAfterFinished: 86400. Same lifecycle granularity as today's pods.
- PodTemplate carries the NodeName: <node> we set today, plus tolerations/init container/main container; essentially the body of createPodFromPackage lifted into a JobTemplate.
- backoffLimit: 0 means a failed Pod fails the Job and the operator decides if/when to recreate (matches today's behavior; we don't want the Job controller silently retrying twice while reconcile is computing what to do next).
- TTL of 24h on success means logs stick around for support without growing forever; failed Jobs stay until the operator (or human) deletes them. Because ttlSecondsAfterFinished applies to any finished Job, Complete or Failed, keeping failed Jobs past 24h means the operator has to clear or raise the TTL when it sees a failure, or set the TTL only once a Job has succeeded.
- Watch shape: register a Jobs watch in addition to (or replacing) the current Pods watch in cluster_state_v2.go's setup. Child-Pod-level events still inform node state transitions for granular UX, but the Job is the unit of completion truth.
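To make the proposed settings concrete, here is a rough sketch of the values createJobFromPackage would set. JobSpec is a hypothetical flattened stand-in that carries only the fields named above; the real operator would use the k8s.io/api batch/v1 types, where nodeName lives under spec.template.spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JobSpec is a hypothetical stand-in with only the fields named in the
// proposal; it is not the batch/v1 type.
type JobSpec struct {
	Parallelism             int32  `json:"parallelism"`
	Completions             int32  `json:"completions"`
	BackoffLimit            int32  `json:"backoffLimit"`
	TTLSecondsAfterFinished int32  `json:"ttlSecondsAfterFinished"`
	NodeName                string `json:"nodeName"` // spec.template.spec.nodeName in a real Job
}

// newStageJobSpec returns the settings for one (skyhook, package, stage, node) Job.
func newStageJobSpec(node string) JobSpec {
	return JobSpec{
		Parallelism:             1,
		Completions:             1,
		BackoffLimit:            0,     // the operator, not the Job controller, decides on retries
		TTLSecondsAfterFinished: 86400, // 24h
		NodeName:                node,
	}
}

func main() {
	out, err := json.MarshalIndent(newStageJobSpec("worker-7"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```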
Pod controller: keep the pseudo-pattern
pod_controller.go is structured as a pseudo-controller today — events from the Pod watch are mapped into the main SkyhookReconciler queue rather than driving a separate reconciler. The comment in the file is honest about the reason: race conditions. The SkyhookReconciler reconcile loop watches three kinds (Skyhook, Node, Pod/Job) and starts by grabbing the world — reading all relevant SCRs, nodes, and child resources before computing desired state and writing back. If a separate JobReconciler also did read-modify-write against SCR status / node annotations, the two reconcilers would step on each other regardless of per-controller concurrency limits (MaxConcurrentReconciles: 1 only serializes one controller's workqueue, not across controllers). Routing everything through one SCR-keyed queue is what gives us global serialization for that SCR.
Concretely:
| Option | Shape | Reality |
| --- | --- | --- |
| A. Pseudo-pattern (current) | Job watch → mapper → SkyhookReconciler queue | Preserves single-key serialization; the pattern that already works. |
| B. Real JobReconciler whose Reconcile only enqueues SCR keys | Watch + map, no SCR-state writes | Same race property as A; just spreads the same logic across two files. Marginal cleanup at best. |
| C. Real JobReconciler with cross-controller serialization | Per-SCR mutex / leader-style coordination | Real separation of concerns, but adds a synchronization primitive that has to be designed, reasoned about, and tested. |
Recommendation: keep the pseudo-pattern (A) for this migration. The split-controller question is real but not made any easier by switching from Pods to Jobs — it's its own design pass that should land separately if/when desired.
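For readers less familiar with the pseudo-pattern, option A boils down to a mapper plus a de-duplicating queue. The sketch below is a toy model under stated assumptions: Queue stands in for the controller workqueue, mapToSkyhookKey is an illustrative name, and keying by the skyhook.nvidia.com/name label alone is an assumption about how the mapper finds the owning SCR.

```go
package main

import "fmt"

const nameLabel = "skyhook.nvidia.com/name" // label already used on package pods

// Queue is a toy stand-in for a controller workqueue: each key is held at
// most once while it waits to be reconciled.
type Queue struct {
	pending map[string]bool
	order   []string
}

func NewQueue() *Queue { return &Queue{pending: map[string]bool{}} }

// Add enqueues key unless it is already pending.
func (q *Queue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// mapToSkyhookKey turns a Job or Pod event into the key of the owning SCR.
func mapToSkyhookKey(labels map[string]string) (string, bool) {
	name, ok := labels[nameLabel]
	return name, ok && name != ""
}

func main() {
	q := NewQueue()
	events := []map[string]string{
		{nameLabel: "gpu-init", "skyhook.nvidia.com/node": "worker-7"},
		{nameLabel: "gpu-init", "skyhook.nvidia.com/node": "worker-8"},
		{"app": "unrelated"},
	}
	for _, labels := range events {
		if key, ok := mapToSkyhookKey(labels); ok {
			q.Add(key)
		}
	}
	fmt.Println(q.order) // [gpu-init]
}
```

Two Job events for the same SCR collapse into one queued reconcile, which is the serialization property option A preserves.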
Linking related Jobs together
A node going through one full rollout creates many Jobs (uninstall → apply → config → interrupt → post-interrupt, one per package). Reviewers will want to see "these Jobs are all part of one operation." K8s has no native "linked Jobs" primitive (JobSet exists but is built for coordinated parallel ML training — wrong shape for sequential per-stage lifecycle). The practical pattern is owner references + labels:
- OwnerReferences on every Job point to the Skyhook CR (same as Pods today). Cascade delete + provenance lookup come for free.
- Labels for query-ability:
  - skyhook.nvidia.com/name=<scr> (already used on pods, keep on Jobs)
  - skyhook.nvidia.com/package=<package> (already used, keep)
  - skyhook.nvidia.com/node=<node> (already used, keep)
  - skyhook.nvidia.com/stage=<stage> (new)
  - skyhook.nvidia.com/resource-id=<SKYHOOK_RESOURCE_ID> (promoted from agent env to Job label). Already a per-package-config unique ID; using it as a label ties every stage Job for one rollout of one package on one node into a single queryable group.
Resulting UX:
# all Jobs in one SCR
kubectl get jobs -l skyhook.nvidia.com/name=gpu-init
# one rollout on one node — uninstall, apply, config, interrupt, post-interrupt
kubectl get jobs -l skyhook.nvidia.com/resource-id=<id>
# every Job touching a single node across SCRs
kubectl get jobs -l skyhook.nvidia.com/node=worker-7
kubectl skyhook node status already renders a friendlier rolled-up view of the same data and continues to work — it reads from node annotations, which the operator updates from Job conditions instead of Pod conditions, but the rendered output is unchanged.
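The label scheme above could be produced by one helper and queried with plain equality selectors. This is an illustrative sketch only: jobLabels and selector are hypothetical names, and the sample values ("driver", "abc123") are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const prefix = "skyhook.nvidia.com/"

// jobLabels builds the proposed label set for one stage Job.
func jobLabels(scr, pkg, node, stage, resourceID string) map[string]string {
	return map[string]string{
		prefix + "name":        scr,
		prefix + "package":     pkg,
		prefix + "node":        node,
		prefix + "stage":       stage,
		prefix + "resource-id": resourceID,
	}
}

// selector renders an equality-based label selector with keys in sorted
// order, so the output is deterministic.
func selector(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	all := jobLabels("gpu-init", "driver", "worker-7", "apply", "abc123")
	query := map[string]string{
		prefix + "name": all[prefix+"name"],
		prefix + "node": all[prefix+"node"],
	}
	fmt.Println("kubectl get jobs -l " + selector(query))
}
```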
Files affected (high level)
- operator/internal/controller/skyhook_controller.go: createPodFromPackage → createJobFromPackage; ValidateRunningPackages and other delete-pod call sites updated to operate on Jobs (with cascading delete pulling child Pods).
- operator/internal/controller/pod_controller.go: pseudo-controller becomes a Job-aware reconciler (or a thin wrapper that maps both Job-level and Pod-level events into node-state updates).
- operator/internal/controller/cluster_state_v2.go: package-pod lookups → Job lookups (Job → child Pods through controller-uid label as needed).
- operator/internal/wrapper/{skyhook,node}.go: places that look up pod-by-stage.
- operator/api/v1alpha1/zz.migration.X.Y.Z.go: new one-shot migration shim that handles in-flight raw Pods present at upgrade time (delete or let-finish + recreate-as-Job on next reconcile).
- RBAC: add batch/jobs verbs (get;list;watch;create;update;patch;delete) to operator ClusterRole; can drop or keep pod RBAC depending on whether we still want direct pod ops for cordon/drain edge cases.
- Tests across the above (Ginkgo specs).
- docs/designs/<date>-package-execution-as-jobs.md: design doc capturing the choices above so reviewers see the rationale before the PR; include in the same series of PRs.
Migration
Operator upgrade lands and finds pre-existing raw Pods from the prior version. Two viable paths:
1. Let in-flight Pods finish, only create Jobs for new stages. Lower risk, but the operator has to keep the pod-level reconcile path alive for one minor release.
2. Proactively delete in-flight Pods, recreate as Jobs. Simpler code (no dual path) but interrupts an in-flight rollout; reasonable if interruptionBudget is honored on recreate.

Recommend (1) for the version that introduces Jobs, with a follow-up minor release that drops the dual path. The migration shim lives in zz.migration.<version>.go per the project's convention; it's a labeled ConfigMap watch / boot-time scan, not a long-running compatibility shim.
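Path (1) amounts to a small per-stage decision during the dual-path release. The following is a sketch under stated assumptions, not the shim itself: StageView, Action, and decide are hypothetical names, and recreating a failed legacy Pod as a Job is one possible retry policy that the operator's existing logic may override.

```go
package main

import "fmt"

// Action is what the reconciler does for one stage during the dual-path release.
type Action string

const (
	WaitForLegacyPod Action = "wait-for-legacy-pod"
	CreateJob        Action = "create-job"
	NoOp             Action = "no-op"
)

// StageView is a hypothetical summary of what exists for one
// (skyhook, package, stage, node) at reconcile time.
type StageView struct {
	LegacyPodPhase string // "" when no raw Pod exists, else the Pod phase
	JobExists      bool
}

// decide never interrupts an in-flight raw Pod and sends all new work to Jobs.
func decide(v StageView) Action {
	switch {
	case v.JobExists:
		return NoOp // the Job path already owns this stage
	case v.LegacyPodPhase == "Pending", v.LegacyPodPhase == "Running":
		return WaitForLegacyPod
	case v.LegacyPodPhase == "Succeeded":
		return NoOp // stage finished on the legacy path
	default:
		return CreateJob // no Pod, or a failed one (assumed retry policy)
	}
}

func main() {
	for _, v := range []StageView{
		{LegacyPodPhase: "Running"},
		{LegacyPodPhase: ""},
		{LegacyPodPhase: "Failed"},
		{JobExists: true},
	} {
		fmt.Printf("%+v -> %s\n", v, decide(v))
	}
}
```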
Acceptance criteria
- All package execution goes through batch/v1 Jobs on a fresh install. kubectl get jobs -n skyhook shows one Job per (SCR, package, stage, node) currently in flight.
- kubectl logs job/<job-name> returns the running stage's logs without the user needing to know the Pod name.
- A package whose image is unpullable (ImagePullBackOff / ErrImagePull) does not retry indefinitely. The Job goes to Failed and the operator surfaces it as state: erroring. Path: either a declarative podFailurePolicy rule or an operator-side event watch, whichever works cleanly in testing.
- Existing chainsaw e2e (make e2e-tests) and helm e2e (make helm-tests) suites are green; in particular the cordon/drain/interrupt sequencing still works.
- Successful Jobs auto-clean after ttlSecondsAfterFinished; failed Jobs stick until operator (or user) intervention. Since the TTL otherwise applies to failed Jobs too, the operator has to clear or raise it on failure.
- An upgrade from a prior raw-Pod version to this Jobs version completes without leaving orphaned package Pods (the migration shim is exercised in a chainsaw upgrade test).
- New design doc under docs/designs/.
Implications for other tickets
- Possibly addresses the ImagePullBackOff retry gap: today, when a package's image is unpullable, the pod retries indefinitely instead of being surfaced as a hard failure. podFailurePolicy might handle this declaratively; needs verification of whether ImagePullBackOff manifests as a matchable Pod condition or container status. If yes, the gap closes here; if not, the operator-side event-watch fix remains its own follow-up.
- Likely supersedes the "kubectl-skyhook debug collect" idea: collecting CR yaml + node annotations + pod logs + events into a tarball is much cheaper once it's kubectl logs job/<n> + kubectl describe job/<n> + kubectl get events. Worth deferring any custom debug-bundle work until/unless a real pain pattern surfaces post-Jobs.
Out of scope
- CRD changes. Stage / state / package model is unchanged.
- Replacing the cordon/drain code path. Cordon/drain remains node-level orchestration owned by the operator; Jobs only host the per-stage execution.
- Job-level parallelism (running the same package stage on multiple nodes from one Job). Sticking with one Job per (skyhook, package, stage, node) keeps lifecycle reasoning unchanged.
- Replacing the Pod controller entirely. A thin Pod-level watch may still be useful for granular state transitions during long-running stages.
- DeploymentPolicy logic is not affected. DeploymentPolicy controls which nodes are eligible to run and how many concurrently (compartments, budgets, fixed/linear/exponential strategies), and includes substantial cross-node failure handling: failureThreshold (consecutive failures before halting), safetyLimit, batchThreshold, strategy-specific slowdown on failure, and status fields like ConsecutiveFailures, FailedNodes, ShouldStop. K8s has no primitive that replaces this role (PDB is the closest but solves disruption, not rollout shaping). The reconcile loop that consults DeploymentPolicy to pick the next eligible node and to decide whether to halt continues unchanged.

  The Jobs migration does clean up the signal source that feeds DeploymentPolicy: today the "node failed" signal is computed in pod_controller.go from pod phase + container statuses + restartCount; after the migration it's the Job's Failed=True condition, a single source of truth that is easier to reason about. DeploymentPolicy still owns what to do with that signal across the rollout.
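As a rough illustration of that single signal source, the node state can be derived from Job conditions alone. Condition is a stand-in for the type/status pair of a batch/v1 JobCondition, and every state name other than erroring ("complete", "in_progress") is a hypothetical placeholder rather than the operator's actual vocabulary.

```go
package main

import "fmt"

// Condition mirrors the type/status pair of a batch/v1 JobCondition;
// it is a stand-in, not the real API type.
type Condition struct {
	Type   string // e.g. "Complete", "Failed"
	Status string // "True", "False", or "Unknown"
}

// nodeStateFromJob derives the per-stage state from Job conditions only.
// "complete" and "in_progress" are placeholder names.
func nodeStateFromJob(conds []Condition) string {
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Failed":
			return "erroring"
		case "Complete":
			return "complete"
		}
	}
	return "in_progress"
}

func main() {
	fmt.Println(nodeStateFromJob([]Condition{{Type: "Failed", Status: "True"}}))
	fmt.Println(nodeStateFromJob([]Condition{{Type: "Complete", Status: "True"}}))
	fmt.Println(nodeStateFromJob(nil))
}
```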
References
- podFailurePolicy (candidate for handling the ImagePullBackOff case declaratively, pending the verification noted above): https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
- operator/internal/controller/pod_controller.go
- operator/internal/controller/skyhook_controller.go (createPodFromPackage)