Migrate package execution from raw Pods to Kubernetes Jobs #223


## Summary

Today the operator orchestrates package execution as raw Pods (`createPodFromPackage` in `operator/internal/controller/skyhook_controller.go`, watched by the pseudo-controller in `pod_controller.go`). Switch to `batch/v1` Jobs, with one Job (`completions: 1`, `parallelism: 1`) per (skyhook, package, stage, node). The Job controller absorbs lifecycle concerns we currently hand-roll (cleanup, completion tracking, retry semantics, pod history retention) and unlocks capabilities the raw-Pod model can't express cleanly (`podFailurePolicy`, `activeDeadlineSeconds`, `kubectl logs job/<name>`).

This is an architectural change but bounded: no CRD/schema change, control-plane swap only.

## Why Jobs

### UX wins

- **`kubectl logs job/<name>`** works across pod restarts and recreations. Today, debugging a failed stage requires identifying the right per-stage Pod name before it's GC'd; with Jobs, the Job name is stable and `kubectl logs job/<name>` resolves to one of the Job's child Pods (`kubectl logs -l job-name=<name>` selects all of them).
- **`kubectl describe job`** gives reviewers a built-in completion/failure summary including the active/succeeded/failed pod count and the conditions timeline. We currently surface the same info via custom annotations + per-package CR fields.
- Failed Pods are retained alongside their Job (until the Job is deleted or its TTL expires) instead of being cleaned up immediately, so a developer or support engineer arriving 30 minutes after a failure still has the logs.
- **Reduced custom code**: `ValidateRunningPackages`, the manual delete logic at `skyhook_controller.go:~2365`, and parts of `pod_controller.go` get simpler or disappear.

### Capabilities unlocked

| Capability | Jobs feature | Outcome |
|---|---|---|
| Don't retry on `ImagePullBackOff` / `ErrImagePull` | `podFailurePolicy` (k8s 1.31+ stable) — *needs verification, see note below* | Potentially fixes a known operator gap: when a package's container image is missing or unpullable, the package pod retries indefinitely instead of surfacing as `state: erroring`. Today this needs a custom event-watch on package pods; with `podFailurePolicy` the rule is declarative. |
| Auto-cleanup of finished Jobs | `ttlSecondsAfterFinished` | Stop carrying a "delete the package pod when stage is done" branch. Note that the TTL applies to any finished Job, failed as well as complete. |
| Bounded retries with operator override | `backoffLimit: 0` | Operator continues to drive retry decisions; Jobs just record what happened. |
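| OOM-handled retry semantics | `restartPolicy: OnFailure` inside the JobTemplate | Same primitive as today. In-place container restarts show up in the Pod's `restartCount` and count toward `backoffLimit`, while `.status.failed` counts Pods that reached `Failed`, so the interaction with `backoffLimit: 0` needs verification. Adjacent to a known prior failure mode where a package pod that OOMed was not always rescheduled. |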
| Per-stage hard timeout (new capability we don't have today) | `activeDeadlineSeconds` | Wall-clock guard. Today a hung package pod runs indefinitely; with Jobs we can declare "this stage must finish in N seconds or it's Failed and the operator surfaces it as `state: erroring`." |
| Avoid zombie-old + new-pod overlap on retry | `podReplacementPolicy: Failed` (alpha in k8s 1.28, beta and enabled by default since 1.29) | Replacement pod is only created after the failed pod has fully terminated. Particularly relevant for our hostPath mounts at `/skyhook-package` and the host root mount, where two pods racing for the same host paths is asking for trouble. |
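| Cleaner OOMKilled status | `.status.failed` + the Job's `Failed` condition | OOM detection today requires reading container `lastState.terminated.reason`. With Jobs, the fact of failure is structured (`Failed` condition with a reason such as `BackoffLimitExceeded`); whether the OOM cause itself is visible at the Job level, or still has to be read from the child Pod's container status, needs verification. |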
| Ecosystem alignment | First-class in k9s, ArgoCD health, Kueue, Tekton; per-Job resource usage via `kubectl top pod -l job-name=<name>` | Run-to-completion work is the standard Jobs use case; tooling treats it as a first-class concept. Today our raw-Pod orchestration is second-class in most dashboards' "is this succeeding?" views. |

**Note on `podFailurePolicy` for ImagePullBackOff:** The policy matches on `onExitCodes` (containers that ran and exited) and `onPodConditions` (pod-level conditions like `DisruptionTarget`). ImagePullBackOff manifests as a *container* status reason on a pod that never reached Running, so whether `podFailurePolicy` actually catches it cleanly needs verification before we rely on it as the fix. Also note that `podFailurePolicy` requires the Pod template to use `restartPolicy: Never`, so it cannot be combined with the `restartPolicy: OnFailure` row above on the same Job; one of the two has to give. If the policy doesn't catch the case, the operator-side approach (watch package-pod events for `Failed`/`BackOff` reasons and mark the node `state: erroring`) is still the path, regardless of whether the underlying resource is a Pod or a Job.
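The operator-side fallback can be sketched as a small predicate. This is only an illustration: it checks container waiting reasons rather than events (a variant of the event-watch described above), and `ContainerWaiting` and `imagePullFailure` are hypothetical stand-ins for the real `corev1` container status types, not existing operator code.

```go
package main

import "fmt"

// ContainerWaiting is a hypothetical stand-in for the waiting state of one
// container (in the real API: pod.Status.ContainerStatuses[i].State.Waiting).
type ContainerWaiting struct {
	Name   string
	Reason string // empty when the container is not waiting
}

// imagePullFailure returns the name of the first container that is stuck
// pulling its image, and whether such a container was found.
func imagePullFailure(statuses []ContainerWaiting) (string, bool) {
	for _, s := range statuses {
		switch s.Reason {
		case "ImagePullBackOff", "ErrImagePull":
			return s.Name, true
		}
	}
	return "", false
}

func main() {
	statuses := []ContainerWaiting{
		{Name: "init", Reason: ""},
		{Name: "package", Reason: "ImagePullBackOff"},
	}
	if name, stuck := imagePullFailure(statuses); stuck {
		fmt.Printf("container %q cannot pull its image; surface as state: erroring\n", name)
	}
}
```

The same predicate applies whether the Pod was created directly or by a Job, which is why this fallback is independent of the migration.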

**On `spec.suspend`:** considered and rejected as a pause/resume primitive. Suspending a Job with a running pod terminates the pod (SIGTERM) and recreates from scratch on resume — no checkpointing. The existing `skyhook.nvidia.com/pause` annotation continues to be the right pause mechanism (block new stage scheduling, let in-flight stages finish naturally).

## Proposed shape

- **One Job per (skyhook, package, stage, node)** with `parallelism: 1, completions: 1, backoffLimit: 0, ttlSecondsAfterFinished: 86400`. Same lifecycle granularity as today's pods.
- **PodTemplate** carries the `NodeName: <node>` we set today, plus tolerations/init container/main container — essentially the body of `createPodFromPackage` lifted into a JobTemplate.
- **`backoffLimit: 0`** means a failed Pod fails the Job and the operator decides if/when to recreate (matches today's behavior; we don't want the Job controller silently retrying twice while reconcile is computing what to do next).
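- **TTL of 24h** means logs stick around for support without growing forever. Because `ttlSecondsAfterFinished` applies to any finished Job, failed or complete, keeping failed Jobs until the operator (or a human) deletes them requires the operator to clear or omit the TTL on failure (for example, by setting it only once a Job has completed).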
- **Watch shape**: register a Jobs watch in addition to (or replacing) the current Pods watch in `cluster_state_v2.go`'s setup. Child-Pod-level events still inform node state transitions for granular UX, but the Job is the unit of completion truth.
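As a rough illustration of the shape above, the sketch below builds the Job manifest as a plain map and prints it as JSON. It is not the operator's code: the real implementation would presumably use the typed `batch/v1` API, `packageJob` and the example names are hypothetical, tolerations and the init container are omitted, and `restartPolicy` is left as a parameter because the choice between `Never` and `OnFailure` is still open.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// packageJob returns a batch/v1 Job manifest for one (skyhook, package,
// stage, node) tuple. All argument values are supplied by the caller; the
// function name and naming scheme are illustrative assumptions.
func packageJob(name, namespace, node, image, restartPolicy string) map[string]any {
	return map[string]any{
		"apiVersion": "batch/v1",
		"kind":       "Job",
		"metadata":   map[string]any{"name": name, "namespace": namespace},
		"spec": map[string]any{
			"parallelism":             1,
			"completions":             1,
			"backoffLimit":            0,     // operator drives retries, not the Job controller
			"ttlSecondsAfterFinished": 86400, // 24h
			"template": map[string]any{
				"spec": map[string]any{
					"nodeName":      node, // pin to the target node, as the raw Pods do today
					"restartPolicy": restartPolicy,
					"containers": []map[string]any{
						{"name": "package", "image": image},
					},
				},
			},
		},
	}
}

func main() {
	job := packageJob("gpu-init-driver-apply-worker-7", "skyhook", "worker-7",
		"example.invalid/driver:1.0", "Never")
	out, err := json.MarshalIndent(job, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```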

## Pod controller: keep the pseudo-pattern

`pod_controller.go` is structured as a pseudo-controller today — events from the Pod watch are mapped into the main `SkyhookReconciler` queue rather than driving a separate reconciler. The comment in the file is honest about the reason: race conditions. The `SkyhookReconciler` reconcile loop watches three kinds (`Skyhook`, `Node`, `Pod`/`Job`) and starts by *grabbing the world* — reading all relevant SCRs, nodes, and child resources before computing desired state and writing back. If a separate `JobReconciler` also did read-modify-write against SCR status / node annotations, the two reconcilers would step on each other regardless of per-controller concurrency limits (`MaxConcurrentReconciles: 1` only serializes *one* controller's workqueue, not across controllers). Routing everything through one SCR-keyed queue is what gives us global serialization for that SCR.

Concretely:

| Option | Shape | Reality |
|---|---|---|
| A. Pseudo-pattern (current) | Job watch → mapper → `SkyhookReconciler` queue | Preserves single-key serialization; the pattern that already works. |
| B. Real `JobReconciler` whose `Reconcile` only enqueues SCR keys | Watch + map, no SCR-state writes | Same race property as A; just spreads the same logic across two files. Marginal cleanup at best. |
| C. Real `JobReconciler` with cross-controller serialization | Per-SCR mutex / leader-style coordination | Real separation of concerns, but adds a synchronization primitive that has to be designed, reasoned about, and tested. |

**Recommendation: keep the pseudo-pattern (A) for this migration.** The split-controller question is real but not made any easier by switching from Pods to Jobs — it's its own design pass that should land separately if/when desired.
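The mapper in option A amounts to translating a Job event into the key of its owning Skyhook. A minimal sketch follows, using stand-in types rather than the real controller-runtime ones (in the operator this logic would sit behind something like `handler.EnqueueRequestsFromMapFunc`); `JobMeta`, `OwnerRef`, and `skyhookKeysForJob` are hypothetical names.

```go
package main

import "fmt"

// OwnerRef and JobMeta are hypothetical stand-ins for metav1.OwnerReference
// and the metadata of a batch/v1 Job.
type OwnerRef struct {
	Kind string
	Name string
}

type JobMeta struct {
	Name   string
	Owners []OwnerRef
}

// skyhookKeysForJob returns the Skyhook names that should be enqueued on the
// single SkyhookReconciler queue when this Job changes. Jobs not owned by a
// Skyhook produce no keys and are therefore ignored.
func skyhookKeysForJob(job JobMeta) []string {
	var keys []string
	for _, o := range job.Owners {
		if o.Kind == "Skyhook" {
			keys = append(keys, o.Name)
		}
	}
	return keys
}

func main() {
	job := JobMeta{
		Name:   "gpu-init-driver-apply-worker-7",
		Owners: []OwnerRef{{Kind: "Skyhook", Name: "gpu-init"}},
	}
	fmt.Println(skyhookKeysForJob(job))
}
```

Because the mapper only produces queue keys and never writes state, all read-modify-write work stays inside the one reconciler, which is the serialization property the pseudo-pattern exists to preserve.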

## Linking related Jobs together

A node going through one full rollout creates many Jobs (uninstall → apply → config → interrupt → post-interrupt, one per package). Reviewers will want to see "these Jobs are all part of one operation." K8s has no native "linked Jobs" primitive (`JobSet` exists but is built for coordinated parallel ML training — wrong shape for sequential per-stage lifecycle). The practical pattern is owner references + labels:

- **OwnerReferences** on every Job point to the Skyhook CR (same as Pods today). Cascade delete + provenance lookup come for free.
- **Labels** for query-ability:
  - `skyhook.nvidia.com/name=<scr>` — already used on pods, keep on Jobs
  - `skyhook.nvidia.com/package=<package>` — already, keep
  - `skyhook.nvidia.com/node=<node>` — already, keep
  - `skyhook.nvidia.com/stage=<stage>` — new
  - `skyhook.nvidia.com/resource-id=<SKYHOOK_RESOURCE_ID>` — promoted from agent env to Job label. Already a per-package-config unique ID; using it as a label ties every stage Job for one rollout of one package on one node into a single queryable group.

Resulting UX:

```bash
# all Jobs in one SCR
kubectl get jobs -l skyhook.nvidia.com/name=gpu-init

# one rollout on one node — uninstall, apply, config, interrupt, post-interrupt
kubectl get jobs -l skyhook.nvidia.com/resource-id=<id>

# every Job touching a single node across SCRs
kubectl get jobs -l skyhook.nvidia.com/node=worker-7
```
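On the operator side, the label set behind these queries could be assembled in one place. The sketch below is an assumption about how that might look, not existing code: `jobLabels` and `selector` are hypothetical helpers, and the example values are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const prefix = "skyhook.nvidia.com/"

// jobLabels returns the labels stamped on every package Job.
func jobLabels(scr, pkg, node, stage, resourceID string) map[string]string {
	return map[string]string{
		prefix + "name":        scr,
		prefix + "package":     pkg,
		prefix + "node":        node,
		prefix + "stage":       stage,
		prefix + "resource-id": resourceID,
	}
}

// selector renders labels as an equality-based selector string, sorted so
// the result is deterministic.
func selector(labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	labels := jobLabels("gpu-init", "driver", "worker-7", "apply", "example-id")
	fmt.Println(selector(labels))
	// Narrower query: every stage Job of one rollout on one node.
	fmt.Println(selector(map[string]string{prefix + "resource-id": "example-id"}))
}
```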

`kubectl skyhook node status` already renders a friendlier rolled-up view of the same data and continues to work — it reads from node annotations, which the operator updates from Job conditions instead of Pod conditions, but the rendered output is unchanged.

## Files affected (high level)

- `operator/internal/controller/skyhook_controller.go` — `createPodFromPackage` → `createJobFromPackage`; `ValidateRunningPackages` and other delete-pod call sites updated to operate on Jobs (with cascading delete pulling child Pods).
- `operator/internal/controller/pod_controller.go` — pseudo-controller becomes a Job-aware reconciler (or a thin wrapper that maps both Job-level and Pod-level events into node-state updates).
- `operator/internal/controller/cluster_state_v2.go` — package-pod lookups → Job lookups (Job → child Pods through `controller-uid` label as needed).
- `operator/internal/wrapper/{skyhook,node}.go` — places that look up pod-by-stage.
- `operator/api/v1alpha1/zz.migration.X.Y.Z.go` — new one-shot migration shim that handles in-flight raw Pods present at upgrade time (delete or let-finish + recreate-as-Job on next reconcile).
- RBAC: add `batch/jobs` verbs (`get;list;watch;create;update;patch;delete`) to operator ClusterRole; can drop or keep pod RBAC depending on whether we still want direct pod ops for cordon/drain edge cases.
- Tests across the above (Ginkgo specs).
- `docs/designs/<date>-package-execution-as-jobs.md` — design doc capturing the choices above so reviewers see the rationale before the PR; include in the same series of PRs.

## Migration

Operator upgrade lands and finds pre-existing raw Pods from the prior version. Two viable paths:

1. **Let in-flight Pods finish, only create Jobs for new stages.** Lower risk, but the operator has to keep the pod-level reconcile path alive for one minor.
2. **Proactively delete in-flight Pods, recreate as Jobs.** Simpler code (remove the dual path) but interrupts an in-flight rollout — reasonable if `interruptionBudget` is honored on recreate.

Recommend (1) for the version that introduces Jobs and a follow-up minor that drops the dual path. The migration shim lives in `zz.migration.<version>.go` per the project's convention; it's a labeled ConfigMap watch / boot-time scan, not a long-running compatibility shim.
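Under path (1), the dual-path decision reduces to classifying each package Pod the operator finds. The sketch below is one possible reading, with loudly stated assumptions: that legacy Pods can be told apart by not being owned by a Job, and that finished legacy Pods are handed to the existing pod-level path while the next stage is created as a Job. `legacyPodAction` and the action names are hypothetical.

```go
package main

import "fmt"

// Action names are illustrative, not existing operator constants.
const (
	HandledByJobPath = "job-path"        // Pod is a Job child; the Job is the unit of truth
	LetFinish        = "let-finish"      // legacy Pod still in flight; do not interrupt it
	LegacyReconcile  = "legacy-pod-path" // legacy Pod finished; process via the old pod path
)

// legacyPodAction decides how a package Pod found after upgrade is treated.
// ownedByJob is assumed to be derived from the Pod's owner references.
func legacyPodAction(ownedByJob bool, phase string) string {
	if ownedByJob {
		return HandledByJobPath
	}
	switch phase {
	case "Succeeded", "Failed":
		return LegacyReconcile
	default: // Pending, Running, Unknown
		return LetFinish
	}
}

func main() {
	fmt.Println(legacyPodAction(false, "Running"))
	fmt.Println(legacyPodAction(false, "Succeeded"))
	fmt.Println(legacyPodAction(true, "Running"))
}
```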

## Acceptance criteria

- All package execution goes through `batch/v1` Jobs on a fresh install. `kubectl get jobs -n skyhook` shows one Job per (SCR, package, stage, node) currently in flight.
- `kubectl logs job/<job-name>` returns the running stage's logs without the user needing to know the Pod name.
- A package whose image is unpullable (`ImagePullBackOff` / `ErrImagePull`) does not retry indefinitely. The Job goes to `Failed`, the operator surfaces it as `state: erroring`. Path: either `podFailurePolicy` declarative rule, or operator-side event watch — whichever works cleanly in testing.
- Existing chainsaw e2e (`make e2e-tests`) and helm e2e (`make helm-tests`) suites are green; in particular the cordon/drain/interrupt sequencing still works.
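- Successful Jobs auto-clean after `ttlSecondsAfterFinished`; failed Jobs stick until operator (or user) intervention. Since the TTL applies to failed Jobs too, this means the operator must not leave a TTL in place on Jobs that fail.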
- An upgrade from a prior raw-Pod version to this Jobs version completes without leaving orphaned package Pods (the migration shim is exercised in a chainsaw upgrade test).
- New design doc under `docs/designs/`.

## Implications for other tickets

- **Possibly addresses the ImagePullBackOff retry gap** — today, when a package's image is unpullable, the pod retries indefinitely instead of being surfaced as a hard failure. `podFailurePolicy` *might* handle this declaratively; needs verification of whether `ImagePullBackOff` manifests as a matchable Pod condition or container status. If yes, the gap closes here; if not, the operator-side event-watch fix remains its own follow-up.
- **Likely supersedes the "kubectl-skyhook debug collect" idea** — collecting CR yaml + node annotations + pod logs + events into a tarball is much cheaper once it's `kubectl logs job/<n>` + `kubectl describe job/<n>` + `kubectl get events`. Worth deferring any custom debug-bundle work until/unless a real pain pattern surfaces post-Jobs.

## Out of scope

- CRD changes. Stage / state / package model is unchanged.
- Replacing the cordon/drain code path. Cordon/drain remains node-level orchestration owned by the operator; Jobs only host the per-stage execution.
- Job-level parallelism (running the same package stage on multiple nodes from one Job). Sticking with one Job per (skyhook, package, stage, node) keeps lifecycle reasoning unchanged.
- Replacing the Pod controller entirely. A thin Pod-level watch may still be useful for granular state transitions during long-running stages.
- **DeploymentPolicy logic is not affected.** DeploymentPolicy controls *which* nodes are eligible to run and *how many* concurrently (compartments, budgets, fixed/linear/exponential strategies), and includes substantial cross-node failure handling — `failureThreshold` (consecutive failures before halting), `safetyLimit`, `batchThreshold`, strategy-specific slowdown on failure, and status fields like `ConsecutiveFailures`, `FailedNodes`, `ShouldStop`. K8s has no primitive that replaces this role (PDB is the closest but solves disruption, not rollout shaping). The reconcile loop that consults DeploymentPolicy to pick the next eligible node and to decide whether to halt continues unchanged.

  The Jobs migration *does* clean up the **signal source** that feeds DeploymentPolicy: today the "node failed" signal is computed in `pod_controller.go` from pod phase + container statuses + restartCount; after the migration it's the Job's `Failed=True` condition — single source of truth, easier to reason about. DeploymentPolicy still owns what to do with that signal across the rollout.
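Reading the "node failed" signal from Job conditions could look roughly like the sketch below. `JobCondition` is a stand-in for the `batch/v1` condition type and `jobOutcome` is a hypothetical helper; the condition types (`Complete`, `Failed`) and the example reasons are the standard ones the Job controller sets.

```go
package main

import "fmt"

// JobCondition is a hypothetical stand-in for batchv1.JobCondition.
type JobCondition struct {
	Type   string // "Complete" or "Failed" are the terminal types
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "BackoffLimitExceeded", "DeadlineExceeded"
}

// jobOutcome collapses a Job's conditions into the signal the rollout logic
// consumes: "failed", "complete", or "running", plus the reason if terminal.
func jobOutcome(conds []JobCondition) (outcome, reason string) {
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Failed":
			return "failed", c.Reason
		case "Complete":
			return "complete", c.Reason
		}
	}
	return "running", ""
}

func main() {
	outcome, reason := jobOutcome([]JobCondition{
		{Type: "Failed", Status: "True", Reason: "DeadlineExceeded"},
	})
	fmt.Println(outcome, reason)
}
```

DeploymentPolicy would consume only the outcome; what to do with a failure across the rollout stays where it is today.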

## References

- Kubernetes Job docs: <https://kubernetes.io/docs/concepts/workloads/controllers/job/>
- `podFailurePolicy` (candidate for handling the ImagePullBackOff case declaratively, pending verification): <https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy>
- Existing pseudo-pod-controller: `operator/internal/controller/pod_controller.go`
- Pod creation: `operator/internal/controller/skyhook_controller.go` (`createPodFromPackage`)

