feat(kubelet): probe restartable init containers (sidecars) + readiness by indyjonesnl · Pull Request #1024 · indyjonesnl/rusternetes

indyjonesnl · 2026-06-09T17:09:08Z

Node-conformance SidecarContainers cluster. Restartable init containers (initContainers with restartPolicy=Always = sidecars) were never probed, restarted, marked ready, or even accepted with a readinessProbe. Three faithful-to-upstream fixes:

1. Probe + restart sidecars (runtime.rs check_liveness → evaluate_container_liveness, shared with regular containers): a sidecar's failed liveness probe stops just that container; reconcile_container_restarts + has_terminated_containers now include sidecars, so it's recreated with CrashLoopBackOff and its restartCount lands in init_container_statuses. Per-container restart, not whole-pod (upstream kuberuntime_manager.computePodActions). Verified live: sidecar with a failing liveness probe restarts individually (count climbs), pod stays Running, main container untouched.
2. Sidecar readiness (runtime.rs get_init_container_statuses + kubelet.rs ContainersReady): a running sidecar's ready comes from its readiness probe (initialDelay + threshold); sidecar readiness counts toward the pod's ContainersReady (upstream status/generate.go).
3. Validation (common/src/validation/pod.rs): allow readinessProbe/lifecycle on restartable init containers (was forbidden for all init containers; upstream forbids only without restartPolicy=Always). Verified live: a sidecar-with-readinessProbe pod is now accepted (was rejected at creation).

Mirrors upstream pkg/kubelet/prober (probes append(Containers, restartableInits)), kuberuntime_manager.computePodActions, status/generate.go, and validateInitContainers.

Full focused-conformance validation was blocked by a host mount-table exhaustion on the dev box (unrelated infra); the three behaviors are each verified live in isolation. Remaining #56 item: timeoutGracePeriodSeconds override spec.

🤖 Generated with Claude Code

indyjonesnl · 2026-06-09T21:30:56Z

Conformance validation (local, compose.sqlite.yml + dind, e2e.test v1.35)

Focused run of the 23 [FeatureGate:SidecarContainers] Probing restartable init container specs against this branch:

- Before this push: 18/23 passed.
- After rebasing on main (picks up "fix(kubelet): don't follow redirects in HTTP probes" #1023) plus the startup-probe-restart fix (024032d): 20/23 passed.

The startup-probe fix flips both of these specs to passing:

- should be restarted startup probe fails ✅
- should override timeoutGracePeriodSeconds when StartupProbe field is set ✅

(Verified live: a sidecar whose startup probe exceeds failureThreshold now restarts, restartCount increments and is correctly placed in initContainerStatuses.)

Remaining 3 specs — tracked as follow-ups (out of this PR's scope)

| Spec | Root cause | Tracked |
| --- | --- | --- |
| `should not be restarted with a non-local redirect http liveness probe` | needs locality-aware HTTP redirect (follow same-host, stop on cross-host) + `ProbeWarning` event emission | queued |
| `should mark readiness on pods to false … while … terminating` (×2) | broader graceful-termination gap: pods are hard-deleted within ~4s instead of lingering in `Terminating` with readiness flipping false / liveness disabled | queued |

Restartable init containers (initContainers with restartPolicy=Always = sidecars) were never probed or restarted. Now:

- check_liveness evaluates liveness/startup probes on sidecars too (factored into evaluate_container_liveness, shared with regular containers). On failure the sidecar is stopped individually (not a whole-pod restart, per upstream computePodActions) so reconcile recreates just it.
- has_terminated_containers + reconcile_container_restarts now include restartable init containers, so a sidecar that exits/crashes is restarted with CrashLoopBackOff and its restartCount is published to init_container_statuses.

Verified live: a sidecar with a failing liveness probe restarts individually (initContainerStatuses[].restartCount climbs), the pod stays Running, and the main container is untouched.

Matches upstream pkg/kubelet/prober + kuberuntime_manager.computePodActions. Part of #65 / #56 node-conformance SidecarContainers probing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
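The selection rule described in this commit can be sketched as follows. This is a minimal illustration, not the code in runtime.rs: `ContainerKind`, `ContainerState`, `restart_candidates`, and `crash_loop_delay_secs` are hypothetical names, and the backoff constants (10s base, doubling, 300s cap) are an assumption modeled on upstream kubelet defaults rather than something the commit states.

```rust
/// Hypothetical classification of a container within a pod.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContainerKind {
    Regular,
    /// Init container with restartPolicy=Always (a sidecar).
    RestartableInit,
    /// Ordinary run-to-completion init container.
    PlainInit,
}

/// Hypothetical, simplified runtime view of one container.
#[derive(Debug, Clone)]
struct ContainerState {
    name: String,
    kind: ContainerKind,
    terminated: bool,
    restart_count: u32,
}

/// Names of containers that should be recreated individually.
/// Sidecars are included; plain init containers that have exited are not.
fn restart_candidates(containers: &[ContainerState]) -> Vec<&str> {
    containers
        .iter()
        .filter(|c| c.terminated)
        .filter(|c| {
            matches!(
                c.kind,
                ContainerKind::Regular | ContainerKind::RestartableInit
            )
        })
        .map(|c| c.name.as_str())
        .collect()
}

/// Assumed CrashLoopBackOff schedule: 10s, doubling per restart, capped at 300s.
fn crash_loop_delay_secs(restart_count: u32) -> u64 {
    let doublings = restart_count.min(10); // bound the shift
    (10u64 << doublings).min(300)
}
```

Only the terminated sidecar is returned as a candidate, which is the per-container (not whole-pod) behavior the commit describes.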

A running sidecar's readiness now comes from its readiness probe (initialDelaySeconds + threshold), mirroring a regular container; a started sidecar without a readiness probe is ready. Sidecar readiness counts toward the pod's ContainersReady condition (all sidecars must be ready), matching upstream pkg/kubelet/status/generate.go. Plain init containers are excluded from steady-state readiness.

Part of #65 / #56 SidecarContainers probing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
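A rough sketch of the readiness rule above, under simplifying assumptions: `Kind`, `ContainerView`, `container_ready`, and `containers_ready` are hypothetical names, and the probe result is taken as an already-computed `Option<bool>` (with initialDelaySeconds and thresholds applied elsewhere).

```rust
/// Hypothetical container classification for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Regular,
    Sidecar,   // init container with restartPolicy=Always
    PlainInit, // run-to-completion init container
}

/// Hypothetical, simplified view of one container's state.
#[derive(Debug, Clone)]
struct ContainerView {
    kind: Kind,
    started: bool,
    /// None means no readiness probe is configured.
    readiness_probe_result: Option<bool>,
}

/// A started container is ready if its readiness probe passed,
/// or if it has no readiness probe at all.
fn container_ready(c: &ContainerView) -> bool {
    c.started && c.readiness_probe_result.unwrap_or(true)
}

/// ContainersReady: every regular container and every sidecar must be ready.
/// Plain init containers do not take part in steady-state readiness.
fn containers_ready(containers: &[ContainerView]) -> bool {
    containers
        .iter()
        .filter(|c| c.kind != Kind::PlainInit)
        .all(container_ready)
}
```

With this shape, a single unready sidecar is enough to hold the pod's ContainersReady condition at false.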

Pod validation forbade readinessProbe and lifecycle on ALL init containers, rejecting sidecar pods (restartPolicy=Always) at creation ("Forbidden: must not be set for init containers"). Upstream (validateInitContainers) forbids these only for init containers WITHOUT restartPolicy=Always. Gate the checks on !restartable so sidecars may carry readinessProbe/lifecycle like regular containers. Adds a positive test.

Part of #65 / #56 SidecarContainers probing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
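The gating described here might look roughly like the sketch below. The struct and function names are hypothetical and the fields are reduced to booleans; the real check in common/src/validation/pod.rs operates on the full pod types.

```rust
/// Hypothetical, reduced view of an init container spec.
struct InitContainerSpec {
    name: String,
    restart_policy: Option<String>,
    has_readiness_probe: bool,
    has_lifecycle: bool,
}

/// An init container is restartable (a sidecar) when restartPolicy=Always.
fn is_restartable(c: &InitContainerSpec) -> bool {
    c.restart_policy.as_deref() == Some("Always")
}

/// Forbid readinessProbe/lifecycle only on non-restartable init containers.
fn validate_init_container(c: &InitContainerSpec) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    if !is_restartable(c) {
        if c.has_readiness_probe {
            errors.push(format!(
                "{}.readinessProbe: Forbidden: must not be set for init containers",
                c.name
            ));
        }
        if c.has_lifecycle {
            errors.push(format!(
                "{}.lifecycle: Forbidden: must not be set for init containers",
                c.name
            ));
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

A sidecar carrying both fields passes, while a plain init container with a readinessProbe is still rejected with one error.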

…hold

A restartable-init container (sidecar) whose startup probe fails past its failureThreshold MUST be killed and restarted, even if the liveness probe would succeed (upstream kuberuntime_manager.go computePodActions). The probe evaluator only logged a warning and returned false (no restart), so the sidecar never restarted on startup failure and restartCount stayed 0.

Replace the binary startup_passed flag with a three-way outcome (Passed / Pending / Failed): Pending gates the liveness probe as before, Passed activates liveness, and Failed (threshold exceeded) now returns true so the caller restarts the container and bumps its restartCount.

Fixes the conformance spec "Probing restartable init container should be restarted startup probe fails".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
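The three-way outcome could be modeled along these lines. This is a sketch under assumptions: the enum variants follow the commit message, but `startup_outcome`, `needs_restart`, and their parameters are hypothetical, and `liveness_failed` stands for "the liveness probe has already exceeded its own threshold".

```rust
/// Three-way startup-probe outcome, replacing a binary passed/not-passed flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartupOutcome {
    /// No startup probe configured, or it has succeeded: liveness is active.
    Passed,
    /// Still failing but under failureThreshold: liveness stays gated.
    Pending,
    /// failureThreshold reached: the container must be restarted.
    Failed,
}

/// Classify the startup probe state (hypothetical inputs).
fn startup_outcome(
    has_startup_probe: bool,
    succeeded: bool,
    consecutive_failures: u32,
    failure_threshold: u32,
) -> StartupOutcome {
    if !has_startup_probe || succeeded {
        StartupOutcome::Passed
    } else if consecutive_failures >= failure_threshold {
        StartupOutcome::Failed
    } else {
        StartupOutcome::Pending
    }
}

/// Decide whether the caller should restart the container.
fn needs_restart(startup: StartupOutcome, liveness_failed: bool) -> bool {
    match startup {
        StartupOutcome::Pending => false,          // liveness is not consulted yet
        StartupOutcome::Passed => liveness_failed, // normal liveness behavior
        StartupOutcome::Failed => true,            // restart even if liveness is healthy
    }
}
```

The key difference from the old boolean is the `Failed` arm: a startup probe past its threshold forces a restart regardless of the liveness result.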

The HTTP prober used redirect::Policy::none(), which treats every 3xx as the final response. That made a *local* redirect (e.g. /redirect?loc=/healthz) look like an instant success instead of following it to the failing target, so the "restarted with a local redirect" sidecar spec never restarted; and while a non-local redirect did return 0 restarts, no ProbeWarning event was emitted so that spec failed too.

Mirror upstream pkg/probe/http/request.go RedirectChecker(followNonLocal=false): follow same-host redirects (cap 10), but stop on a cross-host redirect and surface the 3xx response. A stopped non-local redirect is a probe success (200-399) and now emits a ProbeWarning event carrying the response body ("Probe terminated redirects, Response body: ..."), which the non-local spec waits for. A same-host redirect is followed to its real target, whose status decides success/failure (local redirect to a failing endpoint still restarts).

Threads Option<&Pod> through check_probe/check_http_probe for event emission; readiness/startup callers pass None, the liveness path passes the pod.

Fixes the sidecar specs "should be restarted with a local redirect http liveness probe" and "should *not* be restarted with a non-local redirect http liveness probe".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
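The redirect decision can be isolated as a pure function, sketched below without any HTTP client. All names here (`RedirectAction`, `redirect_action`, `ProbeResult`, `classify`) are hypothetical, hosts are assumed to be already extracted from the URLs, and the case-insensitive host comparison is a choice of this sketch rather than something the commit specifies.

```rust
/// Cap on followed same-host redirects, as stated in the commit message.
const MAX_REDIRECTS: usize = 10;

/// What the prober should do with one redirect response.
#[derive(Debug, PartialEq, Eq)]
enum RedirectAction {
    /// Same host and under the cap: follow to the real target.
    Follow,
    /// Cross-host redirect: stop and surface the 3xx response.
    StopNonLocal,
    /// Too many same-host hops: treat as an error.
    TooManyRedirects,
}

/// Decide how to handle a redirect from `original_host` to `next_host`
/// after `hops_so_far` redirects have already been followed.
fn redirect_action(original_host: &str, next_host: &str, hops_so_far: usize) -> RedirectAction {
    if !original_host.eq_ignore_ascii_case(next_host) {
        RedirectAction::StopNonLocal
    } else if hops_so_far >= MAX_REDIRECTS {
        RedirectAction::TooManyRedirects
    } else {
        RedirectAction::Follow
    }
}

/// Outcome of an HTTP probe once the final response is known.
#[derive(Debug, PartialEq, Eq)]
enum ProbeResult {
    Success,
    /// Counts as success, but a ProbeWarning event should be emitted.
    Warning,
    Failure,
}

/// 200-399 is success; a stopped non-local redirect is a success with a warning.
fn classify(status: u16, stopped_on_non_local_redirect: bool) -> ProbeResult {
    if (200..400).contains(&status) {
        if stopped_on_non_local_redirect {
            ProbeResult::Warning
        } else {
            ProbeResult::Success
        }
    } else {
        ProbeResult::Failure
    }
}
```

A local redirect to a failing endpoint is followed and ends in `Failure`, which triggers the restart; a cross-host redirect ends in `Warning` and does not.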

A pod in the process of terminating must report Ready=False while its containers drain, and its liveness probes must be disabled so a container that fails its health check during shutdown (e.g. an app that removes its health file on SIGTERM) is not restarted mid-termination.

- On entering TerminatingPod, persist Ready=False / ContainersReady=False (and clear per-container ready flags) BEFORE the blocking container stop, so watchers observe the readiness flip during the grace period.
- check_liveness short-circuits to "no restart" when deletionTimestamp is set.

Mirrors upstream status_manager (terminating pod is NotReady) + prober_manager (probes stopped on pod deletion).

Fixes the sidecar conformance specs "should mark readiness on pods to false while pod is in progress of terminating when a pod has a readiness probe" and "should mark readiness on pods to false and disable liveness probes while pod is in progress of terminating".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
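A small sketch of the two behaviors above, using hypothetical stand-in types (`PodStatusView`, `PodView`) rather than the project's real pod structs. Persisting the status to the API server before stopping containers is left out.

```rust
/// Hypothetical, reduced pod status for this sketch.
#[derive(Debug, Default)]
struct PodStatusView {
    ready: bool,
    containers_ready: bool,
    container_ready_flags: Vec<bool>,
}

/// Hypothetical, reduced pod view.
struct PodView {
    /// Set once deletion has been requested.
    deletion_timestamp: Option<String>,
    status: PodStatusView,
}

/// Flip the pod and every container to not-ready. In the flow described
/// above this happens before the blocking container stop.
fn mark_terminating_not_ready(status: &mut PodStatusView) {
    status.ready = false;
    status.containers_ready = false;
    for flag in status.container_ready_flags.iter_mut() {
        *flag = false;
    }
}

/// Liveness failures never trigger a restart while the pod is terminating.
fn liveness_may_restart(pod: &PodView, probe_failed: bool) -> bool {
    if pod.deletion_timestamp.is_some() {
        return false;
    }
    probe_failed
}
```

Ordering matters here: the readiness flip has to be visible to watchers during the grace period, so it cannot wait until the containers have stopped.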

A pod with spec.os.name set to a non-linux value scheduled onto a Linux node is now rejected before any container starts: Phase=Failed, reason=PodOSNotSupported. Mirrors upstream kubelet's PodOS admit handler.

Fixes node-conformance "PodOSRejection should reject pod when the node OS doesn't match pod's OS".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
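The admission check amounts to a comparison of two strings, sketched here with hypothetical names (`Admission`, `admit_pod_os`). The phase and reason values are the ones named in the commit message; treating an unset spec.os as admissible is an assumption of this sketch.

```rust
/// Result of the OS admission check.
#[derive(Debug, PartialEq, Eq)]
enum Admission {
    Admit,
    Reject {
        phase: &'static str,
        reason: &'static str,
    },
}

/// Reject a pod whose spec.os.name is set and differs from the node's OS.
/// A pod with no spec.os is admitted (assumed behavior).
fn admit_pod_os(pod_os: Option<&str>, node_os: &str) -> Admission {
    match pod_os {
        Some(os) if os != node_os => Admission::Reject {
            phase: "Failed",
            reason: "PodOSNotSupported",
        },
        _ => Admission::Admit,
    }
}
```

Because the check runs before any container starts, a rejected pod goes straight to the Failed phase.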

A hostNetwork pod's container gets the container runtime's own /etc/hosts, which lacks the pod's spec.hostAliases. When hostAliases are set, build a managed /etc/hosts from the node's /etc/hosts plus the alias entries and bind it into the container (upstream ensureHostsFile useHostNetwork branch). Pods without hostAliases keep the default file.

Fixes node-conformance "Kubelet ... with hostAliases and hostNetwork should write entries to /etc/hosts when hostNetwork is enabled".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
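Building the managed file is mostly string assembly, as in the sketch below. `HostAlias` and `managed_hosts_file` are hypothetical names, and the header comment and tab-separated layout are assumptions modeled on the upstream kubelet's output rather than details taken from this commit.

```rust
/// Hypothetical mirror of one spec.hostAliases entry.
struct HostAlias {
    ip: String,
    hostnames: Vec<String>,
}

/// Build the managed /etc/hosts content for a hostNetwork pod.
/// Returns None when there are no aliases, meaning the default file is kept.
fn managed_hosts_file(node_hosts: &str, aliases: &[HostAlias]) -> Option<String> {
    if aliases.is_empty() {
        return None;
    }
    // Start from the node's own /etc/hosts content.
    let mut out = String::from(node_hosts);
    if !out.is_empty() && !out.ends_with('\n') {
        out.push('\n');
    }
    // Assumed marker line, then one line per alias: IP followed by hostnames.
    out.push_str("# Entries added by HostAliases.\n");
    for alias in aliases {
        out.push_str(&format!("{}\t{}\n", alias.ip, alias.hostnames.join("\t")));
    }
    Some(out)
}
```

The caller would write the returned content to a file and bind-mount it at /etc/hosts in the container.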

…n restart

When a liveness or startup probe fails and triggers a restart, the kubelet stopped the pod with the POD's terminationGracePeriodSeconds. With a long pod grace (e.g. 500s) the sandbox/pause container stop blocked for the full grace, so the restart never completed within the probe test's observation window and restartCount stayed 0. Upstream uses the failing probe's own terminationGracePeriodSeconds to kill the unit.

check_liveness / evaluate_container_liveness now return the effective grace (probe-level terminationGracePeriodSeconds, falling back to the pod's, then 30) instead of a bool, and the liveness-restart path stops the pod with that grace.

Fixes node-conformance "Probing container should override timeoutGracePeriodSeconds when LivenessProbe/StartupProbe field is set".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
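The fallback chain can be written as a short `Option` combinator, sketched below. The function names are hypothetical, and modeling the new return value as `Option<i64>` (Some(grace) means restart, None means no restart) is an assumption about the shape, since the commit only says a grace value replaces the bool.

```rust
/// Final fallback when neither the probe nor the pod sets a grace period.
const DEFAULT_GRACE_SECONDS: i64 = 30;

/// Probe-level terminationGracePeriodSeconds wins, then the pod's, then 30.
fn effective_grace(probe_grace: Option<i64>, pod_grace: Option<i64>) -> i64 {
    probe_grace.or(pod_grace).unwrap_or(DEFAULT_GRACE_SECONDS)
}

/// Assumed shape of the liveness decision: Some(grace) requests a restart
/// using that grace period, None requests no restart.
fn liveness_decision(
    probe_failed: bool,
    probe_grace: Option<i64>,
    pod_grace: Option<i64>,
) -> Option<i64> {
    if probe_failed {
        Some(effective_grace(probe_grace, pod_grace))
    } else {
        None
    }
}
```

With a pod grace of 500s and a probe grace of 10s, the restart path stops the container with 10s, which is what lets the restart land inside the test's observation window.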

indyjonesnl force-pushed the feat/sidecar-probes branch from 5fadf5d to 024032d Compare June 9, 2026 21:30

indyjonesnl and others added 6 commits June 9, 2026 23:53

indyjonesnl force-pushed the feat/sidecar-probes branch from 024032d to 01439db Compare June 9, 2026 22:33

indyjonesnl and others added 3 commits June 10, 2026 01:13

indyjonesnl merged commit f1427a2 into main Jun 10, 2026
3 checks passed

indyjonesnl mentioned this pull request Jun 10, 2026

Node conformance: networking granular checks + Sysctls + projected configMap #1050

Open

indyjonesnl merged 9 commits into main from feat/sidecar-probes
