feat(kubelet): probe restartable init containers (sidecars) + readiness #1024
Merged
5fadf5d to 024032d
Conformance validation (local, compose.sqlite.yml + dind, e2e.test v1.35)
Focused run of the 23 …
The startup-probe fix flips both: …
(Verified live: a sidecar whose startup probe exceeds …
Remaining 3 specs: tracked as follow-ups (out of this PR's scope)
Restartable init containers (initContainers with restartPolicy=Always, i.e. sidecars) were never probed or restarted. Now:
- check_liveness evaluates liveness/startup probes on sidecars too (factored into evaluate_container_liveness, shared with regular containers). On failure the sidecar is stopped individually (not a whole-pod restart, per upstream computePodActions), so reconcile recreates only that container.
- has_terminated_containers + reconcile_container_restarts now include restartable init containers, so a sidecar that exits or crashes is restarted with CrashLoopBackOff and its restartCount is published to init_container_statuses.
Verified live: a sidecar with a failing liveness probe restarts individually (initContainerStatuses[].restartCount climbs), the pod stays Running, and the main container is untouched.
Matches upstream pkg/kubelet/prober + kuberuntime_manager.computePodActions.
Part of #65 / #56 node-conformance SidecarContainers probing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
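The core idea of this commit is that the set of probed containers becomes "regular containers plus sidecars", and a failure stops only the failing container. The sketch below illustrates that selection logic under stated assumptions. PodSpec, Container, RestartPolicy, probed_containers and containers_to_restart are hypothetical simplified stand-ins, not the PR's real runtime.rs types. The liveness_failed closure stands in for the shared evaluate_container_liveness check.

```rust
// Hypothetical, simplified types; the PR's real runtime.rs structs are not shown here.
#[derive(Debug, Clone, PartialEq)]
enum RestartPolicy {
    Always,
}

#[derive(Debug, Clone)]
struct Container {
    name: String,
    // Only meaningful on init containers: Some(Always) marks a sidecar.
    restart_policy: Option<RestartPolicy>,
}

struct PodSpec {
    init_containers: Vec<Container>,
    containers: Vec<Container>,
}

/// An init container with restartPolicy=Always is a restartable init container (sidecar).
fn is_restartable_init(c: &Container) -> bool {
    c.restart_policy == Some(RestartPolicy::Always)
}

/// Containers whose liveness/startup probes are evaluated: regular containers
/// followed by sidecars, in the spirit of upstream's append(Containers, restartableInits).
/// Plain run-to-completion init containers are skipped.
fn probed_containers(spec: &PodSpec) -> Vec<&Container> {
    spec.containers
        .iter()
        .chain(spec.init_containers.iter().filter(|c| is_restartable_init(c)))
        .collect()
}

/// Names of containers to stop individually (not a whole-pod restart).
/// `liveness_failed` stands in for the shared evaluate_container_liveness check.
fn containers_to_restart<'a>(
    spec: &'a PodSpec,
    liveness_failed: impl Fn(&Container) -> bool,
) -> Vec<&'a str> {
    probed_containers(spec)
        .into_iter()
        .filter(|c| liveness_failed(*c))
        .map(|c| c.name.as_str())
        .collect()
}
```

Consider a pod with a plain init container "setup", a sidecar "proxy" and a main container "app". Here probed_containers yields ["app", "proxy"], and a failing "proxy" yields only ["proxy"] to restart.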
A running sidecar's readiness now comes from its readiness probe (initialDelaySeconds + threshold), mirroring a regular container; a started sidecar without a readiness probe is ready. Sidecar readiness counts toward the pod's ContainersReady condition (all sidecars must be ready), matching upstream pkg/kubelet/status/generate.go. Plain init containers are excluded from steady-state readiness.
Part of #65 / #56 SidecarContainers probing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
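The readiness rule above can be sketched as two small predicates. This is an illustration under assumptions, not the PR's code. SidecarState, ReadinessProbe, InitKind, sidecar_ready and containers_ready are hypothetical names. Readiness is modeled simply as "past initialDelaySeconds and at least successThreshold consecutive successes".

```rust
/// Hypothetical readiness-probe parameters (a subset of the Kubernetes Probe fields).
struct ReadinessProbe {
    initial_delay_seconds: u64,
    success_threshold: u32,
}

/// Hypothetical snapshot of a sidecar as seen by the status generator.
struct SidecarState {
    running: bool,
    started: bool,
    seconds_since_start: u64,
    consecutive_successes: u32,
    probe: Option<ReadinessProbe>,
}

/// A running, started sidecar is ready if it has no readiness probe, or once its
/// probe is past initialDelaySeconds and has met its success threshold.
fn sidecar_ready(s: &SidecarState) -> bool {
    if !s.running || !s.started {
        return false;
    }
    match &s.probe {
        None => true,
        Some(p) => {
            s.seconds_since_start >= p.initial_delay_seconds
                && s.consecutive_successes >= p.success_threshold
        }
    }
}

enum InitKind {
    Plain,   // run-to-completion init container
    Sidecar, // restartPolicy=Always
}

/// ContainersReady: every regular container and every sidecar must be ready;
/// plain init containers do not take part in steady-state readiness.
fn containers_ready(regular: &[bool], init: &[(InitKind, bool)]) -> bool {
    regular.iter().all(|ready| *ready)
        && init.iter().all(|(kind, ready)| match kind {
            InitKind::Plain => true,
            InitKind::Sidecar => *ready,
        })
}
```

Under this model, a completed plain init container (ready=false) never blocks ContainersReady. A not-yet-ready sidecar does block it.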
Pod validation forbade readinessProbe and lifecycle on ALL init containers,
rejecting sidecar pods (restartPolicy=Always) at creation
("Forbidden: must not be set for init containers"). Upstream
(validateInitContainers) forbids these only for init containers WITHOUT
restartPolicy=Always. Gate the checks on !restartable so sidecars may carry
readinessProbe/lifecycle like regular containers. Adds a positive test.
Part of #65 / #56 SidecarContainers probing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
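The gating described above fits in a few lines. This is a minimal sketch, assuming a hypothetical flattened InitContainer struct and a hypothetical validate_init_container function. The field-path format of the error strings is also an assumption; only the "Forbidden: must not be set for init containers" wording comes from the commit message.

```rust
/// Hypothetical flattened view of one entry in spec.initContainers.
struct InitContainer {
    restartable: bool, // restartPolicy=Always
    has_readiness_probe: bool,
    has_lifecycle: bool,
}

/// Returns validation errors for init container `idx`. The readinessProbe and
/// lifecycle checks are gated on !restartable, so sidecars may carry them.
fn validate_init_container(idx: usize, c: &InitContainer) -> Vec<String> {
    let mut errs = Vec::new();
    if !c.restartable {
        if c.has_readiness_probe {
            errs.push(format!(
                "spec.initContainers[{}].readinessProbe: Forbidden: must not be set for init containers",
                idx
            ));
        }
        if c.has_lifecycle {
            errs.push(format!(
                "spec.initContainers[{}].lifecycle: Forbidden: must not be set for init containers",
                idx
            ));
        }
    }
    errs
}
```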
…hold
A restartable-init container (sidecar) whose startup probe fails past its failureThreshold MUST be killed and restarted, even if the liveness probe would succeed (upstream kuberuntime_manager.go computePodActions). The probe evaluator only logged a warning and returned false (no restart), so the sidecar never restarted on startup failure and its restartCount stayed 0.
Replace the binary startup_passed flag with a three-way outcome (Passed / Pending / Failed): Pending gates the liveness probe as before, Passed activates liveness, and Failed (threshold exceeded) now returns true so the caller restarts the container and bumps its restartCount.
Fixes the conformance spec "Probing restartable init container should be restarted startup probe fails".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
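The three-way outcome might look roughly like the sketch below. StartupOutcome's variant names come from the commit message. The startup_outcome and should_restart helpers are hypothetical, and treating "consecutive failures >= failureThreshold" as Failed is an assumption about where the threshold boundary sits.

```rust
/// Three-way startup outcome replacing a binary startup_passed flag.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StartupOutcome {
    Passed,  // probe has succeeded: liveness is now active
    Pending, // not yet succeeded, still under failureThreshold
    Failed,  // failureThreshold reached without a success
}

fn startup_outcome(has_succeeded: bool, consecutive_failures: u32, failure_threshold: u32) -> StartupOutcome {
    if has_succeeded {
        StartupOutcome::Passed
    } else if consecutive_failures >= failure_threshold {
        StartupOutcome::Failed
    } else {
        StartupOutcome::Pending
    }
}

/// Whether the container must be killed and restarted.
fn should_restart(startup: StartupOutcome, liveness_failed: bool) -> bool {
    match startup {
        // Liveness is gated until startup passes.
        StartupOutcome::Pending => false,
        // Startup failure restarts the container even if liveness would succeed.
        StartupOutcome::Failed => true,
        StartupOutcome::Passed => liveness_failed,
    }
}
```

The key difference from a bool flag is the Failed arm. It returns true without consulting liveness, which is what lets the restartCount climb.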
The HTTP prober used redirect::Policy::none(), which treats every 3xx as the
final response. That made a *local* redirect (e.g. /redirect?loc=/healthz) look
like an instant success instead of following it to the failing target, so the
"restarted with a local redirect" sidecar spec never restarted; and while a
non-local redirect did return 0 restarts, no ProbeWarning event was emitted so
that spec failed too.
Mirror upstream pkg/probe/http/request.go RedirectChecker(followNonLocal=false):
follow same-host redirects (cap 10), but stop on a cross-host redirect and
surface the 3xx response. A stopped non-local redirect is a probe success
(200-399) and now emits a ProbeWarning event carrying the response body
("Probe terminated redirects, Response body: ..."), which the non-local spec
waits for. A same-host redirect is followed to its real target, whose status
decides success/failure (local redirect to a failing endpoint still restarts).
Threads Option<&Pod> through check_probe/check_http_probe for event emission;
readiness/startup callers pass None, the liveness path passes the pod.
Fixes the sidecar specs "should be restarted with a local redirect http
liveness probe" and "should *not* be restarted with a non-local redirect http
liveness probe".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
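The redirect rule and the status classification can be expressed as pure functions. This is a sketch modeled on the upstream behavior described above: stop on a different hostname, otherwise follow up to 10 hops, and treat 3xx as a warning. RedirectDecision, ProbeResult, redirect_decision and classify_http_status are hypothetical names. In a reqwest-based prober, a decision like this would typically be wired in through redirect::Policy::custom, which is not shown here.

```rust
#[derive(Debug, PartialEq)]
enum RedirectDecision {
    Follow,              // same host, under the hop cap
    StopUseLastResponse, // cross-host: surface the 3xx response itself
    TooMany,             // same host, but the 10-redirect cap was reached
}

/// `original_host`: hostname of the first request in the chain.
/// `target_host`: hostname the Location header points to.
/// `hops`: requests already made in this chain.
fn redirect_decision(original_host: &str, target_host: &str, hops: usize) -> RedirectDecision {
    if target_host != original_host {
        RedirectDecision::StopUseLastResponse
    } else if hops >= 10 {
        RedirectDecision::TooMany
    } else {
        RedirectDecision::Follow
    }
}

#[derive(Debug, PartialEq)]
enum ProbeResult {
    Success,
    Warning(String), // counts as success, but should emit a ProbeWarning event
    Failure,
}

/// Classifies the final response: 2xx succeeds, a surfaced 3xx (stopped
/// non-local redirect) is a warning, and anything else fails.
fn classify_http_status(status: u16, body: &str) -> ProbeResult {
    match status {
        200..=299 => ProbeResult::Success,
        300..=399 => ProbeResult::Warning(format!(
            "Probe terminated redirects, Response body: {}",
            body
        )),
        _ => ProbeResult::Failure,
    }
}
```

With this split, a local redirect to a failing endpoint is followed and then classified as Failure. A cross-host redirect stops at its 3xx and comes back as Warning.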
A pod in the process of terminating must report Ready=False while its containers drain, and its liveness probes must be disabled so that a container failing its health check during shutdown (e.g. an app that removes its health file on SIGTERM) is not restarted mid-termination.
- On entering TerminatingPod, persist Ready=False / ContainersReady=False (and clear per-container ready flags) BEFORE the blocking container stop, so watchers observe the readiness flip during the grace period.
- check_liveness short-circuits to "no restart" when deletionTimestamp is set.
Mirrors upstream status_manager (terminating pod is NotReady) + prober_manager (probes stopped on pod deletion).
Fixes the sidecar conformance specs "should mark readiness on pods to false while pod is in progress of terminating when a pod has a readiness probe" and "should mark readiness on pods to false and disable liveness probes while pod is in progress of terminating".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
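The ordering requirement (persist NotReady, then block on stop) and the liveness short-circuit can be sketched as follows. Pod, PodStatus, mark_not_ready and enter_terminating are hypothetical simplifications. The persist and stop_containers closures stand in for the status write and the blocking container stop.

```rust
/// Hypothetical, simplified pod status fields.
struct PodStatus {
    ready: bool,
    containers_ready: bool,
    container_ready: Vec<bool>,
}

struct Pod {
    deletion_timestamp: Option<u64>, // set once the pod is being deleted
    status: PodStatus,
}

fn mark_not_ready(status: &mut PodStatus) {
    status.ready = false;
    status.containers_ready = false;
    for r in status.container_ready.iter_mut() {
        *r = false;
    }
}

/// Entering TerminatingPod: persist the readiness flip first, then run the
/// blocking stop, so watchers see Ready=False during the grace period.
fn enter_terminating(pod: &mut Pod, mut persist: impl FnMut(&PodStatus), mut stop_containers: impl FnMut()) {
    mark_not_ready(&mut pod.status);
    persist(&pod.status);
    stop_containers();
}

/// Returns true if the container should be restarted. A pod with a deletion
/// timestamp never restarts, regardless of the probe result.
fn check_liveness(pod: &Pod, probe_failed: bool) -> bool {
    if pod.deletion_timestamp.is_some() {
        return false;
    }
    probe_failed
}
```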
024032d to 01439db
A pod with spec.os.name set to a non-linux value scheduled onto a Linux node is now rejected before any container starts: Phase=Failed, reason=PodOSNotSupported. Mirrors upstream kubelet's PodOS admit handler.
Fixes node-conformance "PodOSRejection should reject pod when the node OS doesn't match pod's OS".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
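A minimal sketch of the admit check described above, assuming a hypothetical AdmitResult type and a hypothetical admit_pod_os function. Only the PodOSNotSupported reason comes from the commit; the message text is illustrative.

```rust
#[derive(Debug, PartialEq)]
enum AdmitResult {
    Admit,
    Reject { reason: &'static str, message: String },
}

/// Rejects a pod whose spec.os.name is set and differs from the node's OS.
/// A pod without spec.os is admitted.
fn admit_pod_os(pod_os: Option<&str>, node_os: &str) -> AdmitResult {
    match pod_os {
        Some(os) if os != node_os => AdmitResult::Reject {
            reason: "PodOSNotSupported",
            message: format!("pod OS {:?} does not match node OS {:?}", os, node_os),
        },
        _ => AdmitResult::Admit,
    }
}
```

A caller on a Linux node could pass std::env::consts::OS, which is "linux" there, as node_os. A Reject result would then be turned into Phase=Failed before any container starts.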
A hostNetwork pod's container gets the container runtime's own /etc/hosts, which lacks the pod's spec.hostAliases. When hostAliases are set, build a managed /etc/hosts from the node's /etc/hosts plus the alias entries and bind it into the container (upstream ensureHostsFile useHostNetwork branch). Pods without hostAliases keep the default file.
Fixes node-conformance "Kubelet ... with hostAliases and hostNetwork should write entries to /etc/hosts when hostNetwork is enabled".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
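Building the managed file is plain string assembly. This sketch assumes a hypothetical HostAlias struct and a hypothetical managed_hosts_file function that returns None when there are no aliases, so the runtime's default file is kept. The "# Entries added by HostAliases." comment line and the tab-separated layout follow upstream's format as best understood and should be treated as an assumption. Writing the file and bind-mounting it are out of scope here.

```rust
/// Hypothetical mirror of a pod's spec.hostAliases entry.
struct HostAlias {
    ip: String,
    hostnames: Vec<String>,
}

/// Returns the managed /etc/hosts contents for a hostNetwork pod, or None if
/// the pod has no hostAliases (keep the runtime's default file).
fn managed_hosts_file(node_hosts: &str, aliases: &[HostAlias]) -> Option<String> {
    if aliases.is_empty() {
        return None;
    }
    // Start from the node's own /etc/hosts, ensuring it ends with a newline.
    let mut out = String::from(node_hosts);
    if !out.is_empty() && !out.ends_with('\n') {
        out.push('\n');
    }
    out.push_str("\n# Entries added by HostAliases.\n");
    // One line per alias: IP, then all hostnames, tab-separated.
    for a in aliases {
        out.push_str(&a.ip);
        out.push('\t');
        out.push_str(&a.hostnames.join("\t"));
        out.push('\n');
    }
    Some(out)
}
```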
…n restart
When a liveness or startup probe fails and triggers a restart, the kubelet stopped the pod with the POD's terminationGracePeriodSeconds. With a long pod grace (e.g. 500s), the sandbox/pause container stop blocked for the full grace, so the restart never completed within the probe test's observation window and restartCount stayed 0. Upstream uses the failing probe's own terminationGracePeriodSeconds to kill the container.
check_liveness / evaluate_container_liveness now return the effective grace (probe-level terminationGracePeriodSeconds, falling back to the pod's, then 30) instead of a bool, and the liveness-restart path stops the pod with that grace.
Fixes node-conformance "Probing container should override timeoutGracePeriodSeconds when LivenessProbe/StartupProbe field is set".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
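The fallback chain described above (probe, then pod, then 30) maps naturally onto Option combinators. This is a sketch, not the PR's code. effective_grace_seconds and liveness_restart_grace are hypothetical names, and Option<i64> as the "restart with this grace, or don't restart" return type is an assumption about how a non-bool result might be shaped.

```rust
/// Effective grace for a probe-triggered kill: probe-level
/// terminationGracePeriodSeconds, else the pod's, else 30 seconds.
fn effective_grace_seconds(probe_grace: Option<i64>, pod_grace: Option<i64>) -> i64 {
    probe_grace.or(pod_grace).unwrap_or(30)
}

/// Instead of a bool, report the grace to use when a restart is needed.
/// None means "no restart".
fn liveness_restart_grace(probe_failed: bool, probe_grace: Option<i64>, pod_grace: Option<i64>) -> Option<i64> {
    probe_failed.then(|| effective_grace_seconds(probe_grace, pod_grace))
}
```

With a 500s pod grace and a 5s probe-level grace, a failing probe would yield Some(5). The stop then no longer waits out the pod's full 500 seconds.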
Node-conformance SidecarContainers cluster. Restartable init containers (initContainers with restartPolicy=Always, i.e. sidecars) were never probed, restarted, marked ready, or even accepted with a readinessProbe. Three faithful-to-upstream fixes:
- runtime.rs (check_liveness → evaluate_container_liveness, shared with regular containers): a sidecar's failed liveness probe stops just that container; reconcile_container_restarts + has_terminated_containers now include sidecars, so it is recreated with CrashLoopBackOff and its restartCount lands in init_container_statuses. Per-container restart, not whole-pod (upstream kuberuntime_manager.computePodActions). Verified live: a sidecar with a failing liveness probe restarts individually (count climbs), the pod stays Running, and the main container is untouched.
- runtime.rs get_init_container_statuses + kubelet.rs ContainersReady: a running sidecar's ready flag comes from its readiness probe (initialDelay + threshold); sidecar readiness counts toward the pod's ContainersReady (upstream status/generate.go).
- common/src/validation/pod.rs: allow readinessProbe/lifecycle on restartable init containers (previously forbidden for all init containers; upstream forbids them only without restartPolicy=Always). Verified live: a sidecar pod with a readinessProbe is now accepted (it was rejected at creation).
Mirrors upstream pkg/kubelet/prober (probes append(Containers, restartableInits)), kuberuntime_manager.computePodActions, status/generate.go, and validateInitContainers.
Full focused-conformance validation was blocked by host mount-table exhaustion on the dev box (unrelated infra); the three behaviors are each verified live in isolation. Remaining #56 item: timeoutGracePeriodSeconds override spec.
🤖 Generated with Claude Code