
docs(runbook): k8s upgrade post-upgrade validation (PGM-193)#29

Merged
pgmac merged 1 commit into main from paulymac/pgm-193-k8s-upgrade-validation-runbook
May 23, 2026



@pgmac pgmac commented May 23, 2026

Summary

  • Adds src/runbooks/k8s-upgrade-validation.md, documenting the two failure modes that can silently persist after a rolling microk8s upgrade
  • Failure Mode 1: Stale EndpointSlice IPs. The endpoint controller misses pod IP changes during its restart window, so services route to dead IPs. Includes detection, a bulk recovery script (delete and recreate the slice), and a kube-controller-manager (kcm) leader restart procedure
  • Failure Mode 2: ingress-nginx stale Lua backend cache. Handled automatically by the ingress-validate play; a manual procedure is also documented
  • Full post-upgrade checklist covering nodes, endpoints, ingress, ArgoCD, components, and non-running pods
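
For reviewers who want a feel for what detecting Failure Mode 1 involves, here is a minimal sketch. It is not the runbook's own detection command. It assumes the JSON comes from `kubectl get endpointslices -A -o json` and `kubectl get pods -A -o json`, and the function name `stale_endpoint_addresses` is hypothetical. It flags any pod-backed EndpointSlice address that no running pod currently holds.

```python
def stale_endpoint_addresses(endpoint_slices: dict, pods: dict) -> dict:
    """Map EndpointSlice name -> addresses that no Running pod currently owns.

    Both arguments are parsed `kubectl ... -o json` list objects
    (an assumption of this sketch, not something the runbook prescribes).
    """
    # Collect every IP that belongs to a pod in the Running phase.
    live_ips = set()
    for pod in pods.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Running":
            continue
        if status.get("podIP"):
            live_ips.add(status["podIP"])
        for entry in status.get("podIPs", []):
            live_ips.add(entry["ip"])

    stale = {}
    for sl in endpoint_slices.get("items", []):
        name = sl["metadata"]["name"]
        # `endpoints` can be null on an empty slice, hence the `or []`.
        for ep in sl.get("endpoints") or []:
            # Only check endpoints backed by a Pod; skip external or manual ones.
            target = ep.get("targetRef") or {}
            if target.get("kind") != "Pod":
                continue
            for addr in ep.get("addresses", []):
                if addr not in live_ips:
                    stale.setdefault(name, []).append(addr)
    return stale
```

An empty result suggests no pod-backed slice points at a dead IP. Any entry names a slice that is a candidate for the recovery steps in the runbook. Slice names are not qualified by namespace here, so a real check across all namespaces would want to key on namespace and name together.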

Test plan

  • MkDocs build passes with no warnings
  • Runbook renders correctly in local mkdocs serve
  • Links to PIR and other runbooks resolve

Linear: PGM-193

🤖 Generated with Claude Code

Document the two failure modes (stale EndpointSlice IPs, ingress-nginx Lua
backend cache) that can silently persist after a rolling microk8s upgrade,
with detection commands, recovery options, and a full post-upgrade checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pgmac pgmac merged commit c5ed9e5 into main May 23, 2026
1 check passed
@pgmac pgmac deleted the paulymac/pgm-193-k8s-upgrade-validation-runbook branch May 23, 2026 12:37