docs(runbook): add calico-cni-unauthorized runbook (PGM-209)#30

Merged: pgmac merged 1 commit into main from paulymac/pgm-209-calico-cni-unauthorized-runbook on May 24, 2026
pgmac (Contributor) commented May 24, 2026

Summary

  • Adds src/runbooks/calico-cni-unauthorized.md, a runbook for the failure mode where an expired or wrong-ServiceAccount (wrong-SA) JWT in calico-kubeconfig blocks pod scheduling on a node
  • Covers both root causes (expired JWT, wrong SA after upgrade), detection, recovery, and prevention
  • Updates runbook index
  • Links to companion Nagios check: ansible PR #167 (PGM-208)

Observed twice: PGM-204 (wrong SA after v3.13→v3.29 upgrade) and 2026-05-24 (expired JWT blocking k8s01 for 25h with 6 Jiva replicas stuck ContainerCreating).

Test plan

  • Verify runbook renders in MkDocs (check workflow passes)
  • Confirm index link resolves to the new document

🤖 Generated with Claude Code

Documents the failure mode where an expired or wrong-SA JWT in
calico-kubeconfig blocks all new pod scheduling on a node with
Unauthorized errors. Observed twice: PGM-204 (wrong SA after upgrade)
and 2026-05-24 (expired JWT blocking k8s01 for 25h).

Covers detection, root causes, recovery (delete calico-node pod), and
prevention via Nagios check_calico_kubeconfig and calico-upgrade.yml
Phase 4 verification.
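
As a rough illustration of the prevention side, the sketch below maps a token's remaining lifetime to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not the actual check_calico_kubeconfig implementation; the function name `token_status` and the one-day default warning threshold are assumptions made for this example.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def token_status(exp, now=None, warn_seconds=86400):
    """Map a JWT 'exp' claim (Unix seconds) to a (code, message) pair.

    warn_seconds is a hypothetical threshold, not a value from the runbook.
    """
    now = time.time() if now is None else now
    if exp is None:
        return UNKNOWN, "UNKNOWN: token has no exp claim"
    remaining = int(exp - now)
    if remaining <= 0:
        return CRITICAL, f"CRITICAL: token expired {-remaining}s ago"
    if remaining < warn_seconds:
        return WARNING, f"WARNING: token expires in {remaining}s"
    return OK, f"OK: token valid for {remaining}s"
```

A wrapper script would print the message and exit with the returned code so that Nagios can alert before the token lapses rather than after pods are already stuck.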

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pgmac merged commit 22c333f into main on May 24, 2026
1 check passed
pgmac deleted the paulymac/pgm-209-calico-cni-unauthorized-runbook branch on May 24, 2026 at 04:32