
feat(k8s): Phase 4 — workload cost attribution + health audit #152

Merged
rafeegnash merged 2 commits into master from feat/cost-k8s-phase4 on Apr 30, 2026

Conversation

@rafeegnash
Collaborator

Summary

Phase 4 ships the two largest remaining K8s gaps from the Tier-2 backlog:

  • K13 — `clanker k8s cost`: per-workload cost attribution. Pod share = max(cpu_request / node_alloc_cpu, mem_request / node_alloc_mem); pod cost = node_hourly × share (the standard Kubecost-style model). Ships a static AWS on-demand fallback price table; `--prices <file>` lets operators plug in their own table, falling back to the static one on misses. Aggregations: `--by pod|workload|namespace|node`.
  • K11 — `clanker k8s workloads audit`: cluster-wide health rollup. Classifies issues into CrashLoopBackOff / OOMKilled / ImagePullBackOff / RestartSpike / NotReady / NodePressure / Other, and surfaces a "Hot pods" list of the top 25 pods with the most issues. Wraps the existing `DiagnosticsManager.DetectClusterIssues`, so it inherits all the per-resource detectors.
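
The cost model above can be sketched in a few lines. This is an illustrative reduction, not the PR's actual code: `podShare` and `podHourlyCost` are hypothetical names, and the units (CPU in millicores, memory in bytes) are assumed.

```go
package main

import (
	"fmt"
	"math"
)

// podShare implements the Kubecost-style attribution described above:
// a pod's share of its node is the larger of its CPU-request fraction
// and its memory-request fraction of the node's allocatable capacity.
func podShare(cpuReqMilli, nodeCPUMilli, memReqBytes, nodeMemBytes int64) float64 {
	if nodeCPUMilli == 0 || nodeMemBytes == 0 {
		return 0 // broken node reports zero allocatable: attribute nothing
	}
	cpuFrac := float64(cpuReqMilli) / float64(nodeCPUMilli)
	memFrac := float64(memReqBytes) / float64(nodeMemBytes)
	return math.Max(cpuFrac, memFrac)
}

// podHourlyCost bills the pod its share of the node's hourly price.
func podHourlyCost(nodeHourly, share float64) float64 {
	return nodeHourly * share
}

func main() {
	// 500m CPU on a 2-core node (25%) vs 1GiB on a 16GiB node (~6%):
	// the CPU fraction dominates, so the pod pays 25% of the node.
	share := podShare(500, 2000, 1<<30, 16<<30)
	fmt.Printf("share=%.2f cost=%.4f\n", share, podHourlyCost(0.096, share))
}
```

Taking the max (rather than the sum) of the two fractions prevents double-billing a pod that is bounded by only one resource dimension.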

Commits

  • `feat(k8s): clanker k8s cost — per-workload cost attribution`
  • `feat(k8s): clanker k8s workloads audit — health rollup`

Test plan

  • `make ci` (fmt → vet → test-short → build) passes
  • New `internal/k8s/cost/` (3 files): attributor + price lookups + quantity parsers — 11 tests
  • New `internal/k8s/sre/workload_health.go` — 4 tests (healthy, classification, hot-pods trim, classifyIssue table-driven)
  • New `cmd/k8s_cost_print_test.go` and `cmd/k8s_workloads_audit_print_test.go` — 9 tests
  • Smoke against a real cluster: `clanker k8s cost --by workload` and `clanker k8s workloads audit`

nash added 2 commits April 29, 2026 14:54
Estimate Kubernetes spend by attributing each pod's share of its host
node's hourly price. Pod share = max(cpu_request / node_alloc_cpu,
mem_request / node_alloc_mem); pod cost = node_hourly * share. This is
the standard Kubecost-style model.

  • new internal/k8s/cost package — WorkloadCostAttributor with
    aggregations by pod, workload, namespace, and node
  • static AWS on-demand price fallback (m5/c5/r5/t3 + i variants);
    operators with real billing data plug in via --prices <file> and
    fall back to the static table on misses (CompositePriceLookup)
  • CPU/memory quantity parsers cover all standard k8s suffixes
    (m, Ki/Mi/Gi, k/M/G, plain bytes)
  • ReplicaSet pods unwrap to their parent Deployment by stripping the
    pod-template hash; Job pods named with a trailing unix-ts unwrap to
    their CronJob; orphan pods report kind=Pod
  • clanker k8s cost subcommand: --by pod|workload|namespace|node,
    --prices override, --top N, JSON output
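
The owner unwrapping could look roughly like the sketch below. The function names and the exact suffix patterns are assumptions (the real pod-template-hash alphabet is narrower than `[a-z0-9]`, and the timestamp width is assumed 9-11 digits), not the PR's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// hashSuffix approximates the pod-template hash that a Deployment
// appends to its ReplicaSet names, e.g. "web-7d4b9c8f6d".
var hashSuffix = regexp.MustCompile(`-[a-z0-9]{5,10}$`)

// tsSuffix matches a trailing unix-timestamp segment on Job names
// created by a CronJob, e.g. "backup-1714458000".
var tsSuffix = regexp.MustCompile(`-\d{9,11}$`)

// deploymentFromReplicaSet strips the hash: "web-7d4b9c8f6d" -> "web".
func deploymentFromReplicaSet(rs string) string {
	if m := hashSuffix.FindStringIndex(rs); m != nil && m[0] > 0 {
		return rs[:m[0]]
	}
	return rs
}

// cronJobFromJob strips the timestamp: "backup-1714458000" -> "backup".
// Names without one (standalone Jobs) are returned unchanged.
func cronJobFromJob(job string) string {
	if m := tsSuffix.FindStringIndex(job); m != nil && m[0] > 0 {
		return job[:m[0]]
	}
	return job
}

func main() {
	fmt.Println(deploymentFromReplicaSet("web-7d4b9c8f6d")) // web
	fmt.Println(cronJobFromJob("backup-1714458000"))        // backup
}
```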

Read-only — only kubectl get is invoked. Tests cover happy path,
error propagation, divide-by-zero on broken nodes, owner-ref unwrap
edge cases, and provider detection.
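
A minimal sketch of the quantity parsing described above, assuming illustrative function names (the PR's real parsers, like upstream `resource.Quantity`, handle a richer grammar):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemory converts a memory quantity to bytes, covering the binary
// (Ki/Mi/Gi), decimal (k/M/G), and plain-bytes forms.
func parseMemory(s string) (int64, error) {
	suffixes := []struct {
		suf string
		mul int64
	}{
		// binary suffixes first so "Mi" wins over "M", "Gi" over "G"
		{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30},
		{"k", 1_000}, {"M", 1_000_000}, {"G", 1_000_000_000},
	}
	for _, sf := range suffixes {
		if strings.HasSuffix(s, sf.suf) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, sf.suf), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * sf.mul, nil
		}
	}
	return strconv.ParseInt(s, 10, 64) // no suffix: plain bytes
}

// parseCPUMilli converts a CPU quantity to millicores:
// "250m" -> 250, "2" -> 2000, "0.5" -> 500.
func parseCPUMilli(s string) (int64, error) {
	if strings.HasSuffix(s, "m") {
		return strconv.ParseInt(strings.TrimSuffix(s, "m"), 10, 64)
	}
	cores, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	return int64(cores * 1000), nil
}

func main() {
	mem, _ := parseMemory("512Mi")
	cpu, _ := parseCPUMilli("250m")
	fmt.Println(mem, cpu) // 536870912 250
}
```

Ordering the binary suffixes before the decimal ones matters: a naive check would match the `M` in `512Mi` and misparse it.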
One-shot "what's broken in this cluster" report. Wraps the existing
DiagnosticsManager.DetectClusterIssues and classifies each issue into
reliability categories, so operators see category counts first and the
long flat issue list second.

Categories:
  • CrashLoopBackOff
  • OOMKilled  (matches both "OOMKilled" and "OOM killed" wordings)
  • ImagePullBackOff / ErrImagePull
  • RestartSpike (≥5 restarts)
  • NotReady
  • NodePressure (Memory/Disk/PID/NetworkUnavailable)
  • Other

Output also surfaces "Hot pods" — the top 25 pods with the most issues,
flagged with their categories so operators know where to look first.

Read-only — only kubectl get is invoked. Tests cover the healthy-cluster
empty case, classification across all categories, and the HotPods
top-N trim.
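The classification step could be sketched as below. This is a substring-matching illustration under assumed message wordings; the PR's actual classifier is table-driven and may match structured fields instead.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyIssue maps a raw diagnostic message to one of the audit
// categories listed above. Case-insensitive matching is used where the
// commit notes multiple wordings ("OOMKilled" vs "OOM killed").
func classifyIssue(msg string) string {
	lower := strings.ToLower(msg)
	switch {
	case strings.Contains(msg, "CrashLoopBackOff"):
		return "CrashLoopBackOff"
	case strings.Contains(lower, "oomkilled"), strings.Contains(lower, "oom killed"):
		return "OOMKilled"
	case strings.Contains(msg, "ImagePullBackOff"), strings.Contains(msg, "ErrImagePull"):
		return "ImagePullBackOff"
	case strings.Contains(lower, "restart"):
		return "RestartSpike" // upstream detector flags >=5 restarts
	case strings.Contains(msg, "NotReady"), strings.Contains(lower, "not ready"):
		return "NotReady"
	case strings.Contains(msg, "MemoryPressure"), strings.Contains(msg, "DiskPressure"),
		strings.Contains(msg, "PIDPressure"), strings.Contains(msg, "NetworkUnavailable"):
		return "NodePressure"
	default:
		return "Other"
	}
}

func main() {
	for _, m := range []string{
		"pod web-0 is in CrashLoopBackOff",
		"container OOM killed",
		"node has DiskPressure",
	} {
		fmt.Printf("%-35s -> %s\n", m, classifyIssue(m))
	}
}
```

Anything that falls through every case lands in Other, so the category counts always sum to the total issue count.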
@rafeegnash rafeegnash merged commit 40bf4f8 into master Apr 30, 2026
5 checks passed
@rafeegnash rafeegnash deleted the feat/cost-k8s-phase4 branch April 30, 2026 06:50
