diff --git a/config/telemetry/alerts/resources-manager/projects.yaml b/config/telemetry/alerts/resources-manager/projects.yaml
index e6bb2dec..fa5b79ff 100644
--- a/config/telemetry/alerts/resources-manager/projects.yaml
+++ b/config/telemetry/alerts/resources-manager/projects.yaml
@@ -15,5 +15,6 @@ spec:
         severity: critical
         slo_violation: "true"
       annotations:
+        runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
         summary: "Project {{ $labels.resource_name }} is stuck creating for over 60 seconds"
-        description: "Project {{ $labels.resource_name }} has been in creation state for {{ $value }} seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
\ No newline at end of file
+        description: "Project {{ $labels.resource_name }} has been in creation state for {{ $value }} seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
diff --git a/docs/runbooks/project-stuck-creating-slo-violation.md b/docs/runbooks/project-stuck-creating-slo-violation.md
new file mode 100644
index 00000000..10fc8459
--- /dev/null
+++ b/docs/runbooks/project-stuck-creating-slo-violation.md
@@ -0,0 +1,120 @@
+# ProjectStuckCreatingSLOViolation
+
+## What This Alert Means
+
+A project has been in a "creating" state for more than 60 seconds without
+reaching a "Ready" status. This exceeds the service level objective (SLO) for
+project creation and indicates something is preventing the project from being
+fully provisioned.
+
+The alert fires per-project, so multiple alerts may fire simultaneously if
+several projects are affected.
+
+## Impact
+
+Users who created the affected project(s) are waiting longer than expected.
+The project may not be usable until it reaches a Ready state.
+
+## Investigation Steps
+
+### 1. Identify the affected project
+
+The alert labels include `resource_name`, which identifies the project that is
+stuck. Note this name for use in subsequent steps.
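+
+If several alerts fire at once, it can help to survey all projects in one
+pass. The following is a sketch, not part of the controller's documented
+interface: it assumes the Project resource is registered as `projects` and
+exposes a condition of type `Ready` under `.status.conditions`, as the alert
+wording suggests.
+
+```sh
+# Print "<name> <Ready status>" for every project; the "Ready" condition
+# type is an assumption — adjust it to match the Project CRD.
+kubectl get projects -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
+```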
+
+### 2. Check the project status
+
+Use `kubectl` to inspect the project resource and its status conditions:
+
+```sh
+kubectl get project <project-name> -o yaml
+```
+
+Look at `.status.conditions` for any condition with `status: "False"` or a
+`reason` and `message` that explain what is failing.
+
+### 3. Check controller manager logs
+
+The `milo-controller-manager` is responsible for reconciling projects. Check its
+logs for errors related to the affected project:
+
+```sh
+kubectl logs -l app=milo-controller-manager --tail=200 | grep <project-name>
+```
+
+Look for:
+- **Permission errors** (e.g., RBAC forbidden): The controller may lack
+  permissions to create dependent resources.
+- **Resource creation failures**: Errors when creating namespaces,
+  ProjectControlPlane resources, or other dependent objects.
+- **OOMKilled or CrashLoopBackOff**: The controller pod itself may be
+  unhealthy.
+
+### 4. Check controller pod health
+
+Verify the controller manager pod is running and not restarting:
+
+```sh
+kubectl get pods -l app=milo-controller-manager
+```
+
+If the pod is restarting, check its resource limits and recent events:
+
+```sh
+kubectl describe pod -l app=milo-controller-manager
+```
+
+### 5. Check for upstream dependencies
+
+Project creation depends on several subsystems. Verify these are healthy:
+- **ProjectControlPlane** resources are being created and reconciled.
+- **Authorization system** (e.g., OpenFGA) is reachable and responding.
+- **Infrastructure cluster** connectivity is functioning.
+
+### 6. Check for resource conflicts
+
+If multiple controllers or deployment systems manage overlapping resources
+(e.g., ClusterRoles, ConfigMaps), one may overwrite changes made by another.
+Check for recent changes to RBAC resources:
+
+```sh
+kubectl get clusterrole -l app=milo-controller-manager -o yaml
+```
+
+Look for unexpected annotations or labels that indicate a different system is
+managing the same resource.
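+
+### 7. Check Kubernetes events
+
+Events often record the precise failure (quota exhaustion, admission webhook
+denial, scheduling problems) that controller logs only summarize. This is a
+sketch: `<project-name>` is a placeholder for the affected project, and `-A`
+is used because the namespace holding the dependent objects may not be known
+in advance.
+
+```sh
+# Show recent events referencing the stuck project, oldest first.
+kubectl get events -A --field-selector involvedObject.name=<project-name> --sort-by=.lastTimestamp
+```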
+
+## Common Causes
+
+| Cause | Indicators |
+|---|---|
+| RBAC permission errors | "forbidden" errors in controller logs |
+| Controller OOM crashes | Pod restarts, OOMKilled events |
+| Authorization service unavailable | Timeout or connection errors in logs |
+| Resource ownership conflicts | Oscillating resource annotations/labels |
+| High reconciliation backlog | Many projects stuck simultaneously, controller processing slowly |
+
+## Resolution
+
+Resolution depends on the root cause identified above:
+
+- **Permission errors**: Verify and restore the correct RBAC configuration for
+  the controller.
+- **Controller crashes**: Increase memory limits or investigate the source of
+  excessive memory consumption.
+- **Service unavailability**: Restore connectivity to dependent services.
+- **Resource conflicts**: Ensure each deployment system manages uniquely named
+  resources to avoid collisions.
+
+After resolving the underlying issue, affected projects should automatically
+reconcile and reach a Ready state. Monitor the alert to confirm it resolves.
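+
+If projects remain stuck even after the root cause is fixed, restarting the
+controller forces a fresh reconciliation pass over all projects. The
+deployment name below is an assumption inferred from the
+`app=milo-controller-manager` pod label used earlier in this runbook; adjust
+it to match the actual Deployment in your cluster.
+
+```sh
+# Restart the controller and wait for the rollout to complete.
+kubectl rollout restart deployment/milo-controller-manager
+kubectl rollout status deployment/milo-controller-manager
+```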
+
+## Escalation
+
+If the alert persists after investigation and you cannot identify the root cause,
+escalate to the platform engineering team with the following information:
+
+- The affected project name(s)
+- Controller manager logs from the time of the alert
+- Status of the controller manager pod(s)
+- Any error messages found during investigation
diff --git a/test/prometheus-rules/resources-manager/projects/projects-slo-rules.yaml b/test/prometheus-rules/resources-manager/projects/projects-slo-rules.yaml
index 07c55622..8ea9b413 100644
--- a/test/prometheus-rules/resources-manager/projects/projects-slo-rules.yaml
+++ b/test/prometheus-rules/resources-manager/projects/projects-slo-rules.yaml
@@ -11,5 +11,6 @@ groups:
         severity: critical
         slo_violation: "true"
       annotations:
+        runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
         summary: "Project {{ $labels.resource_name }} is stuck creating for over 60 seconds"
-        description: "Project {{ $labels.resource_name }} has been in creation state for {{ $value }} seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
\ No newline at end of file
+        description: "Project {{ $labels.resource_name }} has been in creation state for {{ $value }} seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
diff --git a/test/prometheus-rules/resources-manager/projects/projects-slo-tests.yaml b/test/prometheus-rules/resources-manager/projects/projects-slo-tests.yaml
index 8aac57dd..637b83de 100644
--- a/test/prometheus-rules/resources-manager/projects/projects-slo-tests.yaml
+++ b/test/prometheus-rules/resources-manager/projects/projects-slo-tests.yaml
@@ -22,6 +22,7 @@ tests:
             severity: critical
             slo_violation: "true"
             resource_name: test-project
           exp_annotations:
+            runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
             summary: "Project test-project is stuck creating for over 60 seconds"
             description: "Project test-project has been in creation state for 120 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
@@ -72,6 +73,7 @@ tests:
             severity: critical
             slo_violation: "true"
             resource_name: stuck-project
           exp_annotations:
+            runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
             summary: "Project stuck-project is stuck creating for over 60 seconds"
             description: "Project stuck-project has been in creation state for 90 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
@@ -97,5 +99,6 @@ tests:
             slo_violation: "true"
             resource_name: multi-stuck-project
           exp_annotations:
+            runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
             summary: "Project multi-stuck-project is stuck creating for over 60 seconds"
-            description: "Project multi-stuck-project has been in creation state for 150 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
\ No newline at end of file
+            description: "Project multi-stuck-project has been in creation state for 150 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."