RCA Operator for Kubernetes
Cluster-native incident detection, durable incident state, CRD-driven correlation rules, notifications, and dashboarding
RCA Operator is a Kubernetes-native incident detection operator that:
- collects failure signals from native Kubernetes APIs (pods, events, nodes, deployments)
- evaluates CRD-driven correlation rules (RCACorrelationRule) to detect multi-signal incidents
- persists durable incident state in IncidentReport CRDs
- manages incident lifecycle: Detecting → Active → Resolved
- notifies humans via Slack and PagerDuty from incident lifecycle state
- serves a built-in dashboard (light/dark theme) backed only by IncidentReport and RCAAgent CRDs
The operator avoids AI systems, external databases, and log-scraping dependencies so it stays easy to run and reason about in-cluster.
More detail lives in Architecture and Phase 2 Release Notes.
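
The Detecting → Active → Resolved lifecycle listed above can be pictured as a small state machine. The following is a minimal sketch in Python, not the operator's Go implementation; the `Phase` enum, the `ALLOWED` table, and the `advance` helper are hypothetical names, and the strictly forward-only transitions are an assumption made for illustration.

```python
from enum import Enum


class Phase(Enum):
    DETECTING = "Detecting"
    ACTIVE = "Active"
    RESOLVED = "Resolved"


# Assumption: phases only move forward in the order the README lists them.
# The real operator may allow other transitions (for example, reopening).
ALLOWED = {
    Phase.DETECTING: {Phase.ACTIVE},
    Phase.ACTIVE: {Phase.RESOLVED},
    Phase.RESOLVED: set(),
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

For example, `advance(Phase.DETECTING, Phase.ACTIVE)` returns `Phase.ACTIVE`, while trying to move a resolved incident back to `Active` raises `ValueError` under this sketch's assumptions.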
| Feature | Description |
|---|---|
| Native Kubernetes signal collection | Reads pod, event, node, and workload state from Kubernetes (Deployments, StatefulSets, DaemonSets, Jobs, CronJobs) |
| CRD-driven correlation rules | RCACorrelationRule CRDs define multi-signal rules — no Go code changes needed |
| Automatic rule detection | Mines the correlation buffer for recurring signal patterns and auto-creates RCACorrelationRule CRDs |
| Durable incident records | Deduplicates repeated signals into one IncidentReport per fingerprint |
| Incident lifecycle | Tracks Detecting, Active, and Resolved phases |
| Notifications | Sends Slack and PagerDuty notifications and emits Kubernetes events |
| Dashboard | Built-in incident dashboard with light/dark theme toggle, workload + service topology views, and an inline Jaeger trace detail modal (no Jaeger UI hop) |
| Retention | Automatically prunes old resolved incidents |
| OpenTelemetry | Optional OTLP trace export for the operator's own spans |
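
The "one IncidentReport per fingerprint" behaviour in the table can be illustrated with a short Python sketch. This is not the operator's code: the `Signal` fields, the hash-based `fingerprint` function, and the in-memory `store` dictionary are all assumptions chosen to show the idea of deduplication, since the README does not specify how fingerprints are computed.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Signal:
    kind: str        # e.g. "CrashLoopBackOff"
    namespace: str
    resource: str    # e.g. "pod/api"


@dataclass
class Incident:
    fingerprint: str
    signals: list = field(default_factory=list)


def fingerprint(sig: Signal) -> str:
    # Hypothetical fingerprint: a short hash of kind, namespace, and resource.
    raw = f"{sig.kind}|{sig.namespace}|{sig.resource}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


def record(store: dict, sig: Signal) -> Incident:
    """Attach sig to the incident for its fingerprint, creating one if needed."""
    fp = fingerprint(sig)
    incident = store.setdefault(fp, Incident(fingerprint=fp))
    incident.signals.append(sig)
    return incident
```

Recording the same signal twice returns the same `Incident` object with two entries in `signals`, while a signal with a different kind or resource produces a separate incident.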
curl -fsSL https://raw.githubusercontent.com/gaurangkudale/RCA-Operator/main/scripts/install.sh | bash

The installer verifies prerequisites, adds the Helm repo, installs the chart into
rca-system (creating the namespace if needed), and waits for everything to be Ready.
A starter RCAAgent is created automatically so the operator begins detecting
incidents immediately — no extra kubectl apply required.
Common overrides (set as environment variables before the curl):
| Variable | Default | Description |
|---|---|---|
| RCA_NAMESPACE | rca-system | Install namespace |
| RCA_RELEASE | rca-operator | Helm release name |
| RCA_PROFILE | full | full (operator + bundled otel-collector + Jaeger) or minimal (operator only) |
| RCA_CHART_VERSION | latest | Pin a specific chart version |
| RCA_VALUES_FILE | (none) | Path to an extra --values file |
# One repo, one install — otel-collector and Jaeger are bundled as optional
# sub-charts and enabled by default.
helm repo add rca-operator https://gaurangkudale.github.io/rca-operator.github.io/charts
helm repo update
helm upgrade --install rca-operator rca-operator/rca-operator \
--namespace rca-system --create-namespace \
--wait --timeout 10m
--wait is required: the OpenTelemetryCollector and Instrumentation CRs are applied as post-install hooks after the otel-operator webhook is confirmed Ready.
The default chart values are the full profile. For a leaner install pass
--set opentelemetryOperator.enabled=false --set jaeger.enabled=false, or use
helm/values-minimal.yaml / helm/values-external-observability.yaml from a
source checkout. See Installation.
kubectl apply -f https://github.com/gaurangkudale/RCA-Operator/releases/latest/download/install.yaml
kubectl apply -f config/samples/rca_v1alpha1_rcaagent.yaml

| Section | Description |
|---|---|
| Prerequisites | Cluster and tooling requirements |
| Installation | Helm and kubectl installation |
| Quick Start | Deploy your first agent in minutes |
| Monitor a Namespace End-to-End | Go from zero monitoring to incidents + traces for an existing multi-language namespace |
| Architecture | System design and data flow |
| Phase 2 Release Notes | What's new in the Phase 2 release |
| Production Guide | Production sizing, security, RBAC, network policy, retention, and cardinality guidance |
| Phase 1 Architecture | Historical Kubernetes-native foundation design |
| RCAAgent CRD Reference | RCAAgent schema and examples |
| IncidentReport CRD Reference | IncidentReport schema and fields |
| RCACorrelationRule CRD Reference | Correlation rule schema and examples |
| Auto-Detection | Automatic correlation rule detection |
| OTLP Ingest | In-operator OTLP/HTTP receiver for traces and logs |
| Topology Graph | Incident topology graph (K8s + trace + Jaeger enrichment) |
| Dashboard | Dashboard data model and access patterns |
| Metrics Reference | Prometheus metrics exposed by the operator |
| RBAC Reference | Permissions used by the operator |
| Local Development | Run locally against a cluster |
| Testing Guide | Unit, envtest, and e2e coverage |
| Helm Reference | Override flags, from-source install, upgrade, troubleshooting |
| Helm Upgrade Guide | CRD upgrade and migration steps |
RCAAgent is the main configuration resource. One agent can watch multiple namespaces and optionally configure notifications and retention.
kubectl get rcaagent -A
kubectl describe rcaagent <name> -n <namespace>

IncidentReport resources are created automatically for detected incidents. Each report carries the incident fingerprint, lifecycle phase, severity, affected resources, and timeline.
kubectl get incidentreport -A
kubectl describe incidentreport <name> -n <namespace>

RCACorrelationRule resources are cluster-scoped rules that define multi-signal correlation logic. Rules are loaded dynamically, so no operator restart is needed when rules change.
kubectl get rcacorrelationrules
kubectl describe rcacorrelationrule <name>

Four default rules are installed with the Helm chart (defaultRules.enabled: true):
| Rule | Trigger | Condition | Severity |
|---|---|---|---|
| node-plus-eviction | NodeNotReady | PodEvicted on same node | P1 Critical |
| crashloop-plus-oom | CrashLoopBackOff | OOMKilled on same pod | P2 High |
| crashloop-plus-deploy | CrashLoopBackOff | StalledRollout in same namespace | P2 High |
| imagepull-no-history | ImagePullBackOff | No PodHealthy on same pod | P2 High |
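
To make the trigger/condition columns concrete, here is a rough Python sketch of how such rules could be evaluated against a batch of signals. It mirrors the four default rules from the table, but the `Rule` and `Signal` data shapes, the `scope` and `negate` fields, and the `evaluate` function are hypothetical; the actual RCACorrelationRule schema and the operator's matching logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str
    namespace: str
    pod: str = ""
    node: str = ""


@dataclass(frozen=True)
class Rule:
    name: str
    trigger: str
    condition: str
    scope: str             # "pod", "node", or "namespace" (assumed field)
    severity: str
    negate: bool = False   # True: the condition signal must be absent


DEFAULT_RULES = [
    Rule("node-plus-eviction", "NodeNotReady", "PodEvicted", "node", "P1 Critical"),
    Rule("crashloop-plus-oom", "CrashLoopBackOff", "OOMKilled", "pod", "P2 High"),
    Rule("crashloop-plus-deploy", "CrashLoopBackOff", "StalledRollout", "namespace", "P2 High"),
    Rule("imagepull-no-history", "ImagePullBackOff", "PodHealthy", "pod", "P2 High", negate=True),
]


def same_scope(a: Signal, b: Signal, scope: str) -> bool:
    if scope == "namespace":
        return a.namespace == b.namespace
    if scope == "node":
        return a.node != "" and a.node == b.node
    # Pod scope: same namespace and same non-empty pod name.
    return a.namespace == b.namespace and a.pod != "" and a.pod == b.pod


def evaluate(rules, signals):
    """Return (rule name, severity) for each rule whose trigger and condition hold."""
    matches = []
    for rule in rules:
        for trig in (s for s in signals if s.kind == rule.trigger):
            found = any(
                s.kind == rule.condition and same_scope(trig, s, rule.scope)
                for s in signals
            )
            if found != rule.negate:
                matches.append((rule.name, rule.severity))
                break  # one match per rule is enough for this sketch
    return matches
```

With a `CrashLoopBackOff` and an `OOMKilled` signal on the same pod, `evaluate(DEFAULT_RULES, signals)` matches only `crashloop-plus-oom`. The `negate` flag models the "No PodHealthy" row: that rule fires only when no healthy signal exists for the same pod.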
When auto-detection is enabled (--enable-autodetect), the operator also creates rules automatically from observed signal patterns. Auto-generated rules use a fixed priority of 30 (below user rules) and are labeled rca.rca-operator.tech/auto-generated: "true". See Auto-Detection for details.
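
One way to picture mining recurring signal patterns is to count which signal kinds appear together and propose a rule once a pair recurs often enough. The Python sketch below is only an illustration under that assumption: the `propose_rules` function, the `windows` input shape, the `min_support` threshold, and the generated rule names are hypothetical. Only the priority of 30 and the `rca.rca-operator.tech/auto-generated` label come from the text above.

```python
from collections import Counter
from itertools import combinations

AUTO_PRIORITY = 30  # fixed priority for auto-generated rules, per the README
AUTO_LABEL = {"rca.rca-operator.tech/auto-generated": "true"}


def propose_rules(windows, min_support=3):
    """Propose rules for signal-kind pairs seen together in >= min_support windows.

    windows: list of sets of signal kinds observed together (assumed shape).
    min_support: hypothetical threshold, not a documented operator setting.
    """
    counts = Counter()
    for kinds in windows:
        for pair in combinations(sorted(kinds), 2):
            counts[pair] += 1

    proposals = []
    for (a, b), n in sorted(counts.items()):
        if n >= min_support:
            proposals.append({
                "name": f"auto-{a.lower()}-{b.lower()}",
                "signals": [a, b],
                "priority": AUTO_PRIORITY,
                "labels": dict(AUTO_LABEL),
                "support": n,
            })
    return proposals
```

A pair that co-occurs in three windows yields one proposal carrying priority 30 and the auto-generated label; pairs seen fewer times are ignored.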
Contributions are welcome — bug reports, docs, tests, correlation rules, or features.
- Read CONTRIBUTING.md and CODE_OF_CONDUCT.md.
- Find a good first issue on the issue tracker, or open a new one to discuss larger changes before coding.
- make lint && make test && make build must pass locally.
- Open a pull request; the PR template lists the merge checklist.
- Bug reports / feature requests — GitHub Issues
- Questions and design discussion — GitHub Discussions
- Security disclosures — see SECURITY.md; please do not open public issues for vulnerabilities.
Licensed under the MIT License. See LICENSE.
