RCA-based alert grouping: Deduplicate by root cause, not just fingerprint #7

@nomadicmehul

Summary

Enhance deduplication to group alerts that share the same root cause, even if they have different fingerprints. Currently, dedup is fingerprint-based: two alerts are merged only when they have the same alert name, service, and namespace. This misses cases where five different alerts (OOMKill, 5xx spike, latency increase, pod restart, failed health check) all stem from one bad deployment.
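For reference, the current fingerprint pass can be pictured roughly as below. This is a minimal sketch, not the project's actual code: the field names (`name`, `service`, `namespace`), the hashing scheme, and the function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(alert: dict) -> str:
    """Hypothetical fingerprint: hash of alert name + service + namespace."""
    key = "|".join([alert["name"], alert["service"], alert["namespace"]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def dedup_by_fingerprint(alerts: list) -> dict:
    """Group alerts that share a fingerprint (identical alerts only)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[fingerprint(alert)].append(alert)
    return dict(groups)


alerts = [
    {"name": "OOMKill", "service": "pod-A", "namespace": "prod"},
    {"name": "OOMKill", "service": "pod-A", "namespace": "prod"},
    {"name": "5xx spike", "service": "service-B", "namespace": "prod"},
]
groups = dedup_by_fingerprint(alerts)
# The two OOMKill alerts collapse into one group; the 5xx spike stays in its
# own group even if it has the same underlying cause.
```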

Why This Matters

Fingerprint dedup handles identical alerts. RCA-based grouping handles related alerts. This is what takes the noise from, for example, 50 alerts down to 3 actionable incidents.

Acceptance Criteria

  • After initial fingerprint dedup, run a second pass using agent reasoning
  • Agent examines top-N alert groups and identifies shared root causes
  • Merge alert groups that share root cause into a single incident
  • Track which original alerts are grouped under each incident
  • Dashboard shows "X alerts grouped into Y incidents" summary
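One possible shape for the second pass described by the criteria above, as a rough sketch. The agent is represented by a hypothetical callable, `identify_root_cause`, that returns a root-cause label or `None`; the `Incident` type, `top_n` default, and `summary` helper are likewise assumptions, not existing APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    root_cause: str
    fingerprints: list = field(default_factory=list)  # merged alert groups
    alerts: list = field(default_factory=list)        # original alerts


def group_by_root_cause(groups: dict, identify_root_cause, top_n: int = 10) -> list:
    """Merge fingerprint groups that the agent assigns to the same root cause."""
    # Largest groups first; only the top-N are sent to the agent.
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    incidents = {}
    for rank, (fp, alerts) in enumerate(ranked):
        cause = identify_root_cause(fp, alerts) if rank < top_n else None
        # Groups without an identified cause stay as their own incident.
        key = cause if cause is not None else f"ungrouped:{fp}"
        incident = incidents.setdefault(key, Incident(root_cause=key))
        incident.fingerprints.append(fp)
        incident.alerts.extend(alerts)
    return list(incidents.values())


def summary(incidents: list) -> str:
    total = sum(len(i.alerts) for i in incidents)
    return f"{total} alerts grouped into {len(incidents)} incidents"


# Stub standing in for the agent (assumption: real reasoning happens elsewhere).
def stub_agent(fp, alerts):
    return "pod-A OOMKill" if fp in {"fp-oom", "fp-5xx"} else None


groups = {
    "fp-oom": [{"name": "OOMKill", "service": "pod-A"}] * 2,
    "fp-5xx": [{"name": "5xx spike", "service": "service-B"}],
    "fp-disk": [{"name": "DiskFull", "service": "service-Z"}],
}
incidents = group_by_root_cause(groups, stub_agent)
print(summary(incidents))  # 4 alerts grouped into 2 incidents
```

Keeping `fingerprints` and `alerts` on each incident covers the tracking criterion, and `summary` produces the string the dashboard would show.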

Example

Input:  OOMKill on pod-A, 5xx on service-B (calls pod-A), latency spike on service-C (calls B)
Output: 1 incident — "pod-A OOMKill causing cascading failures to B and C"
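To make the example concrete, here is a sketch that reproduces it with a plain dependency-graph walk instead of agent reasoning. This is a simplified stand-in, not the proposed implementation: it assumes a call graph is available (`calls`, hypothetical), that each alert carries a `target` field, and it follows only the first failing dependency when there are several.

```python
from collections import defaultdict


def find_root(node: str, calls: dict, alerting: set) -> str:
    """Follow failing dependencies until reaching one with no failing dependency."""
    seen = set()
    current = node
    while current not in seen:
        seen.add(current)
        failing_deps = [d for d in calls.get(current, []) if d in alerting]
        if not failing_deps:
            return current
        current = failing_deps[0]  # simplification: first failing dependency only
    return current  # dependency cycle: stop where the walk repeated


def group_by_root(alerts: list, calls: dict) -> dict:
    alerting = {a["target"] for a in alerts}
    incidents = defaultdict(list)
    for a in alerts:
        incidents[find_root(a["target"], calls, alerting)].append(a["name"])
    return dict(incidents)


alerts = [
    {"name": "OOMKill", "target": "pod-A"},
    {"name": "5xx", "target": "service-B"},
    {"name": "latency spike", "target": "service-C"},
]
# service-B calls pod-A; service-C calls service-B.
calls = {"service-B": ["pod-A"], "service-C": ["service-B"]}

incidents = group_by_root(alerts, calls)
print(incidents)  # {'pod-A': ['OOMKill', '5xx', 'latency spike']}
```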
