Alerting rule recommendations: Suggest tuning to reduce noise #13

@nomadicmehul

Summary

After an incident is resolved, the agent should recommend alerting rule changes that prevent the same noise from recurring: threshold adjustments, grouping changes, or silencing of flapping alerts.

Why This Matters

Fixing the incident is half the battle. Fixing the alerting rules that generated the noise is how you keep that noise from coming back. This closes the feedback loop.

Examples

  • "Alert HighMemoryUsage fired 47 times this week but never led to an incident. Recommend raising threshold from 80% to 90%."
  • "Alerts PodRestart and CrashLoopBackOff always fire together. Recommend grouping into a single alert rule."
  • "Alert HighLatency on service-X fires every deploy and auto-resolves in 2 minutes. Recommend adding a 3-minute pending period."
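The third example could be derived mechanically. Below is a minimal sketch, assuming the agent already records how long each firing took to auto-resolve. `FlapStats`, `suggest_pending_minutes`, `render_before_after`, and the `expr` placeholder are all hypothetical names for illustration, not part of any existing nightops code. The only external fact relied on is that a Prometheus alerting rule expresses its pending period with the `for:` field.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FlapStats:
    """Hypothetical per-rule summary of firings that resolved on their own."""
    alert: str
    resolve_minutes: list[float] = field(default_factory=list)


def suggest_pending_minutes(stats: FlapStats, margin_minutes: float = 1.0):
    """Pick a pending period (whole minutes) longer than the slowest auto-resolve.

    Returns None when there is no data to base a suggestion on.
    """
    if not stats.resolve_minutes:
        return None
    return math.ceil(max(stats.resolve_minutes) + margin_minutes)


def render_before_after(alert: str, expr: str, old_for: str, new_for: str) -> str:
    """Render a before/after pair of Prometheus-style rule snippets as text."""
    def rule(for_value: str) -> str:
        return f"- alert: {alert}\n  expr: {expr}\n  for: {for_value}\n"

    return "# before\n" + rule(old_for) + "# after\n" + rule(new_for)


# Usage: firings that auto-resolve within 2 minutes yield a 3-minute pending period.
stats = FlapStats("HighLatency", resolve_minutes=[2.0, 1.5, 2.0])
minutes = suggest_pending_minutes(stats)
# "<latency expr>" stands in for the real rule expression, which the issue does not give.
print(render_before_after(stats.alert, "<latency expr>", "0m", f"{minutes}m"))
```

The one-minute margin is an arbitrary default. A real implementation would probably tune it per rule, or use a percentile instead of the maximum so a single slow outlier does not stretch the pending period.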

Acceptance Criteria

  • Track alert-to-incident conversion rate per alert rule
  • Identify low-signal alerts (high fire rate, low investigation rate)
  • Identify redundant alerts (always co-occur with another alert)
  • Generate specific recommendations with before/after rule YAML
  • CLI command: `nightops alerts tune` shows recommendations
  • Optional: auto-create PR to update Prometheus/Grafana alert rules
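The first three criteria could be prototyped along these lines. This is a rough sketch under stated assumptions: each firing is recorded with its rule name, a time-bucket identifier, and a flag for whether it led to an incident. The `Firing` record, the function names, and the thresholds (`min_fires`, `max_rate`, `min_windows`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Firing:
    """Hypothetical record of one alert firing."""
    rule: str
    window: str            # time bucket the firing fell into, e.g. "2024-05-01T10:05"
    led_to_incident: bool


def conversion_stats(firings):
    """Map each rule to (fire count, alert-to-incident conversion rate)."""
    fired = defaultdict(int)
    converted = defaultdict(int)
    for f in firings:
        fired[f.rule] += 1
        converted[f.rule] += int(f.led_to_incident)
    return {rule: (fired[rule], converted[rule] / fired[rule]) for rule in fired}


def low_signal(firings, min_fires=20, max_rate=0.05):
    """Rules that fire often but rarely convert into an incident."""
    stats = conversion_stats(firings)
    return sorted(
        rule for rule, (count, rate) in stats.items()
        if count >= min_fires and rate <= max_rate
    )


def redundant_pairs(firings, min_windows=3):
    """Pairs of rules that fired in exactly the same set of time windows."""
    windows = defaultdict(set)
    for f in firings:
        windows[f.rule].add(f.window)
    return [
        (a, b)
        for a, b in combinations(sorted(windows), 2)
        if windows[a] == windows[b] and len(windows[a]) >= min_windows
    ]


# Usage with data shaped like the examples in this issue.
firings = [Firing("HighMemoryUsage", f"m{i}", False) for i in range(47)]
for w in ("w1", "w2", "w3"):
    firings.append(Firing("PodRestart", w, True))
    firings.append(Firing("CrashLoopBackOff", w, True))

print(low_signal(firings))       # ['HighMemoryUsage']
print(redundant_pairs(firings))  # [('CrashLoopBackOff', 'PodRestart')]
```

Requiring identical window sets is the strictest reading of "always co-occur". A looser overlap measure, such as the share of windows two rules have in common, may suit real data better, where alerts rarely line up perfectly.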
