[ML] AI-Powered Root Cause Grouping for Security Findings #144

Description

@KolaSailaja

Summary

Implement an AI-powered Root Cause Grouping system that clusters related findings and identifies the underlying source responsible for multiple vulnerabilities, helping developers resolve issues more efficiently.

Motivation

Security scans often generate dozens or hundreds of findings that originate from the same underlying problem.

For example:

  • Multiple SQL Injection findings may originate from a single database helper utility.
  • Several secret exposure findings may stem from a common configuration file.
  • Dependency vulnerabilities may be introduced through a shared package.

Currently, PatchPilot displays findings individually, requiring users to manually investigate relationships between vulnerabilities.

A root cause grouping system would help users understand which findings share the same origin and prioritize fixes that eliminate multiple vulnerabilities at once.

Proposed solution

Introduce a grouping engine that analyzes findings and clusters them by shared characteristics, such as affected files, code locations, dependency relationships, and embedding similarity.

Inputs

  • Finding title
  • Scanner metadata
  • Affected files
  • Code locations
  • Embeddings
  • Dependency relationships
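
The inputs listed above could be carried in a single record per finding. The following is a minimal sketch only: the `Finding` class, its field names, and the `embedding_text` helper are hypothetical and are not taken from PatchPilot's existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical input record; field names are illustrative only."""
    finding_id: str
    title: str
    scanner: str                        # scanner metadata, reduced to a name here
    files: tuple[str, ...] = ()         # affected files
    locations: tuple[str, ...] = ()     # code locations, e.g. "path:line"
    dependencies: tuple[str, ...] = ()  # packages implicated in the finding
    embedding: tuple[float, ...] = ()   # filled in later by the embedding step


def embedding_text(finding: Finding) -> str:
    """Flatten the textual inputs into one string for an embedding model."""
    parts = [finding.title, finding.scanner, *finding.files, *finding.dependencies]
    return " | ".join(p for p in parts if p)
```

Keeping the embedding on the record itself is one option; storing it separately, keyed by `finding_id`, would work equally well.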

Example Output

{
  "group_id": "RCG-001",
  "root_cause": "database_helper.py",
  "findings_count": 15,
  "findings": [
    "SQL Injection",
    "Unsafe Query Construction",
    "Missing Input Validation"
  ]
}

Backend

  • Generate embeddings for findings.
  • Cluster related findings using similarity analysis.
  • Identify likely root causes.
  • Persist grouping information.
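
The backend steps above can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: embeddings are assumed to be precomputed, findings are plain dicts with hypothetical keys `title`, `file`, and `embedding`, the similarity threshold of 0.8 is arbitrary, and the root cause is guessed with a simple "most frequent file" heuristic. Clustering here means connected components over a cosine-similarity threshold; a production system might prefer a dedicated clustering library.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_findings(findings, threshold=0.8):
    """Link findings with similarity >= threshold; groups are connected components."""
    parent = list(range(len(findings)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(findings)):
        for j in range(i + 1, len(findings)):
            if cosine(findings[i]["embedding"], findings[j]["embedding"]) >= threshold:
                parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for i, finding in enumerate(findings):
        clusters[find(i)].append(finding)

    groups = []
    for n, members in enumerate(clusters.values(), start=1):
        # Heuristic (assumption): the most frequent file is the likely root cause.
        root_cause = Counter(m["file"] for m in members).most_common(1)[0][0]
        groups.append({
            "group_id": f"RCG-{n:03d}",
            "root_cause": root_cause,
            "findings_count": len(members),
            "findings": sorted({m["title"] for m in members}),
        })
    return groups
```

The pairwise comparison is quadratic in the number of findings, which is acceptable for a few hundred findings per scan but would need an approximate nearest-neighbor index at larger scale.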

Frontend

  • Add "Root Cause Groups" view.
  • Allow users to expand groups and inspect associated findings.
  • Display the number of findings resolved by addressing a root cause.
  • Provide filtering by group.

Evidence Pack Integration

Include the following files in generated Evidence Packs:

  • root-cause-groups.json
  • root-cause-summary.txt
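
As a rough sketch of how the two files might be produced, assuming groups shaped like the Example Output above: the function name `write_root_cause_reports`, the `pack_dir` parameter, and the summary line format are all hypothetical, since the issue does not specify the contents of either file.

```python
import json
from pathlib import Path


def write_root_cause_reports(groups, pack_dir):
    """Write root-cause-groups.json and root-cause-summary.txt into pack_dir."""
    pack = Path(pack_dir)
    pack.mkdir(parents=True, exist_ok=True)

    (pack / "root-cause-groups.json").write_text(
        json.dumps(groups, indent=2), encoding="utf-8"
    )

    # Largest groups first, so the highest-impact fixes lead the summary.
    ranked = sorted(groups, key=lambda g: g["findings_count"], reverse=True)
    lines = [
        f'{g["group_id"]}: {g["root_cause"]} ({g["findings_count"]} findings)'
        for g in ranked
    ]
    (pack / "root-cause-summary.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return pack
```

Sorting the summary by `findings_count` reflects the goal of prioritizing fixes that eliminate the most findings at once.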

ML tier (if applicable)

  • Tier 1 — Triage (severity ranking, deduplication, false positive classification)
  • Tier 2 — Predictive (fix success prediction, exploit scoring, pattern clustering)
  • Tier 3 — Autonomous (LLM patch generation, self-healing pipeline)
  • Not ML-related

Alternatives considered

  1. Display all findings individually.
    • Rejected because users must manually identify relationships between findings.
  2. Group findings only by scanner type.
    • Rejected because findings from different scanners may share the same root cause.
  3. Rule-based grouping only.
    • Rejected because ML-based similarity analysis can discover relationships not captured by static rules.

Acceptance criteria

  • PatchPilot groups related findings into root-cause clusters.
  • Users can view findings grouped by root cause.
  • Group information includes affected findings and likely source locations.
  • Root cause reports are included in Evidence Packs.

Additional context

This feature complements the existing Tier 1 embedding and deduplication roadmap by leveraging embeddings to identify meaningful relationships between findings. It helps developers focus on fixes with the highest remediation impact and reduces investigation effort.

Labels

ml (ML related issues)
