[ML] AI-Powered Root Cause Grouping for Security Findings #144

## Summary

Implement an AI-powered Root Cause Grouping system that clusters related findings and identifies the underlying source responsible for multiple vulnerabilities, helping developers resolve issues more efficiently.

## Motivation

Security scans often generate dozens or hundreds of findings that originate from the same underlying problem.

For example:

* Multiple SQL Injection findings may originate from a single database helper utility.
* Several secret exposure findings may stem from a common configuration file.
* Dependency vulnerabilities may be introduced through a shared package.

Currently, PatchPilot displays findings individually, requiring users to manually investigate relationships between vulnerabilities.

A root cause grouping system would help users understand which findings share the same origin and prioritize fixes that eliminate multiple vulnerabilities at once.

## Proposed solution

Introduce a grouping engine that analyzes findings and clusters them based on shared characteristics.

### Inputs

```text
Finding title
Scanner metadata
Affected files
Code locations
Embeddings
Dependency relationships
```
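
As a rough illustration, the inputs above could be carried in a single record per finding. This is a minimal sketch: the class name `Finding`, every field name, and the example values (including the scanner name and line number) are assumptions made for illustration and are not taken from the PatchPilot codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Finding:
    """One scanner finding; fields mirror the inputs listed above (names are hypothetical)."""
    title: str
    scanner: str                            # scanner metadata, reduced to a name here
    affected_files: List[str]
    code_locations: List[Tuple[str, int]]   # (file path, line number)
    embedding: List[float]
    dependencies: List[str] = field(default_factory=list)


# Hypothetical example record.
example = Finding(
    title="SQL Injection",
    scanner="example-scanner",
    affected_files=["database_helper.py"],
    code_locations=[("database_helper.py", 42)],
    embedding=[0.12, 0.80, 0.05],
)
```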

### Example Output

```json
{
  "group_id": "RCG-001",
  "root_cause": "database_helper.py",
  "findings_count": 15,
  "findings": [
    "SQL Injection",
    "Unsafe Query Construction",
    "Missing Input Validation"
  ]
}
```
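
One way a record of this shape might be assembled from a cluster is sketched below. It assumes that `findings_count` counts individual findings while `findings` lists distinct titles, and it picks the most frequently affected file as the root cause. Both the helper name `build_group` and that heuristic are assumptions, not a confirmed design.

```python
from collections import Counter


def build_group(group_id, findings):
    """Summarize one cluster of findings (dicts with 'title' and 'affected_files').

    Assumption: the likely root cause is the file touched by the most findings.
    """
    file_counts = Counter(path for f in findings for path in f["affected_files"])
    root_cause = file_counts.most_common(1)[0][0] if file_counts else None
    return {
        "group_id": group_id,
        "root_cause": root_cause,
        "findings_count": len(findings),                        # individual findings
        "findings": sorted({f["title"] for f in findings}),     # distinct titles
    }


cluster = [
    {"title": "SQL Injection", "affected_files": ["database_helper.py", "views.py"]},
    {"title": "SQL Injection", "affected_files": ["database_helper.py"]},
    {"title": "Unsafe Query Construction", "affected_files": ["database_helper.py", "reports.py"]},
]
group = build_group("RCG-001", cluster)
```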

### Backend

* Generate embeddings for findings.
* Cluster related findings using similarity analysis.
* Identify likely root causes.
* Persist grouping information.
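
The clustering step could be prototyped with nothing more than cosine similarity and a union-find structure, as in the sketch below. This is single-linkage clustering with a fixed threshold; the function names and the `0.9` threshold are assumptions, and a production version would likely use a dedicated clustering library and a tuned threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_by_similarity(embeddings, threshold=0.9):
    """Single-linkage clustering: link every pair with similarity >= threshold.

    Returns clusters as sorted lists of indices into `embeddings`.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy 2-D embeddings: the first two and the last two point in similar directions.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
clusters = cluster_by_similarity(toy)
```

The pairwise loop is quadratic in the number of findings, which is acceptable for hundreds of findings but would need an approximate nearest-neighbor index at larger scale.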

### Frontend

* Add "Root Cause Groups" view.
* Allow users to expand groups and inspect associated findings.
* Display the number of findings that fixing each root cause would resolve.
* Provide filtering by group.

### Evidence Pack Integration

Include the following files in generated Evidence Packs:

```text
root-cause-groups.json
root-cause-summary.txt
```
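
A sketch of how these two files might be written follows. The file names come from this issue; the function names, the one-line-per-group summary format, and the ordering by findings count are assumptions, and the group records are assumed to follow the example output shape shown earlier.

```python
import json
from pathlib import Path


def render_summary(groups):
    """Plain-text summary, largest groups first (the format is an assumption)."""
    ordered = sorted(groups, key=lambda g: g["findings_count"], reverse=True)
    lines = [
        f'{g["group_id"]}: {g["root_cause"]} ({g["findings_count"]} findings)'
        for g in ordered
    ]
    return "\n".join(lines) + "\n"


def write_root_cause_reports(groups, pack_dir):
    """Write both report files into the evidence pack directory."""
    pack = Path(pack_dir)
    pack.mkdir(parents=True, exist_ok=True)
    (pack / "root-cause-groups.json").write_text(
        json.dumps(groups, indent=2), encoding="utf-8"
    )
    (pack / "root-cause-summary.txt").write_text(
        render_summary(groups), encoding="utf-8"
    )
```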

## ML tier (if applicable)

* [ ] Tier 1 — Triage (severity ranking, deduplication, false positive classification)
* [x] Tier 2 — Predictive (fix success prediction, exploit scoring, pattern clustering)
* [ ] Tier 3 — Autonomous (LLM patch generation, self-healing pipeline)
* [ ] Not ML-related

## Alternatives considered

1. Display all findings individually.

   * Rejected because users must manually identify relationships between findings.

2. Group findings only by scanner type.

   * Rejected because findings from different scanners may share the same root cause.

3. Rule-based grouping only.

   * Rejected because ML-based similarity analysis can discover relationships not captured by static rules.

## Acceptance criteria

* [ ] PatchPilot groups related findings into root-cause clusters.
* [ ] Users can view findings grouped by root cause.
* [ ] Group information includes affected findings and likely source locations.
* [ ] Root cause reports are included in Evidence Packs.

## Additional context

This feature complements the existing Tier 1 embedding and deduplication roadmap by leveraging embeddings to identify meaningful relationships between findings. It helps developers focus on fixes with the highest remediation impact and reduces investigation effort.

