perf: corpus deduplication to reduce redundant regex matches #2

## Summary

Deduplicate corpus entries before cross-evaluation. Overlapping rules often generate identical strings, and testing the same string against every rule more than once is wasted work.

## Problem

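With 1500+ overlapping rules generating 50 samples each, the corpus holds at least 75,000 strings, and cross-evaluating each of them against every rule costs over 112 million regex matches. Many rules produce identical strings (e.g., `\d{8}` and `\d{4,10}` both generate `12345678`). These duplicates inflate the corpus and multiply regex matches without adding analytical value.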
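To make the cost concrete, here is a toy sketch using the two patterns from the example. The rule names and the `(rule_id, text)` corpus layout are assumptions for illustration, not the project's actual data model.

```python
import re

# Two overlapping rules from the example; the rule names are hypothetical.
rules = {
    "eight_digits": re.compile(r"\d{8}"),
    "four_to_ten_digits": re.compile(r"\d{4,10}"),
}

# Assume each rule's generator happened to emit the same sample.
corpus = [("eight_digits", "12345678"), ("four_to_ten_digits", "12345678")]

# Naive cross-evaluation: every corpus entry is tested against every rule.
naive_matches = len(corpus) * len(rules)

# Deduplicated: each distinct string is tested against every rule once.
unique_texts = {text for _, text in corpus}
dedup_matches = len(unique_texts) * len(rules)

print(naive_matches, dedup_matches)  # 4 2
```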

## Proposed Solution

After corpus generation, deduplicate by text content while tracking provenance (which rules generated each string). During evaluation, match each unique string once and attribute results to all source rules.
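A minimal sketch of this step, under several assumptions: the corpus is a list of `(rule_id, text)` pairs, rules are compiled `re` patterns keyed by rule id, a "match" means `fullmatch`, and the match matrix counts how many of a source rule's samples each target rule matches. The function names `dedupe` and `evaluate` are hypothetical.

```python
import re
from collections import defaultdict

def dedupe(samples):
    """Group (rule_id, text) samples by text: text -> list of source rule ids.

    A rule is listed once per sample it produced, so multiplicity survives
    and per-rule counts stay the same as in the non-deduped path.
    """
    provenance = defaultdict(list)
    for rule_id, text in samples:
        provenance[text].append(rule_id)
    return dict(provenance)

def evaluate(provenance, patterns):
    """Build matrix[source][target] = number of source's samples that target matches."""
    matrix = defaultdict(lambda: defaultdict(int))
    for text, sources in provenance.items():
        # One regex call per (unique string, rule) pair.
        hits = [rid for rid, pat in patterns.items() if pat.fullmatch(text)]
        # Attribute the hits to every rule that generated this string.
        for source in sources:
            for target in hits:
                matrix[source][target] += 1
    return {s: dict(t) for s, t in matrix.items()}

patterns = {"r1": re.compile(r"\d{8}"), "r2": re.compile(r"\d{4,10}")}
samples = [("r1", "12345678"), ("r2", "12345678"), ("r2", "1234")]

provenance = dedupe(samples)
matrix = evaluate(provenance, patterns)
print(provenance)  # {'12345678': ['r1', 'r2'], '1234': ['r2']}
print(matrix)      # {'r1': {'r1': 1, 'r2': 1}, 'r2': {'r1': 1, 'r2': 2}}
```

Keeping one provenance entry per generated sample, rather than a set of rule ids, is a deliberate choice here: it lets count-based results match the non-deduped path even when a single rule emits the same string more than once.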

## Expected Impact

- **20-40%** corpus size reduction for overlapping rule sets
- Proportional reduction in regex matches
- No change in classification results

## Acceptance Criteria

- [ ] Dedup step after generation, before evaluation
- [ ] Provenance tracking (string → list of source rules)
- [ ] Match matrix results identical to non-deduped path
- [ ] Benchmark showing corpus reduction on real rule sets
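The last two criteria could be checked along these lines. This is a sketch on toy data, assuming the same hypothetical `(rule_id, text)` corpus layout and `fullmatch` semantics; a real benchmark would run against actual rule sets and report measured numbers.

```python
import re
from collections import Counter, defaultdict

def naive_matrix(samples, patterns):
    """Reference path: test every sample against every rule."""
    counts = Counter()
    for source, text in samples:
        for target, pat in patterns.items():
            if pat.fullmatch(text):
                counts[(source, target)] += 1
    return counts

def deduped_matrix(samples, patterns):
    """Deduped path: test each unique string once, attribute to all sources."""
    provenance = defaultdict(list)
    for source, text in samples:
        provenance[text].append(source)
    counts = Counter()
    for text, sources in provenance.items():
        hits = [t for t, pat in patterns.items() if pat.fullmatch(text)]
        for source in sources:
            for target in hits:
                counts[(source, target)] += 1
    return counts, len(provenance)

patterns = {"r1": re.compile(r"\d{8}"), "r2": re.compile(r"\d{4,10}")}
samples = [("r1", "12345678"), ("r2", "12345678"), ("r2", "1234"), ("r1", "00000000")]

baseline = naive_matrix(samples, patterns)
deduped, unique_count = deduped_matrix(samples, patterns)

# Criterion: match matrix identical to the non-deduped path.
assert deduped == baseline

# Criterion: report corpus reduction (toy data here, not a real rule set).
reduction = 1 - unique_count / len(samples)
print(f"corpus reduction: {reduction:.0%}")  # corpus reduction: 25%
```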
