Summary
Deduplicate corpus entries before cross-evaluation. Overlapping rules often generate identical strings, and matching the same string against every rule more than once is wasted work.
Problem
With 1500+ overlapping rules generating 50 samples each, many rules produce identical strings (e.g., \d{8} and \d{4,10} both generate 12345678). These duplicates inflate the corpus and multiply the number of regex match operations without adding analytical value.
Proposed Solution
After corpus generation, deduplicate by text content while tracking provenance (which rules generated each string). During evaluation, match each unique string once and attribute results to all source rules.
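A minimal sketch of this approach, under stated assumptions: the corpus is a list of (rule_id, text) pairs, rules are a mapping from rule ID to pattern, and a "match" means a full match of the string. The names dedupe_corpus and cross_evaluate and the rule IDs are hypothetical, not taken from the existing code. Provenance is kept as a per-rule count rather than a set, so that a rule that generated the same string twice still contributes twice to the totals.

```python
import re
from collections import Counter, defaultdict


def dedupe_corpus(samples):
    """Collapse (rule_id, text) pairs into {text: Counter({rule_id: n})}.

    The Counter records how many times each rule generated the string,
    so no information is lost by dropping the duplicates.
    """
    provenance = defaultdict(Counter)
    for rule_id, text in samples:
        provenance[text][rule_id] += 1
    return dict(provenance)


def cross_evaluate(provenance, rules):
    """Match each unique string once, then credit every source rule.

    Returns {source_rule: {matching_rule: sample_count}}.
    Assumption: a rule "matches" when its pattern fully matches the text.
    """
    compiled = {rule_id: re.compile(p) for rule_id, p in rules.items()}
    overlaps = defaultdict(Counter)
    for text, sources in provenance.items():
        # The expensive step: runs once per unique string.
        matched = [rid for rid, rx in compiled.items() if rx.fullmatch(text)]
        # Cheap step: fan the result out to every rule that generated it.
        for source, count in sources.items():
            for rid in matched:
                overlaps[source][rid] += count
    return {source: dict(hits) for source, hits in overlaps.items()}


# Hypothetical rules and samples, mirroring the example in the Problem section.
rules = {"digits8": r"\d{8}", "digits4to10": r"\d{4,10}"}
samples = [
    ("digits8", "12345678"),
    ("digits4to10", "12345678"),  # duplicate text from a different rule
    ("digits4to10", "1234"),
]

provenance = dedupe_corpus(samples)
overlaps = cross_evaluate(provenance, rules)
```

Here three samples collapse to two unique strings, so the regexes run against two strings instead of three, while the per-rule overlap counts stay the same as they would be without deduplication.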
Expected Impact
- 20-40% corpus size reduction for overlapping rule sets
- Proportional reduction in regex match operations during evaluation
- No change in classification results
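The figures above are estimates. One way to check them on a real corpus is to compare string and match counts before and after deduplication. The sketch below assumes the corpus is available as a flat list of generated strings and that every string is matched against every rule; dedup_savings is a hypothetical helper name.

```python
def dedup_savings(texts, n_rules):
    """Report corpus size and match-operation counts with and without dedup.

    Assumption: evaluation matches every string against every rule, so the
    number of match operations is (number of strings) * n_rules.
    """
    total = len(texts)
    unique = len(set(texts))
    return {
        "total_strings": total,
        "unique_strings": unique,
        # Fraction of the corpus removed by deduplication.
        "reduction": (total - unique) / total if total else 0.0,
        "matches_before": total * n_rules,
        "matches_after": unique * n_rules,
    }


# Toy corpus: 5 generated strings, 3 of them distinct, evaluated against 2 rules.
stats = dedup_savings(["12345678", "12345678", "1234", "abcd", "abcd"], n_rules=2)
```

Because the match count is the string count times the rule count, the fractional saving in match operations equals the fractional reduction in corpus size.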
Acceptance Criteria