perf: corpus deduplication to reduce redundant regex matches #2

@slima4

Summary

Deduplicate corpus entries before cross-evaluation. Overlapping rules often generate identical strings, and testing the same string against every rule once per duplicate is wasted work.

Problem

With 1500+ overlapping rules generating 50 samples each, many rules produce identical strings (e.g., \d{8} and \d{4,10} can both generate 12345678). These duplicates inflate the corpus and multiply the number of regex match operations without adding analytical value.

Proposed Solution

After corpus generation, deduplicate by text content while tracking provenance (which rules generated each string). During evaluation, match each unique string once and attribute results to all source rules.
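
A minimal sketch of the dedup step, assuming the generator yields (rule_id, text) pairs. The function name `dedupe_corpus` and the rule ids are hypothetical and not taken from the codebase:

```python
def dedupe_corpus(samples):
    """Collapse (rule_id, text) pairs into a mapping text -> list of source rule ids.

    Assumption: provenance only needs to record *which* rules produced a string,
    not how many times each rule produced it.
    """
    provenance = {}
    for rule_id, text in samples:
        sources = provenance.setdefault(text, [])
        if rule_id not in sources:  # keep each source rule once per string
            sources.append(rule_id)
    return provenance


# Hypothetical generated samples: two rules emit the same string.
samples = [
    ("digits8", "12345678"),
    ("digits4to10", "12345678"),
    ("digits4to10", "1234"),
]
corpus = dedupe_corpus(samples)
```

Here `corpus` holds two unique strings instead of three samples, and "12345678" is attributed to both rules. If the match matrix stores counts rather than booleans, provenance would need to keep per-rule multiplicities as well.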

Expected Impact

  • 20-40% corpus size reduction for overlapping rule sets
  • Proportional reduction in regex matches
  • No change in classification results
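
To put the expected reduction in concrete terms, the arithmetic can be sketched as follows. It assumes exactly 1500 rules (the issue says "1500+"), 50 samples per rule, and that cross-evaluation matches every corpus string against every rule:

```python
rules = 1500
samples_per_rule = 50

corpus_size = rules * samples_per_rule  # strings before dedup
baseline_matches = corpus_size * rules  # every string against every rule

savings = {}
for reduction_pct in (20, 40):
    # Integer arithmetic avoids floating-point rounding in the estimate.
    unique_strings = corpus_size * (100 - reduction_pct) // 100
    matches = unique_strings * rules
    savings[reduction_pct] = baseline_matches - matches
    print(f"{reduction_pct}% reduction: {unique_strings:,} strings, "
          f"{matches:,} matches, {savings[reduction_pct]:,} saved")
```

Under these assumptions the baseline is 75,000 strings and 112,500,000 matches, so a 20-40% corpus reduction would avoid roughly 22.5 to 45 million match operations.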

Acceptance Criteria

  • Dedup step after generation, before evaluation
  • Provenance tracking (string → list of source rules)
  • Match matrix results identical to non-deduped path
  • Benchmark showing corpus reduction on real rule sets
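
One way the "identical results" criterion could be checked is sketched below. It assumes the match matrix counts, for each (source rule, matching rule) pair, how many of the source's samples the matching rule accepts, and that matching uses `re.fullmatch`. The rules, samples, and function names are illustrative only:

```python
import re
from collections import Counter, defaultdict

# Hypothetical rule set and generated samples.
RULES = {"digits8": r"\d{8}", "digits4to10": r"\d{4,10}", "hex4": r"[0-9a-f]{4}"}
SAMPLES = [
    ("digits8", "12345678"),
    ("digits4to10", "12345678"),
    ("digits4to10", "1234"),
    ("hex4", "1234"),
    ("hex4", "beef"),
]
COMPILED = {rule_id: re.compile(p) for rule_id, p in RULES.items()}


def matrix_naive(samples):
    """Match every sample against every rule, duplicates included."""
    matrix, calls = Counter(), 0
    for source, text in samples:
        for matcher, pattern in COMPILED.items():
            calls += 1
            if pattern.fullmatch(text):
                matrix[(source, matcher)] += 1
    return matrix, calls


def matrix_deduped(samples):
    """Match each unique string once, then attribute hits to all source rules."""
    provenance = defaultdict(Counter)  # text -> {source rule: multiplicity}
    for source, text in samples:
        provenance[text][source] += 1
    matrix, calls = Counter(), 0
    for text, sources in provenance.items():
        for matcher, pattern in COMPILED.items():
            calls += 1
            if pattern.fullmatch(text):
                for source, count in sources.items():
                    matrix[(source, matcher)] += count
    return matrix, calls


naive, naive_calls = matrix_naive(SAMPLES)
deduped, deduped_calls = matrix_deduped(SAMPLES)
assert naive == deduped  # same matrix from both paths
print(f"regex calls: naive={naive_calls}, deduped={deduped_calls}")
```

In this toy corpus the deduped path makes 9 regex calls instead of 15 while producing the same matrix. Keeping multiplicities in the provenance map is what preserves the counts; a plain list of source rules would only suffice for a boolean matrix.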

Labels

  • enhancement
  • performance
  • priority: low