Goal
Add RE2 as the preferred regex engine when available, with fallbacks:
- RE2 (when available and pattern-compatible)
- Boost.Regex (when available)
- std::regex (final fallback)
Expected benefits of RE2
- Performance – Typically 5–10× faster than std::regex for complex patterns and long text (DFA/NFA hybrid, no backtracking).
- Predictable complexity – Match time is linear, O(n), in input size for a given pattern; no exponential blow-up on pathological patterns.
- ReDoS mitigation – Avoids catastrophic backtracking, reducing risk of regex denial-of-service from user or API-supplied patterns.
- Thread safety – RE2 objects are safe to share across threads; no extra locking in the regex layer.
- Bounded memory – Configurable per-regex memory budget (RE2::Options max_mem); no unbounded memory growth during matching.
- Production use – Widely used in production; BSD-3-Clause license.
Patterns that need lookahead/lookbehind or backreferences will continue to use Boost or std::regex via the fallback path.
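The fallback decision above could be driven by a cheap syntactic pre-check before RE2 is tried. The following is a minimal sketch, not the project's actual implementation: the function name `NeedsBacktrackingEngine` is hypothetical, and the heuristic is deliberately conservative (it does not track character classes, so a false positive only routes a pattern to a slower engine).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (name is an assumption, not from the codebase).
// Returns true when `pattern` appears to use a construct RE2 does not
// support: lookahead, lookbehind, or a backreference.
bool NeedsBacktrackingEngine(std::string_view pattern) {
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    const char c = pattern[i];
    if (c == '\\') {
      if (i + 1 >= pattern.size()) break;  // trailing backslash, nothing follows
      const char next = pattern[i + 1];
      // \1..\9 is a numbered backreference; \k introduces a named one.
      if ((next >= '1' && next <= '9') || next == 'k') return true;
      ++i;  // skip the escaped character so "\(" is not read as a group
      continue;
    }
    if (c == '(') {
      const std::string_view head = pattern.substr(i, 4);
      // (?= and (?! are lookahead; (?<= and (?<! are lookbehind.
      // (?<name> is a named group, so only the two exact forms are flagged.
      if (head.substr(0, 3) == "(?=" || head.substr(0, 3) == "(?!" ||
          head == "(?<=" || head == "(?<!") {
        return true;
      }
    }
  }
  return false;
}
```

For example, `NeedsBacktrackingEngine("foo(?=bar)")` reports true, while `NeedsBacktrackingEngine("(?:abc)+")` reports false and the pattern stays eligible for RE2.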
Approach (Option B – package managers)
- macOS: Homebrew (brew install re2)
- Linux: vcpkg (vcpkg install re2) alongside existing Boost
- Windows: initially no RE2; optional vcpkg later for parity
Implementation order
Benchmark strengthening (before RE2)
- Regex microbenchmark – New regex_benchmark executable to measure compile + match time for literal, simple, and complex patterns on filename/content corpora. Establish baseline with current std/Boost implementation.
- SearchBenchmark extensions – Add --regex-engine=auto|re2|boost|std and regex-heavy reference configs to measure end-to-end impact.
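The two benchmark items above might start from something like the sketch below. It assumes a std::regex-only baseline and separates compile time from match time; the names `RegexEngine`, `ParseRegexEngineFlag`, `BenchResult`, and `BenchStdRegex` are hypothetical, and only the `--regex-engine=auto|re2|boost|std` flag syntax comes from this issue.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical enum mirroring the values of --regex-engine.
enum class RegexEngine { kAuto, kRe2, kBoost, kStd };

// Parses "--regex-engine=<value>"; returns nullopt for any other argument
// or an unknown value.
std::optional<RegexEngine> ParseRegexEngineFlag(std::string_view arg) {
  constexpr std::string_view kPrefix = "--regex-engine=";
  if (arg.substr(0, kPrefix.size()) != kPrefix) return std::nullopt;
  const std::string_view value = arg.substr(kPrefix.size());
  if (value == "auto") return RegexEngine::kAuto;
  if (value == "re2") return RegexEngine::kRe2;
  if (value == "boost") return RegexEngine::kBoost;
  if (value == "std") return RegexEngine::kStd;
  return std::nullopt;
}

// Hypothetical result record: compile and match phases are timed separately.
struct BenchResult {
  std::chrono::nanoseconds compile_time{0};
  std::chrono::nanoseconds match_time{0};
  std::size_t matches = 0;
};

// Baseline measurement with std::regex over a corpus of lines.
BenchResult BenchStdRegex(const std::string& pattern,
                          const std::vector<std::string>& corpus) {
  using Clock = std::chrono::steady_clock;
  BenchResult result;
  const auto t0 = Clock::now();
  const std::regex re(pattern, std::regex::ECMAScript | std::regex::optimize);
  const auto t1 = Clock::now();
  for (const std::string& line : corpus) {
    if (std::regex_search(line, re)) ++result.matches;
  }
  const auto t2 = Clock::now();
  result.compile_time =
      std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
  result.match_time =
      std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
  return result;
}
```

A real microbenchmark would repeat each measurement many times and report a median; this sketch takes a single sample per phase to keep the structure visible.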
RE2 integration
- Build plumbing – CMake detection (find_package(re2 CONFIG QUIET)), HAVE_RE2 / RE2_REGEX_AVAILABLE, CI steps to install RE2 on macOS and Linux. No runtime behavior change.
- Unified regex wrapper – Single API with engine priority (RE2 → Boost → std), pattern checks for RE2-unsupported features (lookahead/lookbehind, backrefs), caching.
- Tests & benchmarks – Unit tests for selection/fallback; re-run regex_benchmark and search_benchmark to quantify improvement.
- Windows RE2 (optional) – vcpkg on Windows, PGO rules for RE2 target.
- Documentation – Engine order, CMake options, pattern limitations, how to run benchmarks.
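One possible shape for the unified wrapper and its cache is sketched below. Several things here are assumptions: the class name `UnifiedRegex` and the function `GetCachedRegex` are hypothetical, the Boost tier is omitted for brevity, and instead of a syntactic pre-check the sketch simply lets RE2 reject patterns it cannot compile. Only the `HAVE_RE2` definition and the RE2 → fallback ordering come from this issue. Without `HAVE_RE2` the code compiles to a plain std::regex wrapper.

```cpp
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>
#include <utility>
#if defined(HAVE_RE2)
#include <re2/re2.h>
#endif

// Hypothetical wrapper: tries RE2 first, falls back to std::regex.
class UnifiedRegex {
 public:
  enum class Engine { kRe2, kStd };

  // Throws std::regex_error if no available engine accepts the pattern.
  explicit UnifiedRegex(const std::string& pattern) {
#if defined(HAVE_RE2)
    re2::RE2::Options options;
    options.set_log_errors(false);  // rejection is expected, not an error
    auto candidate = std::make_unique<re2::RE2>(pattern, options);
    if (candidate->ok()) {
      re2_ = std::move(candidate);
      engine_ = Engine::kRe2;
      return;
    }
#endif
    std_ = std::regex(pattern);  // final fallback
  }

  // Unanchored search, equivalent to std::regex_search.
  bool Search(const std::string& text) const {
#if defined(HAVE_RE2)
    if (re2_) return re2::RE2::PartialMatch(text, *re2_);
#endif
    return std::regex_search(text, std_);
  }

  Engine engine() const { return engine_; }

 private:
  Engine engine_ = Engine::kStd;
  std::regex std_;
#if defined(HAVE_RE2)
  std::unique_ptr<re2::RE2> re2_;
#endif
};

// Hypothetical process-wide cache keyed by pattern text. Compilation happens
// under the lock, which is simple but serializes first-time compiles.
std::shared_ptr<const UnifiedRegex> GetCachedRegex(const std::string& pattern) {
  static std::mutex mu;
  static std::unordered_map<std::string, std::shared_ptr<const UnifiedRegex>>
      cache;
  std::lock_guard<std::mutex> lock(mu);
  const auto it = cache.find(pattern);
  if (it != cache.end()) return it->second;
  auto compiled = std::make_shared<const UnifiedRegex>(pattern);
  cache.emplace(pattern, compiled);
  return compiled;
}
```

A pattern with a backreference such as `(a)\1` ends up on the std::regex path in either build configuration, which is the behavior the selection/fallback unit tests would need to pin down.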
References
- RE2 feasibility and pattern fallbacks: internal-docs/archive/RE2_FEASIBILITY_STUDY.md (in main repo)
- Implementation phases and benchmark plan: internal-docs/plans/2026-03-15_RE2_AND_BENCHMARK_PHASES.md (in main repo)
This issue is a placeholder for tracking the above work; no code changes required until implementation starts.