feat: make regexr default backend with optional PCRE2 support #9
Merged
farhan-syah merged 5 commits into main on Dec 2, 2025
Switch from PCRE2 to regexr as the default regex backend, making splintr a pure-Rust tokenizer with no C dependencies. PCRE2 remains available as an optional backend via the 'pcre2' feature flag.

Key changes:
- Regexr backend: Pure Rust with JIT compilation and SIMD acceleration
- PCRE2 backend: Now optional, enabled via --features pcre2
- Runtime switching: Added .pcre2() method to switch backends
- Unified implementation: Merged Tokenizer and TokenizerRegexr into single Tokenizer class with RegexBackend enum
- Documentation: Updated README, API docs, and benchmarks to reflect new default backend

Benchmarking tools:
- benchmark_regexr_comparison.py: Compare regexr vs PCRE2 performance
- benchmark_regexr_viz.py: Visual comparison with charts

This change eliminates C dependencies while maintaining performance through regexr's JIT and SIMD optimizations. Users requiring PCRE2 can opt-in via feature flags or runtime switching.
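For reviewers who want a picture of the unified design, here is a minimal sketch of a single Tokenizer that carries a RegexBackend and can be switched at runtime. It is not the code from this PR: the unit-like enum variants stand in for the compiled regex objects the real enum wraps, and the `pcre2(enabled)` signature, the `backend()` accessor, and the error type are assumptions made for illustration.

```rust
/// Which regex engine a tokenizer uses. Stand-in for the enum named in the
/// commit message; the real variants wrap compiled regex objects.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegexBackend {
    Regexr,
    Pcre2,
}

/// Simplified tokenizer that only tracks its backend (hypothetical layout).
#[derive(Debug)]
pub struct Tokenizer {
    backend: RegexBackend,
}

impl Tokenizer {
    /// regexr is the default backend, as described in the commit message.
    pub fn new() -> Self {
        Tokenizer { backend: RegexBackend::Regexr }
    }

    /// Hypothetical runtime switch. Selecting PCRE2 only succeeds when the
    /// crate was built with the `pcre2` feature; otherwise an error is returned.
    #[allow(unexpected_cfgs)]
    pub fn pcre2(mut self, enabled: bool) -> Result<Self, String> {
        if !enabled {
            self.backend = RegexBackend::Regexr;
            return Ok(self);
        }
        if cfg!(feature = "pcre2") {
            self.backend = RegexBackend::Pcre2;
            Ok(self)
        } else {
            Err("PCRE2 support requires building with --features pcre2".to_string())
        }
    }

    /// Report which backend is currently selected.
    pub fn backend(&self) -> RegexBackend {
        self.backend
    }
}
```

Under these assumptions, `Tokenizer::new()` reports `RegexBackend::Regexr`, and `Tokenizer::new().pcre2(true)` returns an error unless the build enabled the feature.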
Switch regexr dependency from local path to published version 0.1.0-beta.2 on crates.io. Add .cargo/ to .gitignore to support local development overrides via cargo config patches without committing them. This enables publishing splintr while maintaining flexibility for local development with unpublished regexr changes.
Configure regexr with conditional compilation:
- Unix platforms: enable jit + simd features
- Windows: disable jit feature (simd only)

This prevents ABI crashes on Windows x86_64 where JIT-compiled code causes segmentation faults. The platform-specific dependency ensures JIT is only enabled where it works reliably. Bump regexr to 0.1.0-beta.3 for both targets.
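The change itself lives in the Cargo manifest, which is not shown here. As a rough illustration of the same platform predicates expressed in Rust, the sketch below mirrors the per-platform feature choice. The function names are hypothetical; only the feature names `jit` and `simd` come from the commit message, and the empty fallback for other targets is an assumption.

```rust
/// Returns the regexr features the commit message describes for the current
/// target. Hypothetical helper; the real selection happens in Cargo.toml.
pub fn regexr_features_for_target() -> &'static [&'static str] {
    if cfg!(windows) {
        // JIT is disabled on Windows, leaving SIMD only.
        &["simd"]
    } else if cfg!(unix) {
        // Unix platforms get both JIT and SIMD.
        &["jit", "simd"]
    } else {
        // Other targets are not covered by the commit message (assumption).
        &[]
    }
}

/// Compile-time variant of the same decision: exactly one of these two
/// definitions is compiled for any given target.
#[cfg(unix)]
pub fn jit_expected() -> bool {
    true
}

#[cfg(not(unix))]
pub fn jit_expected() -> bool {
    false
}
```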
Replace per-test tokenizer construction with static LazyLock instances. Each test file now creates the tokenizer once on first access instead of reconstructing it for every test function.

This optimization reduces test suite execution time from 60+ seconds to under 1 second by amortizing expensive regex compilation and vocabulary loading across all tests in each file.

Changes:
- Add LazyLock static for shared tokenizer instance
- Split helper into accessor and implementation functions
- Preserve existing API for variant-specific tests
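The pattern described here can be sketched with `std::sync::LazyLock` (stable since Rust 1.80). The `Tokenizer` struct, the function names, and the build counter below are placeholders for illustration, not the test helpers from this PR; the counter exists only to show that construction happens once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the expensive constructor actually runs.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Placeholder for the real tokenizer (hypothetical field).
pub struct Tokenizer {
    pub vocab_size: usize,
}

/// "Implementation" function: stands in for the costly work of compiling
/// the regex and loading the vocabulary.
fn build_tokenizer() -> Tokenizer {
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    Tokenizer { vocab_size: 3 }
}

/// Shared instance, initialized on first access and reused afterwards.
static SHARED_TOKENIZER: LazyLock<Tokenizer> = LazyLock::new(build_tokenizer);

/// "Accessor" function that each test would call instead of building
/// its own tokenizer.
pub fn shared_tokenizer() -> &'static Tokenizer {
    &SHARED_TOKENIZER
}
```

Every call to `shared_tokenizer()` returns a reference to the same instance, so the setup cost is paid once per test binary rather than once per test function.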
Move regexr dependency from platform-specific targets to main dependencies section where it belongs, and update to version 0.1.0-beta.4. This fixes the crate not being linked properly.

Box the RegexrRegex variant in RegexBackend enum to resolve clippy warning about large enum variant size difference (2912 bytes vs 64 bytes).
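To see why boxing helps, the sketch below compares an unboxed and a boxed enum built from stand-in payloads of the two sizes quoted in the commit message. The byte arrays are placeholders for the real regex types, and `Pcre2Regex` as the name of the 64-byte variant is an assumption.

```rust
/// Stand-in for the large regexr payload (2912 bytes per the commit message).
pub struct RegexrRegex {
    _state: [u8; 2912],
}

/// Stand-in for the smaller payload (64 bytes); the name is hypothetical.
pub struct Pcre2Regex {
    _state: [u8; 64],
}

/// Every value of this enum is as large as its biggest variant, which is
/// what clippy's large_enum_variant lint flags.
pub enum UnboxedBackend {
    Regexr(RegexrRegex),
    Pcre2(Pcre2Regex),
}

/// Boxing the large variant moves its payload to the heap, so the enum
/// itself only stores a pointer for that case.
pub enum BoxedBackend {
    Regexr(Box<RegexrRegex>),
    Pcre2(Pcre2Regex),
}

/// Returns (unboxed size, boxed size) in bytes for the two layouts.
pub fn backend_sizes() -> (usize, usize) {
    (
        std::mem::size_of::<UnboxedBackend>(),
        std::mem::size_of::<BoxedBackend>(),
    )
}
```

The trade-off is one extra heap allocation and pointer indirection for the boxed variant, in exchange for a much smaller enum that is cheaper to move.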
Summary
- PCRE2 remains available as an optional feature (pcre2) for benchmarking comparisons
- Use the published regexr version (0.1.0-beta.2) instead of path dependency

Changes
- Update tokenizer.rs to use regexr as the default backend
- Add .cargo/ to .gitignore for local dev overrides

Test plan
- Run cargo test to verify all tests pass
- Run maturin develop --release and test Python bindings