feat: optional features, SentencePiece byte-fallback decoding by farhan-syah · Pull Request #14 · ml-rust/splintr

farhan-syah · 2026-03-11T22:15:53Z

Summary

- Make `rayon` and `regexr-jit` optional features (both default-enabled) so WASM/embedded targets can opt out via `--no-default-features`
- Fix SentencePiece decoding of `<0xNN>` byte-fallback tokens: properly reconstruct multi-byte UTF-8 sequences instead of passing through literal token strings
- Gate all `rayon` call sites behind `#[cfg(feature = "rayon")]` with sequential fallbacks
- Remove `pcre2` from default features, making `regexr` the sole default backend
- Context-aware leading-space stripping: only strip for multi-token sequences to preserve word boundaries in streaming decode
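
The feature gating described in the summary can be sketched roughly as below. This is an illustration only, not the code from this PR: `encode_one` and `encode_batch` are hypothetical names, and the per-text work is a placeholder. The point is that both variants share one signature, so call sites do not change when the `rayon` feature is switched off.

```rust
// Only pulled in when the crate is built with the `rayon` feature.
#[cfg(feature = "rayon")]
use rayon::prelude::*;

/// Hypothetical stand-in for the per-text encode step (not splintr's API).
fn encode_one(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Parallel path: compiled only when the `rayon` feature is enabled.
#[cfg(feature = "rayon")]
pub fn encode_batch(texts: &[&str]) -> Vec<Vec<u32>> {
    texts.par_iter().map(|t| encode_one(t)).collect()
}

/// Sequential fallback: same signature, used when `rayon` is disabled
/// (for example with `--no-default-features`).
#[cfg(not(feature = "rayon"))]
pub fn encode_batch(texts: &[&str]) -> Vec<Vec<u32>> {
    texts.iter().map(|t| encode_one(t)).collect()
}
```

Because exactly one of the two `encode_batch` definitions is compiled for any feature set, callers never need their own `cfg` attributes.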

Test plan

- All 65 tests pass across feature combinations (default, `--no-default-features`, explicit features)
- Zero clippy warnings
- New unit tests for `parse_byte_fallback()` edge cases (valid hex, invalid formats, boundary values)
- Integration test for byte-fallback UTF-8 reconstruction (e.g. é via `<0xC3><0xA9>`)
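
A minimal sketch of the byte-fallback reconstruction exercised by these tests, under stated assumptions: the name `parse_byte_fallback` comes from the PR, but its signature here is a guess, and `decode_pieces` and `flush` are hypothetical helpers that operate on piece strings rather than token IDs. Whether the real parser accepts lowercase hex digits is not stated in the PR; this sketch does.

```rust
/// Parse a piece of the exact form `<0xNN>` (two hex digits) into a byte.
/// The signature is an assumption; only the function name is from the PR.
fn parse_byte_fallback(piece: &str) -> Option<u8> {
    let hex = piece.strip_prefix("<0x")?.strip_suffix('>')?;
    // Require exactly two hex digits so inputs like "<0x+F>" are rejected.
    if hex.len() != 2 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}

/// Append any buffered raw bytes to `out`, replacing invalid UTF-8 with U+FFFD.
fn flush(pending: &mut Vec<u8>, out: &mut String) {
    if !pending.is_empty() {
        out.push_str(&String::from_utf8_lossy(pending.as_slice()));
        pending.clear();
    }
}

/// Hypothetical decode loop: consecutive byte-fallback pieces are buffered
/// and converted together, so multi-byte UTF-8 sequences are reassembled.
fn decode_pieces(pieces: &[&str]) -> String {
    let mut out = String::new();
    let mut pending: Vec<u8> = Vec::new();
    for piece in pieces {
        match parse_byte_fallback(piece) {
            Some(byte) => pending.push(byte),
            None => {
                flush(&mut pending, &mut out);
                out.push_str(piece);
            }
        }
    }
    flush(&mut pending, &mut out);
    out
}
```

With this sketch, `decode_pieces(&["caf", "<0xC3>", "<0xA9>"])` yields `café`, whereas passing the pieces through as literal strings would yield `caf<0xC3><0xA9>`.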

Implement a SentencePiece unigram tokenizer alongside the existing BPE tokenizer, enabling support for Mistral V1/V2 and other models that use SentencePiece rather than tiktoken-style BPE.

Core implementation in src/core/sentencepiece.rs:
- Greedy longest-match encoding with score-based tie-breaking
- BOS prepending and ▁ word boundary handling
- Lossless and lossy decode paths with BOS/EOS skipping
- SentencePieceError type for out-of-range token IDs

Public API surface:
- SentencePieceTokenizer and SentencePieceError re-exported from crate root
- PySentencePieceTokenizer PyO3 wrapper with encode, decode, decode_lossy
- bos_token_id_by_name added to pretrained API for symmetry with eos_token_id_by_name
- SentencePieceTokenizer registered in the Python _core module and __init__.py

Documentation updated to cover both BPE and SentencePiece APIs in the Rust and Python reference sections of docs/api_guide.md and README.md.

Update version to 0.9.0 across Cargo.toml, pyproject.toml, .version, and uv.lock. Also update the crate description and keywords to reflect the addition of SentencePiece alongside BPE.

Make regexr the sole default backend. The pcre2 feature remains available as an explicit opt-in for benchmarking purposes.

Introduce `rayon` and `regexr-jit` as named optional features, both enabled by default, so that WASM and embedded targets can opt out via `--no-default-features`. Add a `wasm` feature as a no-op marker that documents the intended build profile. Move regexr's jit/simd flags under the new `regexr-jit` feature rather than hardcoding them unconditionally.

…on usage

Decode `<0xNN>` byte-fallback tokens by accumulating raw bytes and converting via `from_utf8_lossy`, which correctly reconstructs multi-byte UTF-8 sequences (e.g. 'é' encoded as `<0xC3><0xA9>`). Previously these tokens were passed through as literal strings, producing garbled output for non-ASCII text.

Preserve the leading-space stripping only for multi-token sequences so that single-token streaming decode does not lose meaningful word boundaries.

Gate all `rayon` call sites behind `#[cfg(feature = "rayon")]` with sequential fallbacks, enabling the tokenizer to build for WASM and other no-std targets without changes to call sites.
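
The leading-space rule might look something like the following sketch. It assumes the usual SentencePiece convention that `▁` (U+2581) marks a word boundary and is rendered as a space. `join_pieces` is a hypothetical helper working on piece strings, not the function changed in this commit.

```rust
/// SentencePiece word-boundary marker (U+2581), rendered as a space on decode.
const WORD_BOUNDARY: char = '\u{2581}';

/// Hypothetical helper: join pieces into text, stripping the leading space
/// only when more than one piece is decoded at once.
fn join_pieces(pieces: &[&str]) -> String {
    let mut text = pieces.concat().replace(WORD_BOUNDARY, " ");
    // A lone piece keeps its leading space: in streaming decode that space
    // is the boundary between this word and the previously emitted one.
    if pieces.len() > 1 && text.starts_with(' ') {
        text.remove(0);
    }
    text
}
```

Decoding `["▁Hello", "▁world"]` in one call gives `Hello world`, while decoding `["▁world"]` alone gives ` world`, so concatenating streamed single-token outputs keeps the words separated.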

farhan-syah added 5 commits February 27, 2026 14:46

chore: bump version to 0.9.0

04e85cc

chore: remove pcre2 from default features

8a9872c

farhan-syah merged commit 2b663f3 into main Mar 12, 2026
13 checks passed

farhan-syah merged 5 commits into main from 0.9.0
