Release 0.9.1: regexr stable pin and CI improvements #15

Merged: farhan-syah merged 6 commits into main from fix/regexr-version-pin on Mar 15, 2026

@farhan-syah (Collaborator) commented on Mar 15, 2026

Summary

  • Bump version to 0.9.1 with regexr 0.1.0 stable dependency (was 0.1.0-beta.5)
  • Restructure CI into reusable test.yml workflow + thin ci.yml wrapper
  • Add rust-cache, concurrency control, and draft PR skipping
  • Delegate cross-compilation checks to regexr's CI

Test plan

  • CI passes on all matrix targets (ubuntu, macos, windows)
  • Python binding tests pass
  • cargo publish --dry-run succeeds against crates.io regexr 0.1.0

Add `WordPieceTokenizer` for BERT-family models with full
BasicTokenizer preprocessing: lowercasing, Unicode accent
stripping (NFD decomposition), and CJK/punctuation splitting.
Implements greedy longest-match subword tokenization with `##`
continuation prefixes, UNK fallback for words exceeding max
length, and decode that reconstructs text by joining subwords.
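
The greedy longest-match step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the crate's actual implementation: the function names `wordpiece` and `join_pieces`, the `max_chars` parameter, and the `[UNK]` literal are assumptions made for the example, and the input is assumed to be a single word that has already been through BasicTokenizer-style preprocessing.

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece over one pre-tokenized word.
/// Hypothetical sketch: names, signature, and the "[UNK]" literal are assumptions.
fn wordpiece(word: &str, vocab: &HashSet<&str>, max_chars: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    // Words longer than the limit fall back to a single unknown token.
    if chars.len() > max_chars {
        return vec!["[UNK]".to_string()];
    }
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest remaining substring first, then shrink from the right.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                // Non-initial pieces carry the `##` continuation prefix.
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No vocabulary entry matches at this position: the whole word is unknown.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}

/// Rebuild text by gluing `##` pieces onto the previous piece
/// and separating word-initial pieces with a space.
fn join_pieces(pieces: &[String]) -> String {
    let mut out = String::new();
    for piece in pieces {
        if let Some(rest) = piece.strip_prefix("##") {
            out.push_str(rest);
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(piece);
        }
    }
    out
}
```

With a toy vocabulary containing `un`, `##aff`, and `##able`, the word `unaffable` splits into those three pieces, and `join_pieces` reassembles the original word. In this sketch a word with no matching piece at some position also maps to `[UNK]`, which is an assumption about behaviour the commit message does not spell out.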

Add `Tokenize` trait as a unified interface across all backends
(`Tokenizer`, `SentencePieceTokenizer`, `WordPieceTokenizer`),
enabling generic tokenization code via trait objects.
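
As a rough illustration of what a unified interface buys, the sketch below defines a stand-in trait and two toy backends, then drives both through trait objects. The method names `encode` and `decode`, their signatures, and the `ByteToy` and `CharToy` types are all hypothetical; the crate's real `Tokenize` trait may look different.

```rust
/// Stand-in for a unified tokenizer interface.
/// Hypothetical: the real `Tokenize` trait may expose different methods.
trait Tokenize {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
}

/// Toy backend: one token per UTF-8 byte.
struct ByteToy;

impl Tokenize for ByteToy {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        // Ids produced by `encode` always fit in a byte; other ids would be truncated.
        let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}

/// Toy backend: one token per Unicode scalar value.
struct CharToy;

impl Tokenize for CharToy {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.chars().map(|c| c as u32).collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        // Ids that are not valid scalar values are skipped.
        ids.iter().filter_map(|&id| char::from_u32(id)).collect()
    }
}

/// Generic code that works with any backend through a trait object.
fn round_trip(tok: &dyn Tokenize, text: &str) -> String {
    tok.decode(&tok.encode(text))
}

/// Compare how many tokens each backend produces for the same input.
fn token_counts(backends: &[Box<dyn Tokenize>], text: &str) -> Vec<usize> {
    backends.iter().map(|b| b.encode(text).len()).collect()
}
```

Callers such as `token_counts` never name a concrete tokenizer type, so adding another backend only requires one more `impl Tokenize` block.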

New dependencies: `unicode-normalization` and
`unicode-general-category` for accent stripping and punctuation
detection.

Add WordPieceTokenizer and Tokenize trait sections to the API guide
with usage examples and method reference. Update README and crate
description to reflect BPE + SentencePiece + WordPiece support.

Reformat Markdown tables in special_tokens.md for consistent column
alignment.

Move lint, multi-platform test, cross-compile, and Python binding
jobs into a reusable test.yml workflow callable by both ci.yml and
the release workflow. ci.yml becomes a thin dispatcher with
concurrency cancellation and draft-PR skipping.

Adds cross-compile checks for aarch64-unknown-linux-gnu and
x86_64-apple-darwin, and switches to rust-cache for faster CI runs.

Relax the regexr dependency from the pinned pre-release
0.1.0-beta.5 to the stable ^0.1 semver range so that patch
releases are picked up automatically.

Cross-compilation checks for aarch64 and macOS are already covered
by regexr's CI, so maintaining a duplicate job here is unnecessary.
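
To make the effect of the `^0.1` requirement mentioned above concrete, here is a small standard-library sketch of which version strings it admits. Under Cargo's caret rules, `^0.1` is equivalent to `>=0.1.0, <0.2.0`. The helper name `satisfies_caret_0_1` is hypothetical, and the sketch simplifies by rejecting any string with a pre-release tag or build metadata instead of implementing full semver matching.

```rust
/// Returns true if `version` falls in the range `>=0.1.0, <0.2.0`,
/// which is what the caret requirement `^0.1` means for a 0.x crate.
/// Hypothetical helper: simplified, not a full semver implementation.
fn satisfies_caret_0_1(version: &str) -> bool {
    // Simplification: anything with a pre-release tag (e.g. "0.1.0-beta.5") is rejected.
    if version.contains('-') {
        return false;
    }
    // Parse "major.minor.patch" into numbers; any malformed part rejects the version.
    let parts: Result<Vec<u64>, _> = version
        .split('.')
        .map(|p| p.parse::<u64>())
        .collect();
    match parts {
        // Major 0 and minor 1 with any patch level are accepted.
        Ok(p) => matches!(p.as_slice(), [0, 1, _]),
        Err(_) => false,
    }
}
```

Under this reading, `0.1.0` and a later `0.1.7` both satisfy the requirement, while `0.2.0`, `1.0.0`, and the old `0.1.0-beta.5` pre-release do not.
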
@farhan-syah farhan-syah merged commit 0c1f885 into main Mar 15, 2026
5 checks passed
@farhan-syah farhan-syah deleted the fix/regexr-version-pin branch March 15, 2026 09:19