Feature/complete sentencepiece tokenizer #65
Arpitsh7 wants to merge 3 commits into AOSSIE-Org:main from
Conversation
Summary
Completes the SentencePieceTokenizer implementation introduced in #17 by adding encode(), decode(), and load() — making SentencePieceTokenizer fully usable in the pipeline.
Background
PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the implementations. However, the current SentencePieceTokenizer only supports training and lacks encode(), decode(), and load().
Because these methods are absent, the tokenizer cannot process text even after being trained, which prevents it from being used in downstream pipeline tasks.
Changes
openverifiablellm/tokenizer/sentencepiece_tokenizer.py
- encode(text) -> list[int]
- decode(ids) -> str
- load(tokenizer_dir) — reloads from vocab.json + merges.txt
- text_file.is_file() validation in train()
- _check_loaded() guard for encode/decode

tests/test_sentencepiece.py ← new file

Testing
All 18 tests pass.
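A minimal sketch of the API contract described above. The real SentencePieceTokenizer internals are not shown in this PR, so the class below is a purely hypothetical character-level stand-in; only the method surface (encode/decode/load plus the _check_loaded guard) mirrors the change list.

```python
# Illustrative stand-in only -- not the actual SentencePieceTokenizer.
import json
from pathlib import Path


class ToyTokenizer:
    """Character-level sketch mirroring the encode/decode/load surface."""

    def __init__(self):
        self.vocab = None  # populated by train() or load()
        self.inverse = None

    def _check_loaded(self):
        # Guard described in the PR: encode/decode refuse to run
        # before the tokenizer has been trained or loaded.
        if self.vocab is None:
            raise RuntimeError("Tokenizer is not trained or loaded")

    def train(self, text):
        # Toy "training": one id per distinct character.
        self.vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
        self.inverse = {i: ch for ch, i in self.vocab.items()}

    def encode(self, text):
        self._check_loaded()
        return [self.vocab[ch] for ch in text]

    def decode(self, ids):
        self._check_loaded()
        return "".join(self.inverse[i] for i in ids)

    def save(self, tokenizer_dir):
        Path(tokenizer_dir).mkdir(parents=True, exist_ok=True)
        (Path(tokenizer_dir) / "vocab.json").write_text(json.dumps(self.vocab))

    def load(self, tokenizer_dir):
        # Mirrors load(tokenizer_dir) reloading from serialized vocab files.
        self.vocab = json.loads(
            (Path(tokenizer_dir) / "vocab.json").read_text()
        )
        self.inverse = {i: ch for ch, i in self.vocab.items()}


tok = ToyTokenizer()
tok.train("hello world")
ids = tok.encode("hello")
roundtrip = tok.decode(ids)  # round-trip restores the input text
```

The point of the guard is that an untrained instance fails fast with a clear error instead of raising an opaque KeyError or AttributeError deep inside encode/decode.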
Scope
Self-contained change. Touches only sentencepiece_tokenizer.py and tests/test_sentencepiece.py. No overlap with any open PR.
Related