Feature/complete sentencepiece tokenizer by Arpitsh7 · Pull Request #65 · AOSSIE-Org/OpenVerifiableLLM

Arpitsh7 · 2026-03-11T06:05:53Z

Summary

Completes the SentencePieceTokenizer implementation introduced in #17 by adding encode(), decode(), and load() — making SentencePieceTokenizer fully usable in the pipeline.

Background

PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the implementations. However, the current SentencePieceTokenizer only supports training and lacks encode(), decode(), and load().

Since these methods are absent, the tokenizer cannot process text despite being trained, which prevents it from being used in downstream pipeline tasks.

Changes

`openverifiablellm/tokenizer/sentencepiece_tokenizer.py`

Implemented encode(text) -> list[int]
Implemented decode(ids) -> str
Implemented load(tokenizer_dir) — reloads from vocab.json + merges.txt
Added text_file.is_file() validation in train()
Added _check_loaded() guard for encode/decode
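
For readers unfamiliar with the guard pattern listed above, here is a minimal sketch of how a `_check_loaded()` check and an `is_file()` validation can fit together. It is not the code from this PR: `ToyTokenizer` is a hypothetical whitespace-based stand-in that uses only the standard library and writes only a `vocab.json`, so the real SentencePiece training and artifact handling are assumed away.

```python
from __future__ import annotations

import json
from pathlib import Path


class ToyTokenizer:
    """Hypothetical stand-in, NOT the PR's SentencePieceTokenizer."""

    def __init__(self) -> None:
        # Stays None until train() or load() succeeds.
        self._vocab: dict[str, int] | None = None

    def _check_loaded(self) -> None:
        # Guard: fail loudly instead of encoding with a missing model.
        if self._vocab is None:
            raise RuntimeError("Tokenizer not loaded; call train() or load() first.")

    def train(self, text_file: Path, save_path: Path) -> None:
        text_file = Path(text_file)
        if not text_file.is_file():  # reject missing paths and directories
            raise FileNotFoundError(f"Training file not found: {text_file}")
        words = sorted(set(text_file.read_text(encoding="utf-8").split()) - {"<unk>"})
        self._vocab = {"<unk>": 0, **{w: i + 1 for i, w in enumerate(words)}}
        save_path = Path(save_path)
        save_path.mkdir(parents=True, exist_ok=True)
        (save_path / "vocab.json").write_text(json.dumps(self._vocab), encoding="utf-8")

    def load(self, tokenizer_dir: Path) -> None:
        vocab_path = Path(tokenizer_dir) / "vocab.json"
        if not vocab_path.is_file():
            raise FileNotFoundError(f"Missing tokenizer artifact: {vocab_path}")
        self._vocab = json.loads(vocab_path.read_text(encoding="utf-8"))

    def encode(self, text: str) -> list[int]:
        self._check_loaded()
        return [self._vocab.get(word, 0) for word in text.split()]

    def decode(self, ids: list[int]) -> str:
        self._check_loaded()
        inverse = {i: w for w, i in self._vocab.items()}
        return " ".join(inverse.get(i, "<unk>") for i in ids)
```

With this shape, calling `encode()` on a fresh instance raises `RuntimeError`, while an instance restored through `load()` produces the same ids as the one that was trained.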

`tests/test_sentencepiece.py` ← new file

18 tests covering training, encode/decode roundtrip, load, artifact paths, special tokens, determinism, and constructor validation
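
As an illustration of what roundtrip and determinism tests in this category can look like, here is a small pytest-style sketch. It does not reproduce the tests in `tests/test_sentencepiece.py`: `CharCodec` is a hypothetical character-level stand-in, used so the example runs without a trained SentencePiece model.

```python
from __future__ import annotations


class CharCodec:
    """Hypothetical stand-in tokenizer: one id per Unicode code point."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)


def test_roundtrip_preserves_text() -> None:
    # decode(encode(x)) should give back x unchanged.
    tok = CharCodec()
    text = "Verifiable LLMs need reproducible tokenization."
    assert tok.decode(tok.encode(text)) == text


def test_encode_is_deterministic() -> None:
    # Two independent instances must agree on the ids for the same input.
    text = "same input, same ids"
    assert CharCodec().encode(text) == CharCodec().encode(text)


def test_encode_returns_list_of_ints() -> None:
    ids = CharCodec().encode("abc")
    assert isinstance(ids, list)
    assert all(isinstance(i, int) for i in ids)
```

pytest collects these functions by their `test_` prefix, so no pytest import is needed in the file itself.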

Testing

pytest tests/test_sentencepiece.py -v

All 18 tests pass.

Scope

Self-contained change. Touches only sentencepiece_tokenizer.py and tests/test_sentencepiece.py. No overlap with any open PR.

Rate limit exceeded

@Arpitsh7 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 13 minutes and 30 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4671d880-3136-476c-9df1-8a6e3f5a7a8a

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and 0b54e19.

📒 Files selected for processing (4)

.gitignore
openverifiablellm/tokenizer/sentencepiece_tokenizer.py
openverifiablellm/verify.py
tests/test_sentencepiece.py


Arpitsh7 · 2026-03-11T06:09:02Z

@coderabbitai review

coderabbitai · 2026-03-11T06:09:12Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Arpitsh7 added 2 commits March 11, 2026 03:12

feature:complete sentence_piecetokenizer

c096dbb

feature:complete sentencepiece_tokenizer

e4f567d

github-actions bot added the backend, ci-cd, configuration, documentation, enhancement, first-time-contributor, github-actions, no-issue-linked, pending-coderabbit-review, python, size/L, and external-contributor labels and removed the size/L label on Mar 11, 2026

fix: ruff lint and formatting for sentencepiece tokenizer

0b54e19

github-actions bot added size/L and removed size/L labels Mar 11, 2026

Arpitsh7 wants to merge 3 commits into AOSSIE-Org:main from Arpitsh7:feature/complete-sentencepiece_tokenizer
