Tokenizer refactor by art-test-stack · Pull Request #10 · art-test-stack/gpt-lab

art-test-stack · 2026-05-18T08:42:38Z

serialization
truncation
auto
base
hf

EDIT: refactor + added options (that's why additions > deletions)

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Refactors the tokenizer module into smaller, focused submodules (base, serialization, truncation, auto, hf) and adopts deterministic msgpack-based persistence with SHA-256 fingerprints. Also introduces a TokenizerTrainingParams container and orchestration helpers used by model/auto.py.

Changes:

Split tokenizer.py into base.py, serialization.py, truncation.py, hf.py, and auto.py modules.
Replace pickle-based vocab persistence with deterministic msgpack + fingerprint and a JSON descriptor; keep legacy pickle read path.
Add TokenizerTrainingParams and reroute orchestration in model/auto.py through tokenizer.auto.
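
The idea behind "deterministic persistence + fingerprint" can be sketched as follows. This is a minimal illustration, not the code in `serialization.py`: it swaps msgpack for canonical JSON from the standard library so that it runs without extra dependencies, and the names `serialize_vocab`, `fingerprint`, and `load_vocab` are hypothetical. The point is only that the same vocab always yields the same bytes, so a SHA-256 digest of those bytes can serve as a stable identifier and integrity check.

```python
import hashlib
import json


def serialize_vocab(vocab: dict[str, int]) -> bytes:
    """Encode a vocab to bytes that depend only on its contents.

    Assumption: canonical JSON stands in for msgpack here.
    """
    # Sorting keys removes any dependence on dict insertion order;
    # fixed separators remove whitespace variation.
    text = json.dumps(vocab, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of the serialized payload."""
    return hashlib.sha256(payload).hexdigest()


def load_vocab(payload: bytes, expected: str) -> dict[str, int]:
    """Decode a payload, refusing it if the fingerprint does not match."""
    if fingerprint(payload) != expected:
        raise ValueError("vocab fingerprint mismatch")
    return json.loads(payload.decode("utf-8"))


# Two dicts with the same entries in a different order give identical bytes.
a = serialize_vocab({"hello": 0, "world": 1})
b = serialize_vocab({"world": 1, "hello": 0})
assert a == b
assert load_vocab(a, fingerprint(b)) == {"hello": 0, "world": 1}
```

Pickle output, by contrast, is not guaranteed to be byte-stable across Python versions or object construction order, which is one reason a fingerprint over a canonical encoding is easier to reason about.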

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 18 comments.

Show a summary per file

File	Description
tests/test_tokenizer.py	Adds tests for serialization, truncation, hf wrapper, and auto helpers.
src/gpt_lab/utils/schemas.py	Adds `TokenizerTrainingParams` and syncs legacy fields into it.
src/gpt_lab/utils/logging.py	Makes `log_all` raise `RuntimeError` for ERROR-and-above levels.
src/gpt_lab/tokenizer/truncation.py	New module with `parse_truncated_name` and `truncated_from_pretrained`.
src/gpt_lab/tokenizer/tokenizer.py	Slimmed: delegates HF/truncation to new modules; adds msgpack save/load.
src/gpt_lab/tokenizer/serialization.py	New deterministic msgpack save/load + validation utilities.
src/gpt_lab/tokenizer/hf.py	New HF wrapper and training function with optional `tokenizers` import.
src/gpt_lab/tokenizer/corpus.py	Inlines `load_datasets`, comments out `zstd` import.
src/gpt_lab/tokenizer/bpe.py	Uses `training_params.max_chars` when available.
src/gpt_lab/tokenizer/base.py	New `_BaseTokenizer` extracted from tokenizer.py.
src/gpt_lab/tokenizer/auto.py	New orchestration helpers for vocab sizing and tokenizer build/load.
src/gpt_lab/model/auto.py	Delegates tokenizer build/resolve/compute to `tokenizer.auto`.
pyproject.toml	Adds `msgpack>=1.1.2` dependency.
README.md	Adds commented-out documentation describing the new tokenizer layout.
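
The "optional `tokenizers` import" noted for `hf.py` usually follows a guard pattern like the one below. This is a hedged sketch of that general pattern, not the PR's actual code: `optional_import` and `require_tokenizers` are hypothetical helper names, and the error message is illustrative.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is covered too.
        return None


# Resolved once at import time; the rest of the package keeps working without it.
_tokenizers = optional_import("tokenizers")


def require_tokenizers() -> ModuleType:
    """Fail only when an HF-backed feature is actually used."""
    if _tokenizers is None:
        raise ImportError(
            "the 'tokenizers' package is required for HF-backed tokenizers"
        )
    return _tokenizers
```

With this shape, importing the tokenizer package never fails on a missing optional dependency; the error surfaces at the call site that needs it.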

Copilot

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 25 comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 19 changed files in this pull request and generated 11 comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 19 changed files in this pull request and generated 8 comments.

art-test-stack added 14 commits May 14, 2026 17:23

tokenizer: added clamp init option for optimal tokenizer size

c361255

tokenizer: fix test

904cdf6

Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…

bed8100

…izer

Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…

d634087

…izer

Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…

2d85e62

…izer

tokenizer encoder note

a3e536b

tokenizer: clamp to truncate

b27cb51

Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…

14ad038

…izer

list special tokens

e605beb

Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…

6a65794

…izer

Merge branch 'master' of github.com:art-test-stack/gpt-lab into token…

9219505

…izer

tokenizer refactoring: serialization + truncation

2b7ac79

tokenizer: auto + hf + trainer config refactoring

5534413

tokenizer: fix corpus import + hf log error + test

3dca1a3

art-test-stack requested a review from Copilot May 18, 2026 09:26

Copilot started reviewing on behalf of art-test-stack May 18, 2026 09:26

Copilot AI reviewed May 18, 2026

art-test-stack added 2 commits May 19, 2026 14:46

tokenizer corpus: introduce byte control over char control

bde2912

tokenizer training config: integrates trainer parameters to tokenizer…

2a1f573

… config

art-test-stack requested a review from Copilot May 19, 2026 13:17

Copilot started reviewing on behalf of art-test-stack May 19, 2026 13:17

Copilot AI reviewed May 19, 2026

art-test-stack added 4 commits May 19, 2026 17:00

tokenizer: some function calls fixes

05f986e

tokenizer: fix sp token in truncated tokenizer + scaling tok header +…

8c587cd

… move to benchmark folder

tokenizer: eval with renyi and efficient entropy

d5d8f21

tokenizer: split tech report from script+impl args and improve storag…

0b7ee67

…e for reproducity

art-test-stack requested a review from Copilot May 21, 2026 14:47

Copilot started reviewing on behalf of art-test-stack May 21, 2026 14:48

Copilot AI reviewed May 21, 2026

add minimal fixes

b2a3008

art-test-stack added 2 commits May 21, 2026 22:13

fix test with wrong tokenizer.auto.build_or_load_tokenizer signature

5385b95

tokenizer: readme

6006769

art-test-stack requested a review from Copilot May 21, 2026 20:23

Copilot started reviewing on behalf of art-test-stack May 21, 2026 20:23

Copilot AI reviewed May 21, 2026

tokenizer: readme + minor fixes

083f2f2

art-test-stack merged commit c7cf4e2 into master May 21, 2026
2 checks passed

2 participants
