Tokenizer refactor #10

Merged: art-test-stack merged 24 commits into master from tokenizer on May 21, 2026

Conversation

art-test-stack (Owner) commented May 18, 2026

  • serialization
  • truncation
  • auto
  • base
  • hf

EDIT: refactor plus new options (that's why additions > deletions)

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Refactors the tokenizer module into smaller, focused submodules (base, serialization, truncation, auto, hf) and adopts deterministic msgpack-based persistence with SHA-256 fingerprints. Also introduces a TokenizerTrainingParams container and orchestration helpers used by model/auto.py.

Changes:

  • Split tokenizer.py into base.py, serialization.py, truncation.py, hf.py, and auto.py modules.
  • Replace pickle-based vocab persistence with deterministic msgpack + fingerprint and a JSON descriptor; keep legacy pickle read path.
  • Add TokenizerTrainingParams and reroute orchestration in model/auto.py through tokenizer.auto.
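
The deterministic persistence described in the changes above could look roughly like the sketch below. It is an illustration only, not the code from this pull request: `dump_vocab` and `load_vocab` are hypothetical names, the vocab is assumed to be a plain `dict[str, int]`, and the sketch falls back to compact JSON as a stand-in when msgpack is not installed.

```python
from __future__ import annotations

import hashlib
import json

try:
    import msgpack  # third-party; this PR adds msgpack>=1.1.2 to pyproject.toml
except ImportError:  # stand-in so the sketch still runs without msgpack
    msgpack = None


def _pack(obj) -> bytes:
    """Serialize obj to bytes (msgpack if available, else compact JSON)."""
    if msgpack is not None:
        return msgpack.packb(obj, use_bin_type=True)
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def _unpack(payload: bytes):
    """Inverse of _pack."""
    if msgpack is not None:
        return msgpack.unpackb(payload, raw=False)
    return json.loads(payload.decode("utf-8"))


def dump_vocab(vocab: dict[str, int]) -> tuple[bytes, str]:
    """Hypothetical helper: return (payload, SHA-256 hex fingerprint)."""
    # Sort the entries so dict insertion order cannot change the bytes.
    canonical = [[token, idx] for token, idx in sorted(vocab.items())]
    payload = _pack(canonical)
    return payload, hashlib.sha256(payload).hexdigest()


def load_vocab(payload: bytes, expected_fingerprint: str) -> dict[str, int]:
    """Hypothetical helper: verify the fingerprint, then rebuild the vocab."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_fingerprint:
        raise ValueError("vocab fingerprint mismatch")
    return {token: idx for token, idx in _unpack(payload)}
```

Because the entries are sorted before packing, two vocabularies with the same content produce the same bytes and therefore the same fingerprint, regardless of the order in which tokens were inserted.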

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| tests/test_tokenizer.py | Adds tests for serialization, truncation, hf wrapper, and auto helpers. |
| src/gpt_lab/utils/schemas.py | Adds TokenizerTrainingParams and syncs legacy fields into it. |
| src/gpt_lab/utils/logging.py | Makes log_all raise RuntimeError for ERROR-and-above levels. |
| src/gpt_lab/tokenizer/truncation.py | New module with parse_truncated_name and truncated_from_pretrained. |
| src/gpt_lab/tokenizer/tokenizer.py | Slimmed: delegates HF/truncation to new modules; adds msgpack save/load. |
| src/gpt_lab/tokenizer/serialization.py | New deterministic msgpack save/load + validation utilities. |
| src/gpt_lab/tokenizer/hf.py | New HF wrapper and training function with optional tokenizers import. |
| src/gpt_lab/tokenizer/corpus.py | Inlines load_datasets, comments out zstd import. |
| src/gpt_lab/tokenizer/bpe.py | Uses training_params.max_chars when available. |
| src/gpt_lab/tokenizer/base.py | New _BaseTokenizer extracted from tokenizer.py. |
| src/gpt_lab/tokenizer/auto.py | New orchestration helpers for vocab sizing and tokenizer build/load. |
| src/gpt_lab/model/auto.py | Delegates tokenizer build/resolve/compute to tokenizer.auto. |
| pyproject.toml | Adds msgpack>=1.1.2 dependency. |
| README.md | Adds commented-out documentation describing the new tokenizer layout. |
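
The "optional tokenizers import" noted for hf.py is often handled with a guarded import that fails only when the HF-backed path is actually used. The sketch below shows that general pattern under stated assumptions; `optional_import` and `require` are hypothetical names and are not taken from this pull request.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require(name: str, feature: str) -> ModuleType:
    """Import an optional dependency or raise an error naming the feature."""
    module = optional_import(name)
    if module is None:
        raise ImportError(
            f"{feature} requires the optional dependency '{name}'; "
            f"install it with `pip install {name}`."
        )
    return module
```

A wrapper could call something like `require("tokenizers", "HF tokenizer training")` inside the training function, so that importing the package does not fail for users who only need the native tokenizer.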

Comment thread tests/test_tokenizer.py Outdated
Comment thread src/gpt_lab/utils/logging.py
Comment thread src/gpt_lab/tokenizer/truncation.py
Comment thread src/gpt_lab/tokenizer/truncation.py Outdated
Comment thread src/gpt_lab/tokenizer/tokenizer.py
Comment thread README.md Outdated
Comment thread src/gpt_lab/tokenizer/base.py Outdated
Comment thread src/gpt_lab/tokenizer/tokenizer.py Outdated
Comment thread src/gpt_lab/tokenizer/serialization.py
Comment thread src/gpt_lab/model/auto.py Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 25 comments.

Comment thread src/gpt_lab/tokenizer/tokenizer.py
Comment thread src/gpt_lab/tokenizer/tokenizer.py
Comment thread src/gpt_lab/tokenizer/tokenizer.py
Comment thread src/gpt_lab/tokenizer/auto.py Outdated
Comment thread src/gpt_lab/tokenizer/auto.py Outdated
Comment thread src/gpt_lab/tokenizer/truncation.py
Comment thread src/gpt_lab/tokenizer/corpus.py
Comment thread scripts/scaling_tokenizer.py Outdated
Comment thread README.md Outdated
Comment thread src/gpt_lab/tokenizer/tokenizer.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 19 changed files in this pull request and generated 11 comments.

Comment thread src/gpt_lab/utils/schemas.py Outdated
Comment thread src/gpt_lab/tokenizer/tokenizer.py Outdated
Comment thread src/gpt_lab/tokenizer/corpus.py
Comment thread src/gpt_lab/utils/logging.py
Comment thread src/gpt_lab/tokenizer/hf.py Outdated
Comment thread scripts/benchmark/tokenizer_corpus_size.py
Comment thread scripts/benchmark/tokenizer_corpus_size.py
Comment thread docs/tokenizer_scaling.md Outdated
Comment thread docs/tokenizer_scaling.md Outdated
Comment thread docs/tokenizer_scaling.md Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 19 changed files in this pull request and generated 8 comments.

Comment thread src/gpt_lab/utils/logging.py Outdated
Comment thread src/gpt_lab/model/auto.py Outdated
Comment thread src/gpt_lab/tokenizer/tokenizer.py Outdated
Comment thread src/gpt_lab/tokenizer/hf.py
Comment thread src/gpt_lab/tokenizer/corpus.py Outdated
Comment thread src/gpt_lab/tokenizer/corpus.py
Comment thread src/gpt_lab/tokenizer/bpe.py
Comment thread README.md
art-test-stack merged commit c7cf4e2 into master on May 21, 2026
2 checks passed

2 participants