This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Improve Tokenizer New Type Onboarding #1536

Description

@zhenyan-zhang-meta

🚀 The feature, motivation and pitch


As a follow-up to #1518, which added an enum for tokenizer types to simplify TokenizerArgs.__post_init__, we need to go further and simplify onboarding of new tokenizer types:

Tasks


  • Move TokenizerType to a centralized place
  • Check all getters of tokenizer types
  • Add documentation for onboarding future tokenizer types.
    • The documentation may need to point contributors to the model validation logic, which has to be updated for each new type:
      def validate_model(
          self,
          model: Optional[Model],
          model_description: str = "model",
      ) -> None:
          if model is None:
              return

          if self.tokenizer_type == TokenizerType.NONE:
              raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

          is_tiktoken = self.is_tiktoken()
          is_sentencepiece = self.is_sentencepiece()
          is_hf_tokenizer = self.is_hf_tokenizer()

          use_tiktoken = model.config.use_tiktoken
          use_hf_tokenizer = model.config.use_hf_tokenizer
          use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

          if (
              (is_tiktoken and not use_tiktoken) or
              (is_hf_tokenizer and not use_hf_tokenizer) or
              (is_sentencepiece and not use_sentencepiece)
          ):
              raise RuntimeError(
                  "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                      tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                      tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                      model_description,
                  )
              )

          return
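One possible direction for the tasks above, sketched under stated assumptions: if the enum lives in one shared module, the three boolean getters and the three-way mismatch check can collapse into a single enum comparison, so a new tokenizer type touches one mapping function instead of every call site. Only `TokenizerType.NONE` appears in this issue; the other member names are inferred from the getters `is_tiktoken`, `is_sentencepiece`, and `is_hf_tokenizer`. `ModelConfig`, `expected_tokenizer_type`, and `validate_tokenizer` are hypothetical names used for illustration, not existing torchchat APIs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TokenizerType(Enum):
    # NONE is taken from the issue; the other members are assumed
    # from the getter names and the values are illustrative.
    NONE = 0
    TIKTOKEN = 1
    SENTENCEPIECE = 2
    HF_TOKENIZER = 3


@dataclass
class ModelConfig:
    # Hypothetical stand-in for model.config, holding only the two
    # flags that the validation logic reads.
    use_tiktoken: bool = False
    use_hf_tokenizer: bool = False


def expected_tokenizer_type(config: ModelConfig) -> TokenizerType:
    """Translate the model's boolean flags into one enum value.

    Assumes at most one flag is set.
    """
    if config.use_tiktoken:
        return TokenizerType.TIKTOKEN
    if config.use_hf_tokenizer:
        return TokenizerType.HF_TOKENIZER
    # Same fallback as the existing check: no flag means SentencePiece.
    return TokenizerType.SENTENCEPIECE


def validate_tokenizer(
    actual: TokenizerType,
    config: Optional[ModelConfig],
    tokenizer_path: str = "<unknown>",
    model_description: str = "model",
) -> None:
    """Raise RuntimeError if the loaded tokenizer does not match the model."""
    if config is None:
        return
    if actual == TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer was found at {tokenizer_path}")
    expected = expected_tokenizer_type(config)
    if actual != expected:
        raise RuntimeError(
            "model-specified tokenizer ({}) does not match provided "
            "tokenizer ({}) for {}".format(
                expected.name, actual.name, model_description
            )
        )
```

With this shape, onboarding a new type would mean adding one enum member and one branch in `expected_tokenizer_type`. Note that it differs from the current check when both config flags are set at once, which the sketch assumes never happens.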

To test, run a model with each tokenizer type:

  • python torchchat.py generate llama2
  • python torchchat.py generate llama3
  • python torchchat.py generate granite-code
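The three commands above could be driven from a small script so that a newly onboarded tokenizer type only needs one more entry in a list. This is a minimal sketch: `MODELS`, `build_generate_command`, and `run_smoke_tests` are hypothetical names, and it assumes the script is run from the repository root where `torchchat.py` lives.

```python
import subprocess
import sys
from typing import List

# Model aliases taken from the commands listed in this issue.
MODELS = ["llama2", "llama3", "granite-code"]


def build_generate_command(model: str) -> List[str]:
    """Build the argv list for `python torchchat.py generate <model>`."""
    return [sys.executable, "torchchat.py", "generate", model]


def run_smoke_tests(models: List[str], dry_run: bool = True) -> List[List[str]]:
    """Print (dry run) or execute one generate command per model."""
    commands = [build_generate_command(m) for m in models]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_smoke_tests(MODELS, dry_run=False)` would run each command in turn and stop at the first failure.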

cc @Jack-Khuu @byjlw

Labels

  • actionable: Items in the backlog waiting for an appropriate impl/fix
  • good first issue: Good for newcomers
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
