mothertoken

Every model has a native tongue. The question is whether yours matches.

Toolkit for comparing tokenizer efficiency across languages, model families, and user-supplied Hugging Face refs.

Note

The bundled benchmark is curated and representative, not exhaustive. Use direct Hugging Face refs when you want to compare tokenizers outside the starter set.

Installation

# Published package
pip install mothertoken

For local development:

git clone https://github.com/inimaz/mothertoken
cd mothertoken
uv sync
uv pip install -e .

CLI Usage

The mothertoken command is available after installation.

Common questions

I speak a language that is not English. Which tokenizer is most efficient for it?

mothertoken rank spanish

I have this text and a chosen model. How many tokens does it use?

mothertoken tokenize "Hola Mundo" --model gpt-4o

I am choosing between a few models. Which one tokenizes my text best?

mothertoken compare "Travesura realizada" --model gpt-oss --model Qwen/Qwen3-0.6B

I have this model. Which languages does it tokenize best, and which worst?

mothertoken benchmark run --models gpt-oss,YOUR_MODEL1,YOUR_MODEL2

Rank tokenizers for a language

Rank supported tokenizers for a specific language using the precomputed benchmark data.

mothertoken rank spanish

# Raw FLORES+ codes still work
mothertoken rank spa_Latn
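
As a rough illustration of what ranking from precomputed data involves, the Python sketch below sorts tokenizers by a per-language multiplier. It assumes the multiplier is the token count for the language divided by the token count for the parallel English text, so lower is better. The function name and the sample numbers are hypothetical; they are not mothertoken's API or real benchmark results.

```python
def rank_tokenizers(multipliers: dict[str, float]) -> list[tuple[str, float]]:
    """Sort tokenizer IDs from most to least efficient (lowest multiplier first)."""
    # Ties are broken by name so the ordering is deterministic.
    return sorted(multipliers.items(), key=lambda item: (item[1], item[0]))


# Made-up multipliers for one language, for illustration only.
sample = {"tokenizer-a": 1.4, "tokenizer-b": 1.2, "tokenizer-c": 1.2}

for position, (name, multiplier) in enumerate(rank_tokenizers(sample), start=1):
    print(f"{position}. {name}: {multiplier:.2f}x English")
```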

List tokenizers

See which tokenizer IDs can be used and which familiar models use them.

mothertoken list

Tokenize exact text

Count tokens for exact text using local tokenizers by default. Add --language to estimate the English-equivalent token count from the benchmark multiplier.

mothertoken tokenize "Hola Mundo" --language es

# Check one model
mothertoken tokenize "Hello" --model gpt-4o

# Check a Hugging Face model/tokenizer ref directly
mothertoken tokenize "Hello" --model Qwen/Qwen3-0.6B

# Estimate the English-equivalent count for a known language
mothertoken tokenize "مرحبا بالعالم" --language ar --model gpt-4o

# Compare against your own English translation
mothertoken tokenize "مرحبا بالعالم" --language ar --english-text "Hello world"

# Tokenize a file
mothertoken tokenize --file prompt.txt

# Compare translated files
mothertoken tokenize --file prompt.ar.txt --language ar --english-file prompt.en.txt
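
The arithmetic behind an English-equivalent estimate can be sketched in a few lines of Python. This assumes the benchmark multiplier means "tokens in the target language per token of parallel English text"; that definition and both function names are assumptions made for illustration, not mothertoken's documented behavior.

```python
def english_equivalent_tokens(token_count: int, multiplier: float) -> float:
    """Estimate how many tokens the same content would take in English."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    # Assumed definition: multiplier = target-language tokens / English tokens.
    return token_count / multiplier


def observed_multiplier(target_tokens: int, english_tokens: int) -> float:
    """Multiplier measured from a text and your own English translation of it."""
    if english_tokens <= 0:
        raise ValueError("english_tokens must be positive")
    return target_tokens / english_tokens


# 30 tokens at an assumed multiplier of 1.5 is about 20 English-equivalent tokens.
print(english_equivalent_tokens(30, 1.5))
# A 12-token text whose English translation takes 8 tokens has a multiplier of 1.5.
print(observed_multiplier(12, 8))
```

The second function mirrors the idea of supplying your own translation: when both token counts are known, no benchmark estimate is needed.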

Compare selected tokenizers

Mix aliases from `mothertoken list` with direct Hugging Face refs in a single comparison. This is the main workflow when you care about a specific set of models.

mothertoken compare "Travesura realizada" \
  --model gpt-4o \
  --model Qwen/Qwen3-0.6B \
  --model mistralai/Mistral-7B-v0.1

mothertoken compare --file prompt.txt \
  --model mistralai/Mistral-7B-v0.1 \
  --model deepseek-ai/DeepSeek-V4-Pro
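
Conceptually, a comparison runs the same text through each tokenizer and orders the results by token count. The Python sketch below shows that shape using stand-in tokenizers from the standard library (whitespace, characters, UTF-8 bytes) rather than real model tokenizers. All names in it are hypothetical and unrelated to mothertoken's internals.

```python
from typing import Callable

Tokenizer = Callable[[str], list]

# Stand-in tokenizers for illustration; real ones would wrap a model's tokenizer.
STAND_INS: dict[str, Tokenizer] = {
    "whitespace": lambda text: text.split(),
    "chars": lambda text: list(text),
    "utf8-bytes": lambda text: list(text.encode("utf-8")),
}


def compare_token_counts(
    text: str, tokenizers: dict[str, Tokenizer]
) -> list[tuple[str, int]]:
    """Return (name, token count) pairs, fewest tokens first."""
    counts = {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}
    # Sort by count, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


for name, count in compare_token_counts("Travesura realizada", STAND_INS):
    print(f"{name}: {count} tokens")
```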

Benchmark data

Running `mothertoken benchmark` on its own shows the benchmark help. Use `benchmark run` to create benchmark data, `benchmark use` to choose the active benchmark, and `benchmark status` to inspect which benchmark other commands will use.

When --output is omitted, benchmark run writes to the user-owned benchmark file and makes it active:

mothertoken benchmark run --languages eng_Latn,arb_Arab --models gpt-4o,Qwen/Qwen3-0.6B

Before it starts, the command prints the file it will write. The default user config locations are platform-specific:

| OS | User config directory |
| --- | --- |
| Linux / XDG | `$XDG_CONFIG_HOME/mothertoken` or `~/.config/mothertoken` |
| macOS | `~/Library/Application Support/mothertoken` |
| Windows | `%APPDATA%\mothertoken` |
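
The table can be read as a small lookup. The Python sketch below resolves the directory for a given platform. The function name and signature are hypothetical, and mothertoken's own resolution logic may differ, for example in how it handles a missing `%APPDATA%`.

```python
import ntpath
import posixpath

APP_NAME = "mothertoken"


def user_config_dir(platform: str, env: dict[str, str], home: str) -> str:
    """Resolve the user config directory following the table above.

    `platform` uses sys.platform values ("win32", "darwin", anything else
    is treated as Linux / XDG). Passing env and home in keeps this testable.
    """
    if platform == "win32":
        # Assumes APPDATA is set; raises KeyError otherwise.
        return ntpath.join(env["APPDATA"], APP_NAME)
    if platform == "darwin":
        return posixpath.join(home, "Library", "Application Support", APP_NAME)
    # Linux / XDG: prefer XDG_CONFIG_HOME, fall back to ~/.config.
    base = env.get("XDG_CONFIG_HOME") or posixpath.join(home, ".config")
    return posixpath.join(base, APP_NAME)


print(user_config_dir("linux", {}, "/home/ana"))
print(user_config_dir("darwin", {}, "/Users/ana"))
```

In real use you would pass `sys.platform`, `os.environ`, and the expanded home directory.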

To write somewhere else:

mothertoken benchmark run \
  --languages eng_Latn,arb_Arab \
  --models gpt-4o,Qwen/Qwen3-0.6B \
  --output benchmark.json

Make an existing benchmark active:

mothertoken benchmark use benchmark.json
mothertoken benchmark status

Return to the bundled default benchmark:

mothertoken benchmark use --default

Researcher Workflow

Benchmark regeneration and model-extension docs live in docs/benchmarking.md.

You can also benchmark a direct Hugging Face ref without adding it to default_tokenizers.yaml:

uv run mothertoken benchmark run --languages eng_Latn,arb_Arab --models Qwen/Qwen3-0.6B

License

MIT
