Broken Tokens

This repository contains code to reproduce the main results of the paper "Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations". If you have any questions, please reach out to the first author at zhengbr@cs.washington.edu.

Quickstart

Navigate to this directory.
Create the conda environment, replacing envname with a name of your choice: conda env create --name envname --file=environment.yml
Navigate to each task's directory for instructions on running its evaluations.

Citation

If you find our work helpful, please cite the paper:

@inproceedings{zheng-etal-2025-broken,
    title={Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations},
    author={Brian Siyuan Zheng and Alisa Liu and Orevaoghene Ahia and Jonathan Hayase and Yejin Choi and Noah A. Smith},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    url={https://openreview.net/forum?id=WrYWolqKh3}
}


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Broken Tokens

Quickstart

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Broken Tokens

Quickstart

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages