This repository contains code to reproduce the main results of the paper "Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations".

If you have any questions, please reach out to the first author at zhengbr@cs.washington.edu.
- Navigate to this directory
- Execute the following command:
  conda env create --name envname --file=environment.yml
- Navigate to the directory for each task and follow the instructions there for running evaluations
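
A newly created conda environment usually has to be activated before the task scripts can use it. A minimal sketch, assuming `envname` is the placeholder name from the command above and that conda has been initialized for your shell:

```shell
# Activate the environment created above ("envname" is the placeholder name from that command).
conda activate envname

# Optional sanity check: the active environment is marked with an asterisk.
conda env list
```

If you chose a different name when creating the environment, substitute it in the `conda activate` call.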
If you find our work helpful, please cite our paper:
@inproceedings{zheng-etal-2025-broken,
title={Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations},
author={Brian Siyuan Zheng and Alisa Liu and Orevaoghene Ahia and Jonathan Hayase and Yejin Choi and Noah A. Smith},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=WrYWolqKh3}
}