
Brianzhengca/Tokenizer-Robustness


Broken Tokens

This repository contains code to reproduce the main results of the paper "Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations". If you have any questions, please reach out to the first author at zhengbr@cs.washington.edu.

Quickstart

  1. Navigate to the root directory of this repository
  2. Create the conda environment with the following command: conda env create --name envname --file=environment.yml
  3. Navigate to the directory of each task for instructions on running its evaluations
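
The environment setup in the steps above could be wrapped in a small shell helper. This is only a sketch: the function name `setup_env` and the `ENV_NAME` variable are hypothetical and not part of the repository, and the pre-flight checks are an added convenience rather than something the authors prescribe. Only the `conda env create` invocation itself comes from the steps above.

```shell
# Hypothetical environment name; the Quickstart leaves it as "envname".
ENV_NAME="${ENV_NAME:-envname}"

# Hypothetical helper: create the conda environment from environment.yml.
setup_env() {
    # Stop early if conda is not installed or not on PATH.
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH" >&2
        return 1
    fi
    # environment.yml is expected in the current directory (the repository root).
    if [ ! -f environment.yml ]; then
        echo "environment.yml not found; run this from the repository root" >&2
        return 1
    fi
    # Same command as step 2 of the Quickstart.
    conda env create --name "$ENV_NAME" --file=environment.yml
}
```

After sourcing this snippet from the repository root, running `setup_env` followed by `conda activate "$ENV_NAME"` should leave you in the new environment, assuming your shell has been initialized for conda.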

Citation

If you find our work helpful, please cite the paper:

@inproceedings{zheng-etal-2025-broken,
    title={Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations},
    author={Brian Siyuan Zheng and Alisa Liu and Orevaoghene Ahia and Jonathan Hayase and Yejin Choi and Noah A. Smith},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    url={https://openreview.net/forum?id=WrYWolqKh3}
}
