This script calculates and compares the per-token perplexity of input text across multiple language models, and generates visual plots to assess the triviality of the text. Triviality is computed as the ratio of log perplexities from two models—typically a smaller and a larger model—on the same text.
The workflow involves:
- Loading pre-trained language models.
- Processing text files to compute per-token perplexity.
- Calculating summary statistics (mean, standard error).
- Generating comparison plots between the models' perplexities.
- Saving results in pickle files.
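The two core quantities above can be sketched in a few lines. This is a minimal illustration, not the script's actual implementation: it assumes perplexity is the exponential of the mean negative token log-probability (from index `k` onward), and the Triviality Score is the ratio of the two models' log perplexities, as described above.

```python
import math

def perplexity(token_logprobs, k=0):
    """Per-token perplexity: exp of the mean negative log-probability,
    computed over tokens from index k onward."""
    tail = token_logprobs[k:]
    return math.exp(-sum(tail) / len(tail))

def triviality(ppl_small, ppl_large):
    """Triviality Score: ratio of the two models' log perplexities
    on the same text (smaller model over larger model)."""
    return math.log(ppl_small) / math.log(ppl_large)
```

For example, a model that assigns every token probability 0.25 has perplexity exactly 4, and a text on which the smaller model scores perplexity 4 while the larger scores 2 gets a triviality of 2.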
Install the following dependencies (`argparse` ships with the Python standard library and does not need to be installed separately):

`pip install torch==2.2.1 transformers==4.44.2 matplotlib seaborn scipy numpy`

The script accepts several input arguments that can be passed through the command line:
- `--k`: Start calculating perplexity from the k-th token (default: `100`).
- `--max_words`: The maximum number of words to process per file (default: `1000`).
- `--cache_dir`: Directory where the models are cached (default: `~/`).
- `--plot_dir`: Directory where generated plots will be saved (default: `plots`).
- `--pickle_dir`: Directory where pickle files containing results will be saved (default: `pickles`).
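The argument list above maps directly onto an `argparse` parser. This is a sketch with the names and defaults taken from the README; the `build_parser` helper and the description string are assumptions, not the script's actual code.

```python
import argparse

def build_parser():
    """Build a CLI parser matching the arguments listed above."""
    p = argparse.ArgumentParser(
        description="Compare per-token perplexity of a text across language models.")
    p.add_argument("--k", type=int, default=100,
                   help="start calculating perplexity from the k-th token")
    p.add_argument("--max_words", type=int, default=1000,
                   help="maximum number of words to process per file")
    p.add_argument("--cache_dir", default="~/",
                   help="directory where the models are cached")
    p.add_argument("--plot_dir", default="plots",
                   help="directory where generated plots are saved")
    p.add_argument("--pickle_dir", default="pickles",
                   help="directory where result pickles are saved")
    return p

# Example: override one flag, keep the rest at their defaults.
args = build_parser().parse_args(["--k", "50"])
```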
- Text files are expected in a directory named `text_chunks`.
- The script processes each file in this directory, calculates the perplexity for each model, and generates corresponding plots.
`python triviality_analysis.py --k 100 --max_words 1000 --cache_dir "./model_cache" --plot_dir "./plots" --pickle_dir "./pickles"`

- Plots are saved in the `plots` directory, showing the comparison of perplexities between the two models and the computed Triviality Score.
- Results are also saved as pickle files in the `pickles` directory.
This project is licensed under the MIT License.
Aharon Azulay
For further details, feel free to explore the code and modify it based on your specific needs.