This script calculates and compares the per-token perplexity of input text across multiple language models, and generates visual plots to assess the triviality of the text. Triviality is computed as the ratio of log perplexities from two models—typically a smaller and a larger model—on the same text.
The workflow involves:
- Loading pre-trained language models.
- Processing text files to compute per-token perplexity.
- Calculating summary statistics (mean, standard error).
- Generating comparison plots between the models' perplexities.
- Saving results in pickle files.
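The two core quantities above can be sketched in a few lines. This is a minimal illustration, not the script's actual implementation: it assumes perplexity is the exponential of the mean negative token log-probability (from index `k` onward), and the Triviality Score is the ratio of the two models' log perplexities, as described above.

```python
import math

def perplexity(token_logprobs, k=0):
    """Per-token perplexity: exp of the mean negative log-probability,
    computed over tokens from index k onward."""
    tail = token_logprobs[k:]
    return math.exp(-sum(tail) / len(tail))

def triviality(ppl_small, ppl_large):
    """Triviality Score: ratio of the two models' log perplexities
    on the same text (smaller model over larger model)."""
    return math.log(ppl_small) / math.log(ppl_large)
```

For example, a model that assigns every token probability 0.25 has perplexity exactly 4, and a text on which the smaller model scores perplexity 4 while the larger scores 2 gets a triviality of 2.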
Install the following dependencies (`argparse` ships with the Python standard library and does not need to be installed separately):

`pip install torch==2.2.1 transformers==4.44.2 matplotlib seaborn scipy numpy`

The script accepts several input arguments that can be passed through the command line:
- `--k`: Start calculating perplexity from the k-th token (default: `100`).
- `--max_words`: The maximum number of words to process per file (default: `1000`).
- `--cache_dir`: Directory where the models are cached (default: `~/`).
- `--plot_dir`: Directory where generated plots will be saved (default: `plots`).
- `--pickle_dir`: Directory where pickle files containing results will be saved (default: `pickles`).
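The argument list above maps directly onto an `argparse` parser. This is a sketch with the names and defaults taken from the README; the `build_parser` helper and the description string are assumptions, not the script's actual code.

```python
import argparse

def build_parser():
    """Build a CLI parser matching the arguments listed above."""
    p = argparse.ArgumentParser(
        description="Compare per-token perplexity of a text across language models.")
    p.add_argument("--k", type=int, default=100,
                   help="start calculating perplexity from the k-th token")
    p.add_argument("--max_words", type=int, default=1000,
                   help="maximum number of words to process per file")
    p.add_argument("--cache_dir", default="~/",
                   help="directory where the models are cached")
    p.add_argument("--plot_dir", default="plots",
                   help="directory where generated plots are saved")
    p.add_argument("--pickle_dir", default="pickles",
                   help="directory where result pickles are saved")
    return p

# Example: override one flag, keep the rest at their defaults.
args = build_parser().parse_args(["--k", "50"])
```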
- Text files are expected in a directory named `text_chunks`.
- The script processes each file in this directory, calculates the perplexity for each model, and generates corresponding plots.
`python triviality_analysis.py --k 100 --max_words 1000 --cache_dir "./model_cache" --plot_dir "./plots" --pickle_dir "./pickles"`

- Plots are saved in the `plots` directory, showing the comparison of perplexities between the two models and the computed Triviality Score.
- Results are also saved as pickle files in the `pickles` directory.
This project is licensed under the MIT License.
Aharon Azulay
For further details, feel free to explore the code and modify it based on your specific needs.