Saba is a BERT model for Italian poetry. It was obtained via continued pretraining of `dbmdz/bert-base-italian-xxl-cased` on ~40k Italian poems from Wikisource and Biblioteca Italiana, with a Masked Language Modeling (MLM) objective. The model is available on Hugging Face, and the training, validation, and test splits are available in the `datasets/` directory of this repository.
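For instance, the trained model can be loaded with the Hugging Face `transformers` library. A minimal sketch, assuming a hypothetical model ID `mattiaferrarini/saba` (check the actual repository name on the Hub):

```python
from transformers import pipeline

# Hypothetical model ID -- replace with the actual repository name on Hugging Face.
fill = pipeline("fill-mask", model="mattiaferrarini/saba")

# Fill a masked token in a line of Italian verse (from Saba's poem "Amai").
for pred in fill("Amai trite parole che non uno [MASK]."):
    print(f'{pred["token_str"]!r}: {pred["score"]:.3f}')
```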
We evaluate Saba, the base model, and Alberti (a prominent multilingual model for poetry) on the following tasks.
The first evaluation measures the model's statistical fit to Italian poetry by computing the cross-entropy loss and pseudo-perplexity (PPPL) on test poems with 15% of tokens masked (see the sketch after the table).
| Model | Loss | PPPL |
|---|---|---|
| Base Model | 3.43 | 30.76 |
| Alberti | 5.41 | 223.86 |
| Saba | 1.90 | 6.68 |
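A minimal sketch of this computation with `transformers`, assuming the hypothetical model ID from above and a toy in-memory test set (the reported numbers are computed over the full test split):

```python
import math
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

MODEL_ID = "mattiaferrarini/saba"  # hypothetical -- see the model page
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID).eval()

# Toy in-memory "test set"; the real evaluation uses the test split.
poems = ["Amai trite parole che non uno osava.",
         "M'incantò la rima fiore / amore."]

# Mask 15% of the tokens, as in the evaluation setup.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([tokenizer(p, truncation=True) for p in poems])

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])

# PPPL is the exponential of the average cross-entropy over masked positions.
loss = out.loss.item()
print(f"cross-entropy: {loss:.2f}  PPPL: {math.exp(loss):.2f}")
```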
The second evaluation measures the model's ability to capture phonetic information: a multinomial logistic regression classifier is trained on word embeddings to predict each word's rhyme, defined as its last three characters. We restrict the evaluation to the 50 most frequent rhymes (see the sketch after the table).
| Model | Accuracy (%) | Macro F1 (%) |
|---|---|---|
| Base Model | 55.24 | 50.05 |
| Alberti | 67.07 | 63.36 |
| Saba | 71.26 | 69.12 |
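A minimal sketch of the probe with scikit-learn, again assuming the hypothetical model ID; mean-pooling over subword tokens is our assumption, not necessarily the setup behind the reported numbers:

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "mattiaferrarini/saba"  # hypothetical -- see the model page
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

# Toy word list; the rhyme label is the last three characters of each word.
words = ["amore", "cuore", "fiore", "vita", "partita", "dita"]
labels = [w[-3:] for w in words]

def embed(word):
    # Mean-pool the last hidden state over the word's subword tokens
    # (pooling strategy is an assumption).
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, 1:-1].mean(dim=0).numpy()  # drop [CLS] and [SEP]

X = [embed(w) for w in words]
clf = LogisticRegression(max_iter=1000).fit(X, labels)  # multinomial for multiclass
preds = clf.predict(X)
print(accuracy_score(labels, preds), f1_score(labels, preds, average="macro"))
```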
The third evaluation measures stylistic understanding: a multinomial logistic regressor is fitted on the [CLS] token embeddings to predict the author of a poem (see the sketch after the table).
| Model | Accuracy (%) | F1 Score (%) |
|---|---|---|
| Base Model | 61.63 | 57.04 |
| Alberti | 64.34 | 60.33 |
| Saba | 70.93 | 65.24 |
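A minimal sketch of the author probe under the same assumptions, with toy labeled data standing in for the real dataset:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "mattiaferrarini/saba"  # hypothetical -- see the model page
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

# Toy labeled data; the real probe is fitted on the full dataset.
poems = ["Amai trite parole che non uno osava.",
         "Sempre caro mi fu quest'ermo colle."]
authors = ["Saba", "Leopardi"]

def cls_embedding(text):
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, 0].numpy()  # [CLS] is the first token

X = [cls_embedding(p) for p in poems]
clf = LogisticRegression(max_iter=1000).fit(X, authors)
print(clf.predict(X))
```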
The fourth evaluation assesses the model's generative abilities through masked language modeling: the masked final word of a poem is predicted and scored at both the token level and the word level (see the sketch after the table).
| Model | Rhyming poems (Token %) | Rhyming poems (Word %) | Non-rhyming poems (Token %) | Non-rhyming poems (Word %) | Overall (Token %) | Overall (Word %) |
|---|---|---|---|---|---|---|
| Base Model | 6.11 | 5.43 | 19.35 | 17.95 | 6.51 | 7.24 |
| Alberti | 2.40 | 1.46 | 0.78 | 0.00 | 1.34 | 2.27 |
| Saba | 21.72 | 20.29 | 33.06 | 30.77 | 22.69 | 21.19 |
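A minimal sketch of the word-level check, assuming a single [MASK] for the final word (the full evaluation also scores token-level accuracy when the gold word spans several subword tokens):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "mattiaferrarini/saba"  # hypothetical -- see the model page
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID).eval()

# Toy fragment (from Saba's "Amai") with the final word masked.
poem = "m'incantò la rima fiore\n[MASK]"
gold = "amore"

inputs = tokenizer(poem, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedily fill each [MASK] position with its most likely token.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().flatten()
pred_tokens = [
    tokenizer.convert_ids_to_tokens(int(logits[0, pos].argmax())) for pos in mask_positions
]
predicted = "".join(t.removeprefix("##") for t in pred_tokens)

# Word-level accuracy: the whole reconstructed word must match the gold word.
print(predicted, predicted == gold)
```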
The final evaluation analyzes the model's internal computations: we calculate the average attention score that a line's last token assigns to the last tokens of previous lines (both rhyming and non-rhyming) to determine whether the model has developed specific mechanisms to detect and process rhymes (see the sketch after the table).
| Model | All endings (%) | Rhyming endings (%) |
|---|---|---|
| Base Model | 12.37 | 2.45 |
| Alberti | 6.55 | 0.88 |
| Saba | 18.96 | 7.07 |
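A minimal sketch of the measurement; averaging attention over all layers and heads is our assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "mattiaferrarini/saba"  # hypothetical -- see the model page
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_attentions=True).eval()

# Toy fragment (from Saba's "Amai") where "fiore" and "amore" rhyme.
poem = "m'incantò la rima fiore\nla più antica difficile del mondo\namore"
inputs = tokenizer(poem, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # per layer: (batch, heads, seq, seq)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
query_pos = len(tokens) - 2            # last word of the poem, before [SEP]
ending_pos = tokens.index("fiore")     # assumes "fiore" is a single vocabulary token

# Average over layers and heads (averaging scheme is an assumption).
scores = torch.stack(attentions).mean(dim=(0, 2))  # -> (batch, seq, seq)
print(f"attention from 'amore' to 'fiore': {scores[0, query_pos, ending_pos]:.4f}")
```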
Following the tradition of giving Italian names to BERT models for the Italian language (see AlBERTo, GilBERTo, UmBERTo), we dedicate this model to the Italian poet and novelist Umberto Saba (9 March 1883 – 25 August 1957).
- Clone the repository:
  ```bash
  git clone https://github.com/mattiaferarrini/saba.git
  ```
- Download the two datasets used for training (the poems from Wikisource and Biblioteca Italiana).
- Unzip the `biblitaliana.zip` file and clean it:
  ```bash
  python bibl_cleaner.py
  ```
- Combine the two datasets into a single one:
  ```bash
  python combiner.py italian_poems.json
  ```
- Split this dataset into train, validation, and test sets:
  ```bash
  python splitter.py italian_poems.json
  ```
- Run the code in the notebook `saba.ipynb` to train and evaluate the model step by step. The notebook is intended to be run on Google Colab; modifying the first few cells is enough to run it locally.
- You can use the code in `evaluations/` to evaluate the already trained model available on Hugging Face.