Text_Analytics-course_project__2023

The aim of the project was to select a linguistic dataset and:

- perform analyses on the linguistic data;
- train classifiers (from traditional models such as Support Vector Classifier to more complex LLMs such as BertForSequenceClassification) to perform single-label multiclass classification.

The task was inspired by the SemEval2023 shared task "Detecting the genre, the framing, and the persuasion techniques in online news in a multi-lingual setup" (https://propaganda.math.unipd.it/semeval2023task3/). The dataset was built from that task's data in order to classify the types of persuasive communication present in the texts, 7 types in total.

The folder is organized as follows:

- The generazione folder contains the notebook used to create the final dataset for the task.
- Three Jupyter notebooks, covering:
  - text exploration with the NLTK library, including tasks such as named entity recognition and sentiment analysis;
  - several non-LLM classification algorithms, combined with different feature extraction methods that build vector representations of the texts;
  - the BertForSequenceClassification (LLM) classification model.
- A PDF file containing the project report and the presentation of the results.
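
The feature extraction step mentioned in the list above, turning texts into vectors before a non-LLM classifier sees them, can be illustrated with a small TF-IDF sketch. This is not the code from the notebooks: this README does not say which feature extraction methods they use, so TF-IDF is an assumed example, written in plain Python for clarity. The function names `tokenize` and `tfidf_vectors` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Return (vocabulary, vectors), with one TF-IDF vector per input text."""
    docs = [tokenize(t) for t in texts]
    vocabulary = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency, so every weight stays positive.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times IDF; tokens absent from the text get 0.
        vectors.append([counts[tok] * idf[tok] for tok in vocabulary])
    return vocabulary, vectors


# Hypothetical toy sentences, not taken from the project dataset.
texts = [
    "they always lie to you",
    "experts agree the plan works",
    "they lie and the experts agree",
]
vocabulary, vectors = tfidf_vectors(texts)
```

Each text becomes a fixed-length numeric vector aligned with `vocabulary`. Words that occur in fewer texts receive a higher weight, and vectors of this kind are what a traditional classifier such as a Support Vector Classifier takes as input.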

The project was carried out in collaboration with Giulio Canapa, Diego Borsetto and Sara Quattrone.

Repository contents:

- generazione (folder)
- Group4TXA_BERTSequenceClassifier.ipynb
- Group4TXA_Classification.ipynb
- Group4TXA_Text_Exploration.ipynb
- README.md
- TextAnalyticsReportGroup4.pdf
