Code, data, and trained models for the r/italy COVID-19 Usage Change Corpus. The corpus was built by scraping the text of submissions to the Italian subreddit (r/italy) posted between Jan 30 and Nov 30 in both 2019 and 2020. Scraping was done with praw and psaw.
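
As a rough illustration of the collection window described above, the sketch below computes the Jan 30 to Nov 30 range for a given year and passes it to psaw. It is not the repository's actual scraping script: the function names, the choice of UTC, the exclusive end bound at midnight on Nov 30, and the list of returned fields are all assumptions made for the example.

```python
from datetime import datetime, timezone

SUBREDDIT = "italy"  # r/italy


def year_window(year):
    """Return (start, end) Unix timestamps for Jan 30 to Nov 30 of `year`.

    Assumption: boundaries are taken at 00:00 UTC; the original scripts
    may have used a different timezone or an inclusive end date.
    """
    start = datetime(year, 1, 30, tzinfo=timezone.utc)
    end = datetime(year, 11, 30, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def fetch_submissions(year, limit=None):
    """Yield r/italy submissions created inside the window for `year`.

    Requires the psaw package and network access to Pushshift.
    The `filter` fields are an illustrative choice, not the corpus schema.
    """
    from psaw import PushshiftAPI  # imported lazily so year_window works without psaw

    api = PushshiftAPI()
    after, before = year_window(year)
    return api.search_submissions(
        subreddit=SUBREDDIT,
        after=after,
        before=before,
        filter=["id", "title", "selftext", "created_utc"],
        limit=limit,
    )
```

Calling `fetch_submissions(2019)` and `fetch_submissions(2020)` would then give the two halves of a comparable corpus, one per year.
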
The data were lemmatized and preprocessed with Stanza, then analyzed with the method of Gonen et al. 2020 to detect short-term usage change in Italian between 2019 and 2020.
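
The lemmatization step could look roughly like the sketch below, which runs an Italian Stanza pipeline and collects one lemma per word. The helper names, the processor list, and the decision to drop punctuation are assumptions for illustration; the preprocessing actually used for the corpus may differ.

```python
def build_pipeline():
    """Create an Italian Stanza pipeline that produces lemmas.

    Assumes the Italian models were fetched beforehand with
    stanza.download("it").
    """
    import stanza  # imported lazily so lemmatize() can be used with any pipeline

    return stanza.Pipeline(lang="it", processors="tokenize,mwt,pos,lemma")


def lemmatize(text, nlp):
    """Return the lemmas of `text` in order, using the pipeline `nlp`.

    Punctuation tokens and words without a lemma are skipped
    (an assumption made for this sketch).
    """
    doc = nlp(text)
    return [
        word.lemma
        for sentence in doc.sentences
        for word in sentence.words
        if word.lemma and word.upos != "PUNCT"
    ]
```

A typical call would be `lemmatize("Il virus si diffonde.", build_pipeline())`; building the pipeline once and reusing it across submissions avoids reloading the models.
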
Raw and preprocessed data can be downloaded here
The output of the usage change detection algorithm by Gonen et al. 2020, run on the lemmatized corpora, is saved in the file "detect_2019_2020_.txt".
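
For readers unfamiliar with the method, its core idea is to compare a word's nearest neighbors in two embedding spaces trained separately on each corpus: the smaller the overlap, the stronger the evidence of usage change. The sketch below is a simplified pure-Python rendering of that idea, not the authors' implementation and not the code that produced the file above. The function names and the dict-of-lists input format are assumptions, and k=1000 is used as the default because that is the neighborhood size I recall from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_neighbors(word, vectors, k):
    """Return the set of the k words most similar to `word` in `vectors`."""
    sims = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    return {other for _, other in sims[:k]}


def usage_change_ranking(space_a, space_b, k=1000):
    """Rank shared words by neighbor overlap, most changed first.

    `space_a` and `space_b` map words to vectors (e.g. 2019 and 2020).
    Only words present in both spaces are scored or used as neighbors.
    """
    shared = set(space_a) & set(space_b)
    a = {w: space_a[w] for w in shared}
    b = {w: space_b[w] for w in shared}
    scores = {
        w: len(top_k_neighbors(w, a, k) & top_k_neighbors(w, b, k))
        for w in shared
    }
    return sorted(scores.items(), key=lambda pair: (pair[1], pair[0]))
```

Because only neighbor sets are compared, the two spaces never need to be aligned with each other. This brute-force version is quadratic in vocabulary size and is only meant for small experiments.
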
The data were visualized with the Embedding Projector. The files are available in the models directory. To visualize the data, load tensors_[year].tsv and tensors_[year]_meta.tsv in the projector. You can then run one of the dimensionality reduction algorithms provided by the tool, or load tensors_[year]_bookmark.txt to restore the saved, already labeled projection (t-SNE, 10000 iterations).
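
The same files can also be inspected outside the projector. The sketch below pairs each row of a vectors file with the matching row of a metadata file. It assumes the usual Embedding Projector layout (tab-separated floats, one vector per line, no header) and a single-column metadata file with no header row; if the metadata here has several columns, the first line would be a header and would need to be skipped. The function name is hypothetical.

```python
def read_projector_tsv(vector_lines, meta_lines):
    """Pair each label with its vector.

    Both arguments are iterables of text lines, such as open file objects.
    Returns a list of (label, vector) tuples in file order.
    """
    vectors = [
        [float(value) for value in line.rstrip("\n").split("\t")]
        for line in vector_lines
        if line.strip()
    ]
    labels = [line.rstrip("\n") for line in meta_lines if line.strip()]
    if len(vectors) != len(labels):
        raise ValueError(
            f"{len(vectors)} vectors but {len(labels)} labels; "
            "the metadata file may contain a header row"
        )
    return list(zip(labels, vectors))
```

With the files in place, usage would be along the lines of opening tensors_[year].tsv and tensors_[year]_meta.tsv with `open(..., encoding="utf-8")` and passing both file objects to `read_projector_tsv`.
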