mainlp/dialects-text-vs-speech

Standard-to-Dialect transfer trends differ across text and speech: A case study on intent and topic classification in German dialects

This repository contains code and detailed results for

Verena Blaschke, Miriam Winkler, Barbara Plank. Standard-to-Dialect transfer trends differ across text and speech: A case study on intent and topic classification in German dialects. https://arxiv.org/abs/2510.07890

Please cite the paper if you use any of this data/code.

Repo structure

  • code: Check the README in this repo for details on downloading the data and executing the code.
  • data/{intents,topics}: The re-mapped data sets.
  • data/{intents,topics}/asr: The ASR transcriptions of the evaluation data, by ASR model.
  • predictions: Contains the intent classification predictions, in separate subfolders for each set-up. Each filename encodes the LM, the maximum number of epochs, the batch size, the learning rate, the seed, and the test set.
  • scores/asr: The WER/CER for the ASR models. For the Bavarian evaluation sets, the columns with "STD" in their names report the WER/CER relative to the parallel German sentence rather than the original Bavarian reference. The _detailed files show scores for each sentence.
  • scores/{intents,topics}: The intent classification scores (unaggregated and aggregated over seeds).
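
To illustrate the two kinds of WER/CER columns in scores/asr, here is a minimal sketch of the standard edit-distance definitions. It is not the scoring code used in this repository. The function names and the example sentences are invented for illustration, and the repository's own scripts may normalise text differently (e.g. casing or punctuation).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example sentences (not taken from the data sets).
bavarian_reference = "i mog des ned"
german_parallel = "ich mag das nicht"
asr_output = "ich mag des nicht"

# Score against the original Bavarian reference ...
wer_dialect = wer(bavarian_reference, asr_output)
# ... and against the parallel German sentence (the "STD" variant).
wer_std = wer(german_parallel, asr_output)
print(wer_dialect, wer_std)
```

In this invented example the ASR output is closer to the Standard German sentence than to the dialect reference, which is the kind of difference the "STD" columns make visible.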

All subfolders containing (automatic or gold-standard) transcriptions are in zip archives with the password MaiNLP so as to prevent potential inclusion in web-scraped datasets (cf. Jacovi et al., 2023). Unzip them to get the subfolders with the same name. Please also use a zip archive (or similar) if you re-distribute the transcriptions.
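
Unpacking an archive can be scripted with Python's standard library, as in the sketch below. The helper name `extract_archive` is hypothetical, and the sketch assumes the archives use legacy zip encryption: Python's `zipfile` module cannot decrypt AES-encrypted archives, in which case a tool such as 7-Zip is needed instead.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir, password="MaiNLP"):
    """Extract a (possibly password-protected) zip archive into dest_dir.

    Returns the list of member names in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        # The password is ignored for members that are not encrypted.
        zf.extractall(path=dest, pwd=password.encode("utf-8"))
        return zf.namelist()
```

For example, calling `extract_archive("asr.zip", ".")` on an archive named `asr.zip` (a placeholder name) would create the subfolder it contains in the current directory.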

Licenses + links to the datasets

Known issues

The random seeds are not set properly. Although the different runs are in fact seeded differently, the seed numbers in the prediction file names and in the de-aggregated results tables do not correspond to the actual random seeds, so they cannot be used to reproduce a run exactly.
