Standard-to-Dialect transfer trends differ across text and speech: A case study on intent and topic classification in German dialects

This repository contains code and detailed results for

Verena Blaschke, Miriam Winkler, Barbara Plank. Standard-to-Dialect transfer trends differ across text and speech: A case study on intent and topic classification in German dialects. https://arxiv.org/abs/2510.07890

Please cite the paper if you use any of this data/code.

Repo structure

code: Check the README in this repo for details on downloading the data and executing the code.
data/{intents,topics}: The re-mapped data sets.
data/{intents,topics}/asr: The ASR transcriptions of the evaluation data, by ASR model.
predictions: Contains the intent classification predictions, in separate subfolders for each set-up. Each filename encodes the LM, the maximum number of epochs, the batch size, the learning rate, the seed, and the test set.
scores/asr: The WER/CER for the ASR models. For the Bavarian evaluation sets, the columns with "STD" in the names calculate the WER/CER relative to the parallel German sentence rather than the original Bavarian reference. The _detailed files show scores for each sentence.
scores/{intents,topics}: The intent and topic classification scores (unaggregated and aggregated over seeds).
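
To illustrate what the two kinds of columns in scores/asr measure, here is a minimal sketch of WER and CER computed against two different references. It is not the scoring code from this repo: the function names and the example sentences are made up for illustration, and the actual scripts may normalize casing or punctuation differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example sentences (not taken from the data sets).
bavarian_reference = "i mog des ned"      # original dialect reference
german_reference = "ich mag das nicht"    # parallel Standard German sentence
asr_output = "ich mag des nicht"          # an ASR hypothesis that partly normalizes

print("WER vs. Bavarian reference:", wer(bavarian_reference, asr_output))
print("WER vs. German reference (STD):", wer(german_reference, asr_output))
```

In this toy case the hypothesis is closer to the Standard German sentence than to the dialect reference, which is the kind of difference the "STD" columns are meant to expose.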

All subfolders containing (automatic or gold-standard) transcriptions are in zip archives with the password MaiNLP so as to prevent potential inclusion in web-scraped datasets (cf. Jacovi et al., 2023). Unzip them to get the subfolders with the same name. Please also use a zip archive (or similar) if you re-distribute the transcriptions.
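
If you prefer to unpack the archives from a script, the following is a minimal sketch using Python's standard zipfile module. The helper name and the example path are hypothetical. It also assumes the archives use the legacy zip encryption that zipfile can decrypt. If they use AES encryption, a tool such as 7-Zip is needed instead.

```python
import zipfile
from pathlib import Path


def extract_protected(archive_path, target_dir, password="MaiNLP"):
    """Extract a password-protected zip archive into target_dir.

    Returns the list of member names contained in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # zipfile expects the password as bytes.
        archive.extractall(path=target, pwd=password.encode("utf-8"))
        return archive.namelist()


# Hypothetical usage; adjust the path to the archive you want to unpack:
# extract_protected("data/intents/asr.zip", "data/intents")
```

Extracting into the archive's parent folder yields a subfolder with the same name as the archive, matching the layout described above.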

Licenses + links to the datasets

MASSIVE: https://github.com/alexa/massive, Apache 2.0
Speech-MASSIVE: https://github.com/hlt-mt/Speech-MASSIVE, CC BY-NC-SA 4.0
xSID: https://github.com/mainlp/xsid, CC BY-SA 4.0
MAS:de-ba: https://github.com/mainlp/NaLiBaSID
SwissDial: https://mtc.ethz.ch/publications/open-source/swiss-dial.html, CC BY-NC 4.0
xSID-audio: link to come (at the latest when the paper is published), license TBD
The code in this repo: TBD

Known issues

The random seeds are not set properly. Although the different runs are in fact seeded differently, the seed numbers in the prediction file names and in the unaggregated results tables cannot be used to reproduce a run with exactly the same random seed.

