Digitising Swiss German: how to process and study a polycentric spoken language

Cited by: 10
Authors
Scherrer, Yves [1]
Samardzic, Tanja [2]
Glaser, Elvira [3]
Affiliations
[1] Univ Helsinki, Dept Digital Humanities, Helsinki, Finland
[2] Univ Zurich, Language & Space Lab, URPP Language & Space, Zurich, Switzerland
[3] Univ Zurich, Dept German, Zurich, Switzerland
Keywords
Swiss German; Corpus; Non-standard language; Spoken language; Normalisation; Speech-to-text alignment; Word level annotation; Dialectology; Dialectometry; Oral history
DOI
10.1007/s10579-019-09457-5
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.
Pages: 735-769
Number of pages: 35