An automatic part-of-speech tagger for Middle Low German

Cited by: 2
Authors
Koleva, Mariya [1]
Farasyn, Melissa [2]
Desmet, Bart [1]
Breitbarth, Anne [2]
Hoste, Veronique [1]
Affiliations
[1] Univ Ghent, Language & Translat Technol Team LT3, Groot Brittannielaan 45, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Linguist IaLing, Blandijnberg 2, B-9000 Ghent, Belgium
Keywords
historical linguistics; part-of-speech tagging; conditional random fields; feature selection; normalization
DOI
10.1075/ijcl.22.1.05kol
Chinese Library Classification (CLC) number
H0 [Linguistics]
Subject classification codes
030303; 0501; 050102
Abstract
Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.
Pages: 107-140
Page count: 34