Weigh your words: memory-based lemmatization for Middle Dutch

Cited by: 21
Authors
Kestemont, Mike [1]
Daelemans, Walter [1]
De Pauw, Guy [1]
Affiliations
[1] Univ Antwerp, CLiPS Computat Linguist Grp, B-2000 Antwerp, Belgium
Source
LITERARY AND LINGUISTIC COMPUTING | 2010 / Vol. 25 / Issue 03
Keywords
SEARCH;
DOI
10.1093/llc/fqq011
Chinese Library Classification number
H0 [Linguistics];
Discipline classification code
030303 ; 0501 ; 050102 ;
Abstract
This article deals with the lemmatization of Middle Dutch literature. This text collection, like any other medieval corpus, is characterized by enormous spelling variation, which makes it difficult to perform a computational analysis of this kind of data. Lemmatization is therefore an essential preprocessing step in many applications, since it allows abstraction from superficial textual variation, for instance in spelling. The data we work with is the Corpus-Gysseling, containing all surviving Middle Dutch literary manuscripts dated before 1300 AD. In this article we present a language-independent system that can 'learn' intra-lemma spelling variation. We describe a series of experiments with this system, using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first procedure attempts to generate new spelling variants; the second seeks to implement a novel string distance metric to better detect spelling variants. The latter system attempts to rerank candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and represents a substantial step forward in the computational study of Middle Dutch literature. Because they are language-independent, our techniques might be of interest to other research domains as well.
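As a rough illustration of the baseline the abstract mentions, the sketch below retrieves lemma candidates for an unseen spelling by classic Levenshtein distance. It is not the authors' system. The names `levenshtein`, `candidates` and `TOY_LEXICON` are hypothetical, and the spellings and lemmas in the toy lexicon are invented for illustration rather than taken from the Corpus-Gysseling. The reranking metric described in the abstract is not reproduced here, since the abstract gives no details of it.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: each insertion, deletion or substitution costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical toy lexicon: attested spelling -> lemma (illustrative only).
TOY_LEXICON: Dict[str, str] = {
    "coninc": "koning",
    "coninghe": "koning",
    "ridder": "ridder",
    "riddere": "ridder",
    "joncfrouwe": "jonkvrouw",
}


def candidates(token: str, lexicon: Dict[str, str], k: int = 3) -> List[Tuple[int, str, str]]:
    """Return the k known spellings closest to `token` as (distance, spelling, lemma)."""
    scored = [(levenshtein(token, form), form, lemma) for form, lemma in lexicon.items()]
    scored.sort()  # ascending distance; ties broken alphabetically by spelling
    return scored[:k]


if __name__ == "__main__":
    for distance, form, lemma in candidates("conync", TOY_LEXICON):
        print(distance, form, lemma)
```

A second-stage reranker of the kind the abstract describes would take the list returned by `candidates` and reorder it with a different similarity measure. In this sketch the Levenshtein ranking is simply left as the final answer.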
Pages: 287-301
Page count: 15