Utilizing constituent structure for compound analysis

Authors
Dadason, Jon Fridrik [1]
Bjarnadottir, Kristin [1]
Affiliations
[1] Univ Iceland, Arni Magnusson Inst Iceland Studies, IS-101 Reykjavik, Iceland
Source
LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2014
Keywords
decompounding; constituent structure; Icelandic compounds
DOI
Not available
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
Compounding is extremely productive in Icelandic and multi-word compounds are common. The likelihood of finding previously unseen compounds in texts is thus very high, which makes out-of-vocabulary words a problem in the use of NLP tools. The tool described in this paper splits Icelandic compounds and shows their binary constituent structure. The probability of a constituent in an unknown (or unanalysed) compound forming a combined constituent with either of its neighbours is estimated, with the use of data on the constituent structure of over 240 thousand compounds from the Database of Modern Icelandic Inflection, and word frequencies from Íslenskur orðasjóður, a corpus of approx. 550 million words. Thus, the structure of an unknown compound is derived by comparison with compounds with partially the same constituents and similar structure in the training data. The granularity of the split returned by the decompounder is important in tasks such as semantic analysis or machine translation, where a flat (non-structured) sequence of constituents is insufficient.
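Illustration
The idea of deriving a binary constituent structure from neighbour affinities can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the greedy merge strategy, the function names `surface` and `bracket`, and the toy counts in `TOY_COUNTS` are all assumptions made for the example, and raw counts stand in for the probability estimates the abstract mentions.

```python
def surface(node):
    """Concatenate the leaves of a (possibly nested) constituent into one string."""
    if isinstance(node, str):
        return node
    return surface(node[0]) + surface(node[1])


def bracket(constituents, pair_counts):
    """Greedily merge the adjacent pair with the highest count until one tree remains.

    `pair_counts` maps (left_surface, right_surface) to how often that pair was
    seen as a combined constituent in some training data (hypothetical here).
    Ties are resolved in favour of the leftmost pair.
    """
    if not constituents:
        raise ValueError("need at least one constituent")
    nodes = list(constituents)
    while len(nodes) > 1:
        # Score every pair of neighbouring nodes by its training count.
        scores = [pair_counts.get((surface(a), surface(b)), 0)
                  for a, b in zip(nodes, nodes[1:])]
        i = max(range(len(scores)), key=scores.__getitem__)
        # Replace the two neighbours with a single binary node.
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]
    return nodes[0]


# Invented toy counts (English stand-ins, not data from the paper).
TOY_COUNTS = {("table", "cloth"): 5, ("kitchen", "table"): 2}

tree = bracket(["kitchen", "table", "cloth"], TOY_COUNTS)
print(tree)
```

With these toy counts, "table" combines with "cloth" before "kitchen" is attached, so the result is a nested structure rather than a flat sequence of three constituents.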
Pages: 1637-1641
Number of pages: 5