Improving word coverage using unsupervised morphological analyser

Cited by: 1
Authors
Sunitha, K. V. N. [1 ]
Kalyani, N. [1 ]
Affiliations
[1] G Narayanamma Inst Technol & Sci, Dept Comp Sci & Engn, Hyderabad 500008, Andhra Pradesh, India
Source
SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES | 2009 / Vol. 34 / Issue 05
Keywords
Human languages; unsupervised morphological analyser; clustering; morphological segmentation; linguistic research;
DOI
10.1007/s12046-009-0041-x
Chinese Library Classification number
T [Industrial Technology];
Discipline classification code
08;
Abstract
Processing tasks related to human languages now require powerful computers. Human languages, also called natural languages, are highly versatile systems for encoding information and can capture information from many domains. To enable a computer to process information in a human language, the language must be appropriately 'described' to the computer, i.e. it must be 'modelled'. In this work, we present an approach for acquiring the morphology of an inflectional language such as Hindi. It is an unsupervised learning approach, suitable for languages with a rich concatenative morphology. Broadly, our work is carried out in three steps: (1) acquire the morphology of Hindi from a raw (unannotated) text corpus from the Central Institute of Indian Languages (CIIL), Mysore; (2) prepare clusters and build a stem bag and a suffix bag; (3) use the morphological knowledge to decompose a given word into stems and suffixes according to their morphological behaviour, and add new words. A prime motivation behind this work is to eventually develop an unsupervised morphological analyser that is language-independent (here applied to Hindi). A second motivation is to develop a language-independent morphological segmentation method, since the study of morphology has been shown to benefit a range of NLP tasks such as speech recognition, speech synthesis, machine translation and information retrieval. Though Hindi is an important and national language of India, little computational work has been done so far in this direction. Our work is one of the first efforts in this regard and can be considered pioneering. There are many such languages for which a suitable but inexpensive computational acquisition process is very important. These languages receive very little attention in computational linguistic research, in terms of both available funding and number of researchers. We do not, however, claim that our approach is a solution for all such languages. Different languages have characteristics that require individual research attention.
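The abstract names the three steps but does not give the algorithm, so the following is only a minimal sketch of how such a pipeline might look, not the authors' method. It assumes a simple support-count heuristic: an ending counts as a suffix if enough distinct stems share it, and a stem is kept if it occurs with at least two such suffixes. The function names (`candidate_splits`, `build_bags`, `segment`), the thresholds, and the English toy word list (used in place of the Hindi CIIL corpus) are all hypothetical.

```python
from collections import defaultdict

# Toy stand-in for a raw, unannotated corpus (assumption: not CIIL data).
TOY_WORDS = ["walks", "walked", "walking",
             "talks", "talked", "talking",
             "jumps", "jumped", "jumping"]


def candidate_splits(word, min_stem=2, max_suffix=4):
    """Yield (stem, suffix) pairs with a non-empty suffix."""
    start = max(min_stem, len(word) - max_suffix)
    for i in range(start, len(word)):
        yield word[:i], word[i:]


def build_bags(words, min_support=3):
    """Steps 1 and 2: learn clusters, a stem bag and a suffix bag."""
    stems_by_suffix = defaultdict(set)
    for word in set(words):
        for stem, suffix in candidate_splits(word):
            stems_by_suffix[suffix].add(stem)

    # Candidate suffixes: endings shared by at least `min_support` stems.
    frequent = {s for s, stems in stems_by_suffix.items()
                if len(stems) >= min_support}

    # Cluster: each stem mapped to the frequent suffixes it occurs with.
    clusters = defaultdict(set)
    for suffix in frequent:
        for stem in stems_by_suffix[suffix]:
            clusters[stem].add(suffix)

    # Keep stems seen with two or more suffixes, and only their suffixes.
    clusters = {stem: sfx for stem, sfx in clusters.items() if len(sfx) >= 2}
    stem_bag = set(clusters)
    suffix_bag = set().union(*clusters.values())
    return stem_bag, suffix_bag, clusters


def segment(word, stem_bag, suffix_bag, min_stem=2):
    """Step 3: split a word into (stem, suffix); learn the stem if new."""
    splits = [(word[:i], word[i:]) for i in range(min_stem, len(word))]
    # First choice: a known stem followed by a known suffix.
    for stem, suffix in splits:
        if stem in stem_bag and suffix in suffix_bag:
            return stem, suffix
    # Fallback: longest known suffix; the remainder is added as a new stem.
    for stem, suffix in splits:
        if suffix in suffix_bag:
            stem_bag.add(stem)
            return stem, suffix
    return word, ""  # no analysis found


if __name__ == "__main__":
    stems, suffixes, _ = build_bags(TOY_WORDS)
    print(sorted(stems), sorted(suffixes))
    print(segment("talked", stems, suffixes))
    print(segment("played", stems, suffixes))
```

On the toy list this yields the stems `walk`, `talk`, `jump` and the suffixes `s`, `ed`, `ing`; the unseen word `played` is split as `play` + `ed`, and `play` is added to the stem bag. Requiring two suffixes per stem is what filters out spurious splits such as `walke` + `d`. A real system for Hindi would need a far larger corpus and a more careful scoring scheme.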
Pages: 703-715
Number of pages: 13