Acquisition of morphology of an indic language from text corpus

Authors
Sharma, Utpal [1, 4]
Kalita, Jugal K. [2, 5]
Das, Rajib K. [3, 6]
Affiliations
[1] Department of Computer Science and Engineering, Tezpur University, Tezpur-784028, Assam
[2] Department of Computer Science, University of Colorado, Colorado Springs
[3] Department of Computer Science and Engineering, Calcutta University, Kolkata
Source
ACM Transactions on Asian Language Information Processing | 2008 / Vol. 7 / No. 3
Keywords
Assamese; Indo-European languages; Machine learning; Morphology
DOI
10.1145/1386869.1386871
Abstract
This article describes an approach to unsupervised learning of morphology from an unannotated corpus for Assamese, a highly inflectional Indo-European language spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language, which is spoken widely in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology, where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text. © 2008 ACM.