Acquisition of morphology of an Indic language from text corpus

Cited by: 0
Authors
Sharma, Utpal [1, 4]
Kalita, Jugal K. [2, 5]
Das, Rajib K. [3, 6]
Affiliations
[1] Department of Computer Science and Engineering, Tezpur University, Tezpur-784028, Assam
[2] Department of Computer Science, University of Colorado, Colorado Springs
[3] Department of Computer Science and Engineering, Calcutta University, Kolkata
Source
ACM Transactions on Asian Language Information Processing | 2008 / Vol. 7 / Issue 03
Keywords
Assamese; Indo-European languages; Machine learning; Morphology
DOI
10.1145/1386869.1386871
Abstract
This article describes an approach to unsupervised learning of morphology from an unannotated corpus for Assamese, a highly inflectional Indo-European language spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language, which is spoken widely in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology, where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text. © 2008 ACM.
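The precision, recall, and F-measure figures quoted in the abstract can be read with the following minimal sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall); the abstract does not state which variant the authors used. The function name and the toy English (word, stem, suffix) triples are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of analyses.

    Assumption: an analysis counts as correct only if it matches a gold
    analysis exactly. The paper may use a different matching criterion.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: (word, stem, suffix) triples, for illustration only.
gold = {
    ("walked", "walk", "ed"),
    ("walks", "walk", "s"),
    ("talking", "talk", "ing"),
    ("talks", "talk", "s"),
}
predicted = {
    ("walked", "walk", "ed"),
    ("walks", "walk", "s"),
    ("talking", "talki", "ng"),  # wrong stem/suffix boundary
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Under this definition, equal precision and recall of 0.85 give an F-measure of 0.85, and an F-measure below 0.60 means that precision and recall cannot both reach 0.60.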