Acquisition of morphology of an indic language from text corpus

Authors
Sharma, Utpal [1, 4]
Kalita, Jugal K. [2, 5]
Das, Rajib K. [3, 6]
Affiliations
[1] Department of Computer Science and Engineering, Tezpur University, Tezpur-784028, Assam
[2] Department of Computer Science, University of Colorado, Colorado Springs
[3] Department of Computer Science and Engineering, Calcutta University, Kolkata
Source
ACM Transactions on Asian Language Information Processing | 2008, Vol. 7, No. 3
Keywords
Assamese; Indo-European languages; Machine learning; Morphology
DOI
10.1145/1386869.1386871
Abstract
This article describes an approach to unsupervised learning of morphology from an unannotated corpus for Assamese, a highly inflectional Indo-European language spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language, which is spoken widely in northeast India, and the work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology, where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text. © 2008 ACM.