Acquisition of morphology of an Indic language from text corpus

Cited by: 0

Authors:

Sharma, Utpal [1,4]

Kalita, Jugal K. [2,5]

Das, Rajib K. [3,6]

Affiliations:

[1] Department of Computer Science and Engineering, Tezpur University, Tezpur-784028, Assam

[2] Department of Computer Science, University of Colorado, Colorado Springs

[3] Department of Computer Science and Engineering, Calcutta University, Kolkata

Source:

ACM Transactions on Asian Language Information Processing | 2008, Vol. 7, No. 3

Keywords:

Assamese; Indo-European languages; Machine learning; Morphology

DOI:

10.1145/1386869.1386871

Abstract:

This article describes an approach to unsupervised learning of morphology from an unannotated corpus for Assamese, a highly inflectional Indo-European language spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language, which is spoken widely in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology, where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text. © 2008 ACM.
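
The abstract reports results as precision, recall, and F-measure but does not say which F-measure variant is used. As a minimal sketch, assuming the standard balanced F-measure (the harmonic mean of precision and recall), the relationship between the three figures can be illustrated as follows. The function name `f_measure` and the `beta` parameter are illustrative choices, not taken from the article.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    Assumption: the article uses the balanced form (beta = 1);
    the abstract itself does not state the variant.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both scores are zero, so the F-measure is defined as zero here.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # With precision and recall both near 0.85, as the abstract reports
    # for small chunks of coherent text, the balanced F-measure is also 0.85.
    print(round(f_measure(0.85, 0.85), 4))
```

Under this assumption, equal precision and recall always yield an F-measure equal to that shared value, while a gap between the two pulls the F-measure toward the lower score.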
