Towards Better Text Processing Tools for the Ainu Language

Authors
Nowakowski, Karol [1]
Ptaszynski, Michal [1]
Masui, Fumito [1]
Affiliations
[1] Kitami Inst Technol, Dept Comp Sci, 165 Koen Cho, Kitami, Hokkaido 0908507, Japan
Source
HUMAN LANGUAGE TECHNOLOGY. CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS, LTC 2017 | 2020 / Vol. 12598
Keywords
Ainu language; Endangered languages; Under-resourced languages; Transcription normalization; Word segmentation; Tokenization; Part-of-speech tagging;
DOI
10.1007/978-3-030-66527-2_10
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we present our research devoted to the development of Natural Language Processing technologies for the Ainu language, a critically endangered language isolate spoken by the Ainu people, the native inhabitants of northern parts of the Japanese archipelago. In particular, we focused on improving the existing tools for transcription normalization, word segmentation (tokenization) and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments confirmed the positive effect of these modifications on the overall performance of the tools, especially with objective samples unrelated to the training data. We also discuss further improvements obtained by applying corpus-driven language models to the problem of word segmentation and using a state-of-the-art tool for training part-of-speech taggers.
Pages: 131-145
Page count: 15
Related papers
50 in total
  • [21] Word segmentation in Chinese language processing
    Shu, Xinxin
    Wang, Junhui
    Shen, Xiaotong
    Qu, Annie
    Statistics and Its Interface, 2017, 10 (02) : 165 - 173
  • [22] Translating Speech to Indian Sign Language Using Natural Language Processing
    Sharma, Purushottam
    Tulsian, Devesh
    Verma, Chaman
    Sharma, Pratibha
    Nancy, Nancy
    FUTURE INTERNET, 2022, 14 (09)
  • [23] Scene Text Detection & Language Translation Using SWT and EBMT
    Khandait, S.
    Khandait, P.
    Jambhulkar, P.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMMUNICATION AND SIGNAL PROCESSING 2016 (ICCASP 2016), 2017, 137 : 547 - 555
  • [24] Overview and Application of Text Data Pre-Processing Techniques for Text Mining on Health News Tweets
    Chaudhary, Gauri
    Kshirsagar, Manali
    HELIX, 2018, 8 (05) : 3764 - 3768
  • [25] Backdoors Against Natural Language Processing: A Review
    Li, Shaofeng
    Dong, Tian
    Zhao, Benjamin
    Xue, Jason
    Du, Suguo
    Zhu, Haojin
    IEEE SECURITY & PRIVACY, 2022, 20 (05) : 50 - 59
  • [26] Related Blogs' Summarization With Natural Language Processing
    Baliyan, Niyati
    Sharma, Aarti
    COMPUTER JOURNAL, 2021, 64 (03) : 347 - 357
  • [27] IceNLP: A Natural Language Processing Toolkit for Icelandic
    Loftsson, Hrafn
    Rognvaldsson, Eirikur
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 717 - +
  • [28] Professional language in Swedish clinical text: Linguistic characterization and comparative studies
    Smith, Kelly
    Megyesi, Beata
    Velupillai, Sumithra
    Kvist, Maria
    NORDIC JOURNAL OF LINGUISTICS, 2014, 37 (02) : 297 - 323
  • [29] TaLAPi - A Thai Linguistically Annotated Corpus for Language Processing
    Aw, AiTi
    Aljunied, Sharifah Mahani
    Lertcheva, Nattadaporn
    Kalunsima, Sasiwimon
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014,
  • [30] Professional Chat Application based on Natural Language Processing
    Karthick, S.
    Victor, R. John
    Manikandan, S.
    Goswami, Bhargavi
    2018 IEEE INTERNATIONAL CONFERENCE ON CURRENT TRENDS IN ADVANCED COMPUTING (ICCTAC), 2018,