Towards Better Text Processing Tools for the Ainu Language

Authors
Nowakowski, Karol [1]
Ptaszynski, Michal [1]
Masui, Fumito [1]
Affiliation
[1] Kitami Inst Technol, Dept Comp Sci, 165 Koen Cho, Kitami, Hokkaido 0908507, Japan
Source
HUMAN LANGUAGE TECHNOLOGY. CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS, LTC 2017 | 2020, Vol. 12598
Keywords
Ainu language; Endangered languages; Under-resourced languages; Transcription normalization; Word segmentation; Tokenization; Part-of-speech tagging
DOI
10.1007/978-3-030-66527-2_10
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present our research devoted to the development of Natural Language Processing technologies for the Ainu language, a critically endangered language isolate spoken by the Ainu people, the native inhabitants of northern parts of the Japanese archipelago. In particular, we focused on improving the existing tools for transcription normalization, word segmentation (tokenization) and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments confirmed the positive effect of these modifications on the overall performance of the tools, especially with objective samples unrelated to the training data. We also discuss further improvements obtained by applying corpus-driven language models to the problem of word segmentation and using a state-of-the-art tool for training part-of-speech taggers.
Pages: 131-145
Number of pages: 15