Data Augmentation Methods for Low-Resource Orthographic Syllabification

Cited by: 3
Authors
Suyanto, Suyanto [1]
Lhaksmana, Kemas M. [1]
Bijaksana, Moch Arif [1]
Kurniawan, Adriana [1]
Affiliations
[1] Telkom Univ, Sch Comp, Bandung 40257, Indonesia
Keywords
Indonesian; flipping onsets; orthographic syllabification; swapping consonant-graphemes; transposing nuclei; LANGUAGE; MODEL
DOI
10.1109/ACCESS.2020.3015778
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
An n-gram syllabification model generally produces a high error rate for a low-resource language, such as Indonesian, because of the high rate of out-of-vocabulary (OOV) n-grams. In this paper, a combination of three data augmentation methods is proposed to solve this problem: swapping consonant-graphemes, flipping onsets, and transposing nuclei. An investigation on 50k Indonesian words shows that the combination of the three methods drastically increases the number of both unigrams and bigrams. The previously proposed flipping-onsets procedure has been shown to enhance standard bigram syllabification, giving a relative decrease in the syllable error rate (SER) of up to 18.02%, and the previously proposed swapping of consonant-graphemes has been shown to give a relative decrease in SER of up to 31.39%. In this research, a new augmentation method based on transposing nuclei is proposed and combined with the flipping and swapping procedures to address the weakness of bigram syllabification in handling OOV bigrams. An evaluation using k-fold cross-validation (k-FCV) with k = 5 on 50 thousand formal Indonesian words shows that the proposed combination of the three procedures gives a relative decrease of up to 37.63% in the mean SER produced by the standard bigram model. The proposed model is comparable to the fuzzy k-nearest neighbor in every class (FkNNC)-based model. It is worse than the state-of-the-art model, which combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF), but it offers low complexity.
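The quantities the abstract reports (SER, relative decrease in SER, and the OOV rate of bigrams) can be illustrated with a minimal Python sketch. The abstract does not define them precisely, so everything below is an assumption made for illustration: SER is taken as the percentage of reference syllables whose character span is missing from the prediction, n-grams are taken to be character bigrams, and the function names and toy words are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence, Set, Tuple


def spans(syllables: List[str]) -> Set[Tuple[int, int]]:
    """Convert a syllabified word into a set of (start, end) character spans."""
    out = set()
    start = 0
    for syllable in syllables:
        out.add((start, start + len(syllable)))
        start += len(syllable)
    return out


def syllable_error_rate(predicted: Sequence[List[str]],
                        reference: Sequence[List[str]]) -> float:
    """Assumed SER: percent of reference syllables whose span is not predicted."""
    total = 0
    wrong = 0
    for pred, ref in zip(predicted, reference):
        ref_spans = spans(ref)
        total += len(ref_spans)
        wrong += len(ref_spans - spans(pred))
    return 100.0 * wrong / total


def relative_decrease(baseline_ser: float, new_ser: float) -> float:
    """Relative decrease (percent) of new_ser with respect to baseline_ser."""
    return 100.0 * (baseline_ser - new_ser) / baseline_ser


def char_bigrams(word: str) -> List[str]:
    """Character bigrams of a word (the n-gram unit is an assumption here)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def oov_bigram_rate(train_words: Sequence[str],
                    test_words: Sequence[str]) -> float:
    """Percent of test bigram tokens that never occur in the training words."""
    known = {bg for word in train_words for bg in char_bigrams(word)}
    test = [bg for word in test_words for bg in char_bigrams(word)]
    unseen = sum(1 for bg in test if bg not in known)
    return 100.0 * unseen / len(test)


# Toy data, not taken from the paper.
reference = [["ma", "kan"], ["per", "gi"]]
predicted = [["mak", "an"], ["per", "gi"]]
ser = syllable_error_rate(predicted, reference)
```

Under these assumptions, a relative decrease of r percent means the new mean SER equals the baseline SER multiplied by (1 - r/100). Data augmentation would be expected to lower `oov_bigram_rate` by adding previously unseen bigrams to the training set.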
Pages: 147399-147406
Number of pages: 8