LiDA: Language-Independent Data Augmentation for Text Classification

Cited by: 4
Authors
Sujana, Yudianto [1 ,2 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Intelligent Knowledge Management Lab, Tainan 701, Taiwan
[2] Univ Sebelas Maret, Fac Teacher Training & Educ, Surakarta 57126, Indonesia
Keywords
Data models; Synthetic data; Text categorization; Text mining; Transforms; Resource management; Data augmentation; Low-resource language; Text classification
DOI
10.1109/ACCESS.2023.3234019
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Developing a high-performance text classification model for a low-resource language is challenging due to the lack of labeled data, and collecting large amounts of labeled data is costly. One approach to increasing the amount of labeled data is to create synthetic data using data augmentation techniques. However, most available data augmentation techniques work on English data and are highly language-dependent, since they operate at the word or sentence level, for example by replacing words or paraphrasing sentences. We present Language-independent Data Augmentation (LiDA), a technique that uses a multilingual language model to create synthetic data from the available training dataset. Unlike other methods, our approach works at the sentence-embedding level and is therefore independent of any particular language. We evaluated LiDA on three languages and on various fractions of each dataset, and the results showed improved performance for both LSTM and BERT models. Furthermore, we conducted an ablation study to determine the impact of each component of our method on overall performance. The source code of LiDA is available at https://github.com/yest/LiDA.
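The abstract states that augmentation happens at the sentence-embedding level rather than on raw text, but does not spell out the operation; the details are in the paper and the linked repository. The sketch below is therefore only an illustration of the general idea under two explicit assumptions: the `encode` function is a stub standing in for a real multilingual sentence encoder, and the augmentation step is an assumed mixup-style interpolation between same-class embeddings, not necessarily LiDA's actual transform.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Assumptions: (1) `encode` is a placeholder for a multilingual
# sentence encoder; (2) same-class embedding interpolation is one
# plausible language-independent augmentation.
import numpy as np

def encode(sentences, dim=8, seed=0):
    """Stub encoder: deterministic random vectors so the sketch runs
    without downloading a multilingual language model."""
    rng = np.random.default_rng(seed)
    return [rng.normal(size=dim) for _ in sentences]

def augment(embeddings, labels, alpha=0.3, n_new=4, seed=1):
    """Create synthetic training points by interpolating pairs of
    same-class sentence embeddings (assumed operation)."""
    rng = np.random.default_rng(seed)
    by_class = {}
    for x, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(x)
    classes = list(by_class)
    synth_x, synth_y = [], []
    for _ in range(n_new):
        y = classes[rng.integers(len(classes))]
        i, j = rng.integers(len(by_class[y]), size=2)
        lam = rng.uniform(1 - alpha, 1.0)  # stay near one endpoint
        synth_x.append(lam * by_class[y][i] + (1 - lam) * by_class[y][j])
        synth_y.append(y)
    return np.stack(synth_x), synth_y

# Usage: augment a tiny labeled set, then train on real + synthetic.
emb = encode(["bagus sekali", "sangat baik", "buruk", "jelek"], dim=6)
X_new, y_new = augment(emb, [1, 1, 0, 0], n_new=5)
```

Because the synthetic points live in embedding space, the downstream classifier would consume embeddings directly; no language-specific resource (WordNet, paraphraser, back-translation system) is needed, which is the property the abstract emphasizes.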
Pages: 10894-10901
Page count: 8