Saliency-Based Token Swap - A Language-Agnostic Data Augmentation Method for Text Classification

Cited by: 0
Authors
Ilangeshwaran, Hiroshan [1]
Abeywardhana, Lakmini [1]
Rathnayake, Samadhi [1]
Affiliations
[1] Sri Lanka Inst Informat Technol, Dept Informat Technol, Colombo, Sri Lanka
Source
2024 9TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY RESEARCH, ICITR | 2024
Keywords
data augmentation; text classification; SHAP values; low-resource languages;
DOI
10.1109/ICITR64794.2024.10857798
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Data scarcity remains a significant challenge in text classification and often leads to suboptimal performance of machine learning models. To address this issue, this paper introduces Saliency-based token Swap (SSwap), a data augmentation technique whose strategies use saliency values to swap tokens between sentences. SSwap was assessed at varying levels of data availability, evaluated on two low-resource languages (Sinhala and Tamil), and compared with existing data augmentation methods. The experiments showed consistent improvements in text classification performance, particularly in low-resource settings. The results indicate that SSwap can help sustain robust model performance when data is limited or the language is low-resource, with implications for a wide range of text classification tasks.
Pages: 5
Related papers
21 in total
  • [1] A Survey on Data Augmentation for Text Classification
    Bayer, Markus
    Kaufhold, Marc-Andre
    Reuter, Christian
    [J]. ACM COMPUTING SURVEYS, 2023, 55 (07)
  • [2] Coulombe C, 2018, arXiv:1812.04718, DOI 10.48550/arXiv.1812.04718
  • [3] Demszky D, 2020, 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), P4040
  • [4] Logistic regression and artificial neural network classification models: a methodology review
    Dreiseitl, S
    Ohno-Machado, L
    [J]. JOURNAL OF BIOMEDICAL INFORMATICS, 2002, 35 (5-6) : 352 - 359
  • [5] Edunov S, 2018, 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), P489
  • [6] Feng SY, 2021, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, P968
  • [7] Gallegos I. O., 2024, Comput. Linguist., P1, DOI 10.1162/coli_a_00524
  • [8] Gupta R, 2019, INT CONF ACOUST SPEE, P7380, DOI 10.1109/ICASSP.2019.8682544
  • [9] Jenarthanan R, 2019, 2019 MORATUWA ENGINEERING RESEARCH CONFERENCE (MERCON) / 5TH INTERNATIONAL MULTIDISCIPLINARY ENGINEERING RESEARCH CONFERENCE, P49, DOI 10.1109/MERCon.2019.8818760
  • [10] Kryscinski W, 2020, PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), P9332