Data augmentation and adversary attack on limited resources text classification

Cited by: 0
Authors
Sánchez-Vega F. [1 ,2 ]
López-Monroy A.P. [2 ]
Balderas-Paredes A. [2 ]
Pellegrin L. [3 ]
Rosales-Pérez A. [4 ]
Affiliations
[1] Consejo Nacional de Humanidades, Ciencias y Tecnologías (CONAHCyT), Av. de los Insurgentes Sur 1582, CDMX, Col. Benito Juárez
[2] Department of Computer Science, Mathematics Research Center (CIMAT), Jalisco S/N, Col. Valenciana, GTO., Guanajuato
[3] Faculty of Sciences, Universidad Autónoma de Baja California (UABC), Transpeninsular Highway 3917, B.C., Ensenada
[4] Department of Computer Science, Mathematics Research Center (CIMAT), Alianza Centro 502, N.L., Monterrey
Keywords
Adversarial attacks; Data augmentation; Deep learning applications; Instance generation; Text classification;
DOI
10.1007/s11042-024-19123-w
Abstract
Data Augmentation and Adversary Attack on text are complex techniques based on the generation of new instances. This is performed by introducing variations, such as lexical and syntactic changes, into previously known text examples. In both techniques, the general semantic meaning of the text must be preserved in order to boost or to mislead the classifier, respectively. Instance generation for data augmentation is especially important in data-scarce settings such as limited-resource languages, since the new instances can serve as training samples that mitigate data scarcity. In this paper, we adapt four textual instance generation methods used in existing Data Augmentation and Adversary Attack methods, and we propose two additional methods for use in limited-resource language environments. We empirically quantify how much damage Adversary Attack techniques can cause to text classification performance in low-resource scenarios. Furthermore, we explore the use of data augmentation with adversarial attack strategies to increase the robustness of classification models against adversary attacks. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
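The abstract describes generating new instances by introducing lexical changes into known examples while preserving their meaning. A minimal sketch of this idea is synonym substitution; the tiny synonym table below is a hypothetical illustration, not data or a method from the paper:

```python
import random

# Hypothetical synonym table for illustration only; a real system would
# draw candidates from a thesaurus, word embeddings, or a language model.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def augment(text, n_swaps=1, rng=None):
    """Generate a new instance by swapping words for synonyms.

    The lexical change perturbs the surface form while (approximately)
    preserving the sentence's semantics -- the shared requirement of
    data augmentation and adversarial example generation.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    words = text.split()
    # Positions whose word has a known synonym.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)
```

For augmentation, such variants are added to the training set; for an attack, a variant would instead be searched for that flips the classifier's prediction.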
Pages: 1317–1344
Page count: 27