Data augmentation and adversary attack on limit resources text classification

Cited: 0
Authors
Sánchez-Vega F. [1 ,2 ]
López-Monroy A.P. [2 ]
Balderas-Paredes A. [2 ]
Pellegrin L. [3 ]
Rosales-Pérez A. [4 ]
Affiliations
[1] Consejo Nacional de Humanidades, Ciencias y Tecnologías (CONAHCyT), Av. de los Insurgentes Sur 1582, CDMX, Col. Benito Juárez
[2] Department of Computer Science, Mathematics Research Center (CIMAT), Jalisco S/N, Col. Valenciana, GTO., Guanajuato
[3] Faculty of Sciences, Universidad Autónoma de Baja California (UABC), Transpeninsular Highway 3917, B.C., Ensenada
[4] Department of Computer Science, Mathematics Research Center (CIMAT), Alianza Centro 502, N.L., Monterrey
Keywords
Adversarial attacks; Data augmentation; Deep learning applications; Instance generation; Text classification;
DOI
10.1007/s11042-024-19123-w
Abstract
Data augmentation and adversarial attack on text are techniques based on the generation of new instances: variations such as lexical and syntactic changes are introduced into previously known text examples. In both techniques, the general semantic meaning of the text must be preserved in order to boost or to mislead the classifier, respectively. Instance generation for data augmentation is especially important in data-scarce scenarios such as limited-resource languages, since the new instances can serve as additional training samples to overcome data scarcity. In this paper, we adapt four textual instance generation methods used in existing data augmentation and adversarial attack approaches, and we propose two further methods for limited-resource language environments. We empirically quantify how much damage adversarial attack techniques can cause to text classification performance in low-resource scenarios. Furthermore, we explore the use of data augmentation with adversarial attack strategies to increase the robustness of classification models against such attacks. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
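The lexical-variation idea described in the abstract can be sketched as synonym substitution: replace some words with near-synonyms so the label-relevant meaning is preserved while the surface form changes. This is a minimal illustrative sketch, not the paper's exact method; the tiny `SYNONYMS` lexicon and the function name `augment` are assumptions made for the example.

```python
import random

# Hand-made toy lexicon for illustration only (not from the paper).
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def augment(text, p=0.5, rng=None):
    """Return a new instance of `text` with some words swapped for synonyms.

    Each word found in the lexicon is replaced with probability `p`,
    so the semantic meaning (and hence the class label) is preserved.
    """
    rng = rng or random.Random(0)
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)
```

The same mechanism can serve both roles named in the abstract: with meaning-preserving substitutions it produces extra training samples (augmentation), while substitutions chosen to maximize classifier error turn it into an adversarial perturbation.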
Pages: 1317-1344
Page count: 27
Related papers
40 entries in total
  • [11] Devlin J., Chang M.-W., Lee K., Toutanova K., BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, pp. 4171-4186, (2019)
  • [12] Jin D., Jin Z., Zhou J.T., Szolovits P., Is BERT really robust? A strong baseline for natural language attack on text classification and entailment, Proceedings of the AAAI Conference on Artificial Intelligence, 34, 5, pp. 8018-8025, (2020)
  • [13] Dwibedi D., Misra I., Hebert M., Cut, paste and learn: Surprisingly easy synthesis for instance detection, IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, pp. 1310-1319, (2017)
  • [14] Fedus W., Goodfellow I.J., Dai A.M., MaskGAN: Better text generation via filling in the ______, 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, (2018)
  • [15] Ganitkevitch J., van Durme B., Callison-Burch C., PPDB: The paraphrase database, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 758-764, (2013)
  • [16] Gao J., Lanchantin J., Soffa M.L., Qi Y., Black-box generation of adversarial text sequences to evade deep learning classifiers, pp. 50-56, (2018)
  • [17] Goodfellow I., Bengio Y., Courville A., Deep Learning, (2016)
  • [18] Hill F., Reichart R., Korhonen A., SimLex-999: Evaluating semantic models with (genuine) similarity estimation, Comput Linguist, 41, 4, pp. 665-695, (2015)
  • [19] Hochreiter S., Schmidhuber J., Long short-term memory, Neural Comput, 9, 8, pp. 1735-1780, (1997)
  • [20] Iyyer M., Wieting J., Gimpel K., Zettlemoyer L., Adversarial example generation with syntactically controlled paraphrase networks, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, pp. 1875-1885, (2018)