A Survey of Data Augmentation Methods in Natural Language Processing

Cited by: 0
Authors
Feng, Ran [1]
Chen, Danlei [1]
Hua, Bolin [1]
Affiliations
[1] Department of Information Management, Peking University, Beijing
Funding
National Social Science Fund of China
Keywords
Data Augmentation; Deep Learning; Grammar Rules; Natural Language Processing; Text Augmentation;
DOI
10.11925/infotech.2096-3467.2024.0533
Abstract
[Objective] This paper comprehensively reviews text augmentation methods to reveal their current state of development and emerging trends.
[Coverage] Using "textual data augmentation" and "text augmentation" as search terms, we retrieved literature from Web of Science, Google Scholar, and CNKI, and screened out 88 representative papers for review.
[Methods] Text augmentation methods were categorized and summarized according to the objects they operate on, their implementation details, and the diversity of the generated results. On this basis, we thoroughly compared the methods with regard to their granularity, strengths, weaknesses, and applications.
[Results] Text augmentation approaches can be divided into text space-based methods and vector space-based methods. The former are intuitive and easily interpretable but may compromise the overall semantic structure of the text, while the latter manipulate semantic features directly but incur higher computational complexity. Current studies frequently require external knowledge resources, such as heuristic guidelines and task-specific data. Moreover, introducing deep learning algorithms can enhance the novelty and diversity of the generated data.
[Limitations] We primarily offer a systematic examination of the technical principles and performance characteristics of advanced methods, without quantitatively assessing the maturity of platform tools. In addition, the analysis is grounded in the selected literature and may not cover all potential application scenarios of text augmentation methods.
[Conclusions] Future work should focus on enriching and refining evaluation metrics for text augmentation techniques and on improving their robustness across downstream tasks through prompt learning. Retrieval-augmented generation and graph neural networks deserve serious attention for addressing the challenges posed by lengthy texts and limited resources, which can further unlock the potential of text augmentation in natural language processing. © 2025 Chinese Academy of Sciences. All rights reserved.
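To make the abstract's central distinction concrete, the sketch below contrasts a text space-based operation (an EDA-style random token swap) with a vector space-based operation (a mixup-style interpolation of sentence embeddings). It is a minimal illustration of the two families, not a method taken from the surveyed papers; the function names, parameters, and the assumption of precomputed embeddings are all illustrative.

```python
# Minimal sketches of the two augmentation families contrasted in the survey.
# All names and parameters are illustrative assumptions, not surveyed methods.

import random
import numpy as np

def random_swap(tokens, n_swaps=1, seed=None):
    """Text space-based augmentation: swap two random token positions.

    An EDA-style edit; intuitive and interpretable, but it can break the
    sentence's syntax or overall semantic structure.
    """
    rng = random.Random(seed)
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def embedding_mixup(x1, x2, alpha=0.2, seed=None):
    """Vector space-based augmentation: interpolate two sentence embeddings.

    A mixup-style operation; it manipulates semantic features directly but
    requires an encoder and adds computational cost.
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)      # mixing coefficient in (0, 1)
    return lam * x1 + (1 - lam) * x2  # convex combination of embeddings

if __name__ == "__main__":
    print(random_swap("data augmentation improves model robustness".split(), seed=0))
    print(embedding_mixup(np.ones(4), np.zeros(4), seed=0))
```

Even at this scale, the trade-off the survey describes is visible: the swapped sentence remains human-readable but may no longer be grammatical, whereas the interpolated vector has no surface form at all and is meaningful only to the downstream model.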
Pages: 19-32
Page count: 13