Self-Supervised Contextual Data Augmentation for Natural Language Processing

Cited by: 13
Authors
Park, Dongju [1]
Ahn, Chang Wook [1]
Affiliations
[1] Gwangju Inst Sci & Technol, Elect Engn & Comp Sci, Gwangju 61005, South Korea
Source
SYMMETRY-BASEL | 2019 / Vol. 11 / Issue 11
Funding
National Research Foundation, Singapore;
Keywords
data augmentation; self-supervised learning; natural language processing; text classification;
DOI
10.3390/sym11111393
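Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract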
In this paper, we propose a novel data augmentation method that takes the target context of the data into account via self-supervised learning. Instead of looking for exact synonyms of masked words, the proposed method finds words that can replace the original words given the context. For self-supervised learning, we can employ the masked language model (MLM), which masks a specific word within a sentence and predicts the original word. The MLM learns the context of a sentence through asymmetrical inputs and outputs. However, rather than using the existing MLM as is, we propose a label-masked language model (LMLM), which adds label information to the mask tokens used in the MLM so that the MLM can be applied effectively to labeled data. The augmentation method first performs self-supervised learning with the LMLM and then augments the data using the trained model. Through several experiments, we demonstrate that the proposed method improves the classification accuracy of recurrent neural network-based and convolutional neural network-based classifiers on text classification benchmark datasets, including the Stanford Sentiment Treebank-5 (SST5), the Stanford Sentiment Treebank-2 (SST2), the Subjectivity (Subj), the Multi-Perspective Question Answering (MPQA), the Movie Reviews (MR), and the Text Retrieval Conference (TREC) datasets. In addition, since the proposed method does not use external data, it eliminates the time spent collecting external data or pre-training on external data.
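As a rough illustration of the label-masking idea described in the abstract, the sketch below builds one training example by replacing randomly chosen words with a mask token that carries the sentence label. It is a minimal sketch, not the authors' implementation: the function name `label_mask`, the `[MASK_<label>]` token format, and the masking probability are all assumptions made for illustration, and the language model that would be trained on these examples is omitted.

```python
import random


def label_mask(tokens, label, mask_prob=0.15, seed=0):
    """Replace randomly chosen tokens with a label-specific mask token.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original word a model would be trained to recover.
    The "[MASK_<label>]" format is a hypothetical choice for this sketch.
    """
    rng = random.Random(seed)
    mask_token = f"[MASK_{label}]"
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)  # input side: label-aware mask
            targets[i] = tok           # output side: original word
        else:
            masked.append(tok)
    return masked, targets


sentence = "the movie was surprisingly good".split()
masked, targets = label_mask(sentence, label="positive", mask_prob=0.5, seed=1)
print(masked)
print(targets)
```

Under these assumptions, augmentation would then amount to masking a labeled sentence in the same way and letting the trained model fill each masked position with a word that fits both the context and the label.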
Pages: 16