Text data augmentations: Permutation, antonyms and negation

Cited by: 26
Authors
Haralabopoulos, Giannis [1 ]
Torres, Mercedes Torres [1 ]
Anagnostopoulos, Ioannis [2 ]
McAuley, Derek [1 ]
Affiliations
[1] Triumph Rd, Nottingham NG7 2TU, England
[2] 2-4 Papassiopoulou Str, Lamia 35100, Greece
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Text; Augmentation; Multilabel; Multiclass; LSTM;
DOI
10.1016/j.eswa.2021.114769
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text has traditionally been used to train automated classifiers for a multitude of purposes, such as classification, topic modelling and sentiment analysis. State-of-the-art LSTM classifiers require a large number of training examples to avoid biases and generalise successfully. Labelled data greatly improves classification results, but not all modern datasets include large numbers of labelled examples. Labelling is a complex task that can be expensive and time-consuming, and that can introduce biases. Data augmentation methods create synthetic data based on existing labelled examples, with the goal of improving classification results. These methods have been used successfully in image classification tasks, and recent research has extended them to text classification. We propose a method that uses sentence permutations to augment an initial dataset while retaining key statistical properties of the dataset. We evaluate our method on eight different datasets with a baseline Deep Learning process. This permutation method significantly improves classification accuracy, by an average of 4.1%. We also propose two further text augmentations that reverse the classification of each augmented example: antonym and negation. We test these two augmentations on three eligible datasets, and the results suggest an improvement in classification accuracy, averaged across all datasets, of 0.35% for antonym and 0.4% for negation, compared to our proposed permutation augmentation.
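The sentence-permutation augmentation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes documents are plain strings with period-delimited sentences, that a permuted copy keeps the original label (reordering sentences preserves the word-frequency statistics of the dataset), and the function name and parameters are hypothetical.

```python
import random

def permutation_augment(document, max_augments=5, seed=0):
    """Generate augmented copies of a document by permuting its sentences.

    Each augmented copy is assumed to keep the original document's label,
    since shuffling sentence order leaves the bag of words unchanged.
    """
    # Naive sentence split on periods; a real pipeline would use a tokenizer.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    rng = random.Random(seed)
    augmented = []
    seen = {tuple(sentences)}  # exclude the original ordering
    for _ in range(max_augments):
        perm = sentences[:]
        rng.shuffle(perm)
        if tuple(perm) not in seen:
            seen.add(tuple(perm))
            augmented.append(". ".join(perm) + ".")
    return augmented
```

The label-reversing augmentations (antonym replacement, negation insertion) would follow the same pattern but flip the class label of each generated example.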
Pages: 7