Classifier Learning from Imbalanced Corpus by Autoencoded Over-Sampling

被引:0
作者
Park, Eunkyung [1 ]
Wong, Raymond K. [1 ]
Chu, Victor W. [2 ]
机构
[1] Univ New South Wales, Sydney, NSW, Australia
[2] Nanyang Technol Univ, Singapore, Singapore
来源
PRICAI 2019: TRENDS IN ARTIFICIAL INTELLIGENCE, PT I | 2019年 / 11670卷
关键词
Imbalanced learning; Sentiment analysis; Over-sampling; Autoencoding; SMOTE;
D O I
10.1007/978-3-030-29908-8_2
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Class imbalance is a common problem in classifier learning but it is difficult to solve. Textual data are ubiquitous and their analytics have great potential in many applications. In this paper, we propose a solution to build accurate sentiment classifiers from imbalanced textual data. We first establish topic vectors to capture local and global patterns from a corpus. Synthetic minority over-sampling technique is then used to balance the data while avoiding overfitting. However, we found that residue overfitting is still prominent. To address this problem, we propose an autoencoded oversampling framework to reconstruct balanced datasets. Our extensive experiments on different datasets with various imbalanced ratios and number of classes have found that our approach is sound and effective.
引用
收藏
页码:16 / 29
页数:14
相关论文
共 31 条
[1]  
Bengio Y, 2001, ADV NEUR IN, V13, P932
[2]   Latent Dirichlet allocation [J].
Blei, DM ;
Ng, AY ;
Jordan, MI .
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) :993-1022
[3]   DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique [J].
Bunkhumpornpat, Chumphol ;
Sinapiromsaran, Krung ;
Lursinsap, Chidchanok .
APPLIED INTELLIGENCE, 2012, 36 (03) :664-684
[4]  
Bunkhumpornpat C, 2009, LECT NOTES ARTIF INT, V5476, P475, DOI 10.1007/978-3-642-01307-2_43
[5]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[6]   KATE: K-Competitive Autoencoder for Text [J].
Chen, Yu ;
Zaki, Mohammed J. .
KDD'17: PROCEEDINGS OF THE 23RD ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2017, :85-94
[7]  
Faruqui M, 2015, PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1, P1491
[8]   Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning [J].
Han, H ;
Wang, WY ;
Mao, BH .
ADVANCES IN INTELLIGENT COMPUTING, PT 1, PROCEEDINGS, 2005, 3644 :878-887
[9]   ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning [J].
He, Haibo ;
Bai, Yang ;
Garcia, Edwardo A. ;
Li, Shutao .
2008 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-8, 2008, :1322-1328
[10]  
Japkowicz N., 2004, SIGKDD Explorations, V6, P1