Optimal Feature Selection for Imbalanced Text Classification

Cited by: 17

Authors:

Khurana A. [1]

Verma O.P. [2]

Affiliations:

[1] Delhi Technological University, Department of Computer Science and Engineering, Delhi

[2] Delhi Technological University, Department of Electronics and Communication, Delhi

Source:

IEEE Transactions on Artificial Intelligence | 2023 / Vol. 4 / Issue 01

Keywords:

Class imbalance and feature selection; distributed SMOTE (D_SMOTE); modified biogeography-based optimization (M_BBO); text classification

DOI:

10.1109/TAI.2022.3144651

Abstract:

Textual data suffers from two main problems: a large number of features and class imbalance. Many conventional approaches and their variants exist in the literature to address both problems. The classic synthetic minority oversampling technique (SMOTE) is the most widely explored technique for balancing a dataset. We introduce a new balancing algorithm, distributed SMOTE (D_SMOTE), which overcomes the problem of lack of density and reduces the formation of small disjuncts. The second problem addressed is the large number of features, or high dimensionality. To handle it, we introduce a novel feature selection technique, modified biogeography-based optimization (M_BBO). M_BBO ranks variables using a feature weighting algorithm rather than ranking them randomly. We propose two new expressions in D_SMOTE and one new expression in M_BBO. Extensive experiments are carried out on four text classification datasets with four machine learning classifiers. The results are evaluated using three performance measures: area under the curve, G-mean, and F1-score. Our empirical and statistical observations on four class-imbalanced datasets show that the proposed D_SMOTE outperforms other similar oversampling techniques. We also compare our proposed algorithm, M_BBO+D_SMOTE, with other models on 17 imbalanced text classification datasets; our model outperforms the other models on 14 of them. We further compare our model with bidirectional encoder representations from transformers (BERT). To validate the experimental analysis, the statistical Friedman test is employed. © 2020 IEEE.
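
For orientation, the classic SMOTE that the abstract takes as its starting point can be sketched as below. This is a minimal illustration of the standard interpolation step only, not the paper's D_SMOTE: the two new D_SMOTE expressions are not given in this record. The function name `smote_oversample`, its parameters, and the toy data are assumptions made for the example.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by classic SMOTE interpolation.

    Illustrative sketch only: names and defaults are assumptions, and this
    is the baseline technique, not the D_SMOTE variant from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than points
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # between the base point and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional minority class (hypothetical data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_oversample(minority, n_new=6, k=3)
```

Because every synthetic point lies on a line segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.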

Pages: 135 - 147

Number of pages: 12

50 items in total

[1] Comparison of metrics for feature selection in imbalanced text classification
Ogura, Hiroshi
Amano, Hiromi
Kondo, Masato
EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (05) : 4978 - 4989
[2] A feature selection method to handle imbalanced data in text classification
Chang, Fengxiang
Guo, Jun
Xu, Weiran
Yao, Kejun
Journal of Digital Information Management, 2015, 13 (03) : 169 - 175
[3] FISA: Feature-based instance selection for imbalanced text classification
Sun, Aixin
Lim, Ee-Peng
Benatallah, Boualem
Hassan, Mahbub
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2006, 3918 : 250 - 254
[4] Feature selection method on imbalanced text
Liao, Yi-Xing
Pan, Xue-Zeng
Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2012, 41 (04) : 592 - 595
[5] Feature Selection in Text Classification
Sahin, Durmus Ozkan
Ates, Nurullah
Kilic, Erdal
2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1777 - 1780
[6] FEATURE SELECTION AND CLASSIFICATION INTEGRATED METHOD FOR IDENTIFYING CITED TEXT SPANS FOR CITANCES ON IMBALANCED DATA
Yee, Jen-Yuan
Tsai, Cheng-Jung
Hsu, Tien-Yu
Lin, Jung-Yi
Cheng, Pei-Cheng
MALAYSIAN JOURNAL OF COMPUTER SCIENCE, 2021, 34 (04) : 355 - 373
[7] An optimal approach for text feature selection
El-Hajj, Wassim
Hajj, Hazem
COMPUTER SPEECH AND LANGUAGE, 2022, 74
[8] Dynamic feature selection in text classification
Doan, Son
Horiguchi, Susumu
INTELLIGENT CONTROL AND AUTOMATION, 2006, 344 : 664 - 675
[9] Contextual feature selection for text classification
Paradis, Francois
Nie, Jian-Yun
INFORMATION PROCESSING & MANAGEMENT, 2007, 43 (02) : 344 - 352
[10] Hybrid feature selection for text classification
Gunal, Serkan
TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2012, 20 : 1296 - 1311