Optimal Feature Selection for Imbalanced Text Classification

Cited by: 17
Authors
Khurana A. [1]
Verma O.P. [2]
Affiliations
[1] Delhi Technological University, Department of Computer Science and Engineering, Delhi
[2] Delhi Technological University, Department of Electronics and Communication, Delhi
Source
IEEE Transactions on Artificial Intelligence | 2023 / Vol. 4 / No. 1
Keywords
Class imbalance and feature selection; distributed SMOTE (D_SMOTE); modified biogeography-based optimization (M_BBO); text classification
DOI
10.1109/TAI.2022.3144651
Abstract
Textual data suffers from two main problems: a large number of features and class imbalance. Many conventional approaches and their variants exist in the literature to address both problems. The classic synthetic minority oversampling technique (SMOTE) is the most explored technique for balancing a dataset. We introduce a new balancing algorithm, distributed SMOTE (D_SMOTE), which overcomes the lack of density and reduces the formation of small disjuncts. The second problem addressed is the large number of features, or high dimensionality. To handle it, we introduce a novel feature selection technique, modified biogeography-based optimization (M_BBO). M_BBO ranks variables using a feature weighting algorithm rather than ranking them randomly. We propose two new expressions in D_SMOTE and one new expression in M_BBO. Extensive experiments are carried out on four text classification datasets with four machine learning classifiers. The results are reported using three performance measures: area under the curve (AUC), G-mean, and F1-score. Our empirical and statistical observations on four class-imbalanced datasets show that the proposed D_SMOTE outperforms other similar oversampling techniques. We also compare our proposed algorithm, M_BBO+D_SMOTE, with other models on 17 imbalanced text classification datasets; our model outperforms the other models on 14 of them. We further compare our model with bidirectional encoder representations from transformers (BERT). To validate the experimental analysis, the Friedman statistical test is employed. © 2020 IEEE.
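For orientation, the following is a minimal sketch of the classic SMOTE interpolation step that the abstract names as the baseline. It is not the authors' D_SMOTE, whose two new expressions are not given in this record. The function name `smote`, its parameters (`n_synthetic`, `k`, `seed`), and the toy data are assumptions made purely for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Classic SMOTE sketch: create synthetic minority samples by
    interpolating between a sample and one of its k nearest minority
    neighbours. This is the baseline technique, not D_SMOTE."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy minority class with two features per sample.
minority = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.2], [0.3, 1.1]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the line segment between two existing minority samples, so every new point stays inside the region the minority class already occupies.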
Pages: 135-147
Number of pages: 12