Feature Selection Method Based on Weighted Mutual Information for Imbalanced Data

Cited by: 9
Authors
Li, Kewen [1]
Yu, Mingxiao [1]
Liu, Lu [1]
Li, Timing [2]
Zhai, Jiannan [3]
Affiliations
[1] China Univ Petr East China, Coll Comp & Commun Engn, Qingdao, Shandong, Peoples R China
[2] Tianjin Univ, Sch Microelect, Tianjin 300072, Peoples R China
[3] Florida Atlantic Univ, Inst Sensing & Embedded Network Syst Engn, 777 Glades Rd, Boca Raton, FL 33431 USA
Funding
National Natural Science Foundation of China;
Keywords
Feature selection; fuzzy c-means clustering; imbalanced data; mutual information; FUZZY C-MEANS; MEANS ALGORITHM;
DOI
10.1142/S0218194018500341
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Class imbalance degrades the performance of feature selection. Traditional feature selection algorithms typically assume a balanced class distribution and take overall classification accuracy as the optimization goal, so they tend to be dominated by the large classes and to ignore the small ones. This paper proposes a novel feature selection method for imbalanced data based on weighted mutual information (WMI), referred to as the WMI algorithm. The WMI algorithm assigns a different weight to each sample using the fuzzy c-means (FCM) clustering algorithm and then calculates the mutual information from those sample weights. The area under the ROC curve (AUC) is used as the evaluation criterion for the selected features. Finally, four imbalanced datasets from the NASA software defect datasets are used to validate the proposed approach. Experimental results show that the proposed method achieves higher prediction accuracy on both the minority class and the majority class.
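The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the FCM-based weight formula, so the weights here are simply passed in as an array, and the function name `weighted_mutual_information`, the restriction to discrete features, and the use of natural logarithms are all assumptions made for illustration.

```python
import numpy as np


def weighted_mutual_information(x, y, w):
    """Mutual information (in nats) between discrete arrays x and y.

    Each sample contributes its weight w[i] to the joint distribution
    instead of a count of 1. How the weights are derived (for example
    from FCM memberships) is outside this sketch.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    w = np.asarray(w, dtype=float)

    # Map arbitrary discrete values to integer codes 0..k-1.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()

    # Weighted joint distribution p(x, y).
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), w)
    joint /= joint.sum()

    # Marginals p(x) and p(y), kept 2-D so their product matches joint.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # Sum only over cells with non-zero mass to avoid log(0).
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


# Hypothetical example: three majority samples, one minority sample.
# Giving the minority sample weight 3 makes the two classes contribute
# equal mass to the estimate.
feature = [0, 0, 0, 1]
label = [0, 0, 0, 1]
weights = [1.0, 1.0, 1.0, 3.0]
print(weighted_mutual_information(feature, label, weights))
```

With uniform weights the function reduces to the ordinary plug-in estimate of mutual information, so any difference in feature ranking comes only from how the weights shift probability mass toward the minority class.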
Pages: 1177-1194
Number of pages: 18
Related papers
37 in total
[1]   DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets [J].
Alibeigi, Mina ;
Hashemi, Sattar ;
Hamzeh, Ali .
DATA & KNOWLEDGE ENGINEERING, 2012, 81-82 :67-103
[2]  
[Anonymous], 1981, ADV APPL PATTERN
[3]  
[Anonymous], 2008, P 14 ACM SIGKDD INT, DOI 10.1145/1401890.1401910
[4]   Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure [J].
Chen, SC ;
Zhang, DQ .
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2004, 34 (04) :1907-1916
[5]   Fuzzy c-means clustering methods for symbolic interval data [J].
de Carvalho, Francisco de A. T. .
PATTERN RECOGNITION LETTERS, 2007, 28 (04) :423-437
[6]   Least angle regression - Rejoinder [J].
Efron, B ;
Hastie, T ;
Johnstone, I ;
Tibshirani, R .
ANNALS OF STATISTICS, 2004, 32 (02) :494-499
[7]   Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification [J].
Frenay, Benoit ;
Doquire, Gauthier ;
Verleysen, Michel .
NEUROCOMPUTING, 2013, 112 :64-78
[8]   Gene selection for cancer classification using support vector machines [J].
Guyon, I ;
Weston, J ;
Barnhill, S ;
Vapnik, V .
MACHINE LEARNING, 2002, 46 (1-3) :389-422
[9]  
Guyon I., 2003, INTRO VARIABLE FEATU
[10]   Learning from Imbalanced Data [J].
He, Haibo ;
Garcia, Edwardo A. .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2009, 21 (09) :1263-1284