Distributed multi-label feature selection using individual mutual information measures

Cited by: 95
Authors
Gonzalez-Lopez, Jorge [1]
Ventura, Sebastian [2]
Cano, Alberto [1]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
[2] Univ Cordoba, Dept Comp Sci & Numer Anal, Cordoba, Spain
Keywords
Multi-label learning; Feature selection; Mutual information; Distributed computing; Apache spark; CLASSIFICATION; TRANSFORMATION; ALGORITHM; SPARK; KNN;
DOI
10.1016/j.knosys.2019.105052
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Multi-label learning generalizes traditional learning by allowing an instance to belong to multiple labels simultaneously. As a result, multi-label data are characterized by a large label space dimensionality and dependencies among labels. These challenges have been addressed by feature selection techniques, which improve the accuracy of the final model. However, the large number of features, combined with the large number of labels, calls for new approaches that manage data effectively and efficiently in distributed computing environments. This paper proposes a distributed model on Apache Spark that computes a score measuring the quality of each feature with respect to multiple labels. We propose two approaches to aggregating the mutual information of multiple labels: Euclidean Norm Maximization (ENM) and Geometric Mean Maximization (GMM). The former selects the features with the largest L2-norm, whereas the latter selects the features with the largest geometric mean. Experiments compare 9 distributed multi-label feature selection methods across 12 datasets and 12 metrics. Results, validated through statistical analysis, indicate that ENM outperforms the reference methods by maximizing the relevance while minimizing the redundancy of the selected features in constant selection time. (C) 2019 Elsevier B.V. All rights reserved.
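To make the two aggregation criteria concrete, the following is a minimal sketch (not the paper's Spark implementation) of how per-feature mutual information values across multiple labels could be aggregated under ENM and GMM. The feature names and MI values are illustrative assumptions, not taken from the paper:

```python
import math

# Hypothetical mutual information of each feature with each of 3 labels.
# Rows: features; columns: labels. Values are illustrative only.
mi = {
    "f1": [0.10, 0.40, 0.05],
    "f2": [0.20, 0.20, 0.20],
    "f3": [0.01, 0.90, 0.02],
}

def enm_score(values):
    """Euclidean Norm Maximization: L2-norm of the per-label MI vector."""
    return math.sqrt(sum(v * v for v in values))

def gmm_score(values):
    """Geometric Mean Maximization: geometric mean of the per-label MI vector."""
    return math.prod(values) ** (1.0 / len(values))

def select_top_k(scores, score_fn, k):
    """Rank features by the aggregated score and keep the top k."""
    ranked = sorted(scores, key=lambda f: score_fn(scores[f]), reverse=True)
    return ranked[:k]

print(select_top_k(mi, enm_score, 2))  # ENM rewards a single large MI value
print(select_top_k(mi, gmm_score, 2))  # GMM rewards balanced MI across labels
```

Note the qualitative difference the toy data exposes: ENM ranks `f3` highest because one very large MI value dominates the L2-norm, while GMM ranks `f2` highest because the geometric mean penalizes features that are nearly irrelevant to any single label.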
Pages: 13
Related papers
65 records in total
[41]   Generalized Information-Theoretic Criterion for Multi-Label Feature Selection [J].
Seo, Wangduk ;
Kim, Dae-Won ;
Lee, Jaesung .
IEEE ACCESS, 2019, 7 :122854-122863
[42]   A Comparison of Multi-label Feature Selection Methods using the Problem Transformation Approach [J].
Spolaor, Newton ;
Cherman, Everton Alvares ;
Monard, Maria Carolina ;
Lee, Huei Diana .
ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2013, 292 :135-151
[43]   ReliefF for Multi-label Feature Selection [J].
Spolaor, Newton ;
Cherman, Everton Alvares ;
Monard, Maria Carolina ;
Lee, Huei Diana .
2013 BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 2013, :6-11
[44]   Mutual information based multi-label feature selection via constrained convex optimization [J].
Sun, Zhenqiang ;
Zhang, Jia ;
Dai, Liang ;
Li, Candong ;
Zhou, Changen ;
Xin, Jiliang ;
Li, Shaozi .
NEUROCOMPUTING, 2019, 329 :447-456
[45]   Big data time series forecasting based on nearest neighbours distributed computing with Spark [J].
Talavera-Llames, R. ;
Perez-Chacon, R. ;
Troncoso, A. ;
Martinez-Alvarez, F. .
KNOWLEDGE-BASED SYSTEMS, 2018, 161 :12-25
[46]   A Framework to Generate Synthetic Multi-label Datasets [J].
Tomas, Jimena Torres ;
Spolaor, Newton ;
Cherman, Everton Alvares ;
Monard, Maria Carolina .
ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2014, 302 :155-176
[47]  
Tsoumakas G, 2007, LECT NOTES ARTIF INT, V4701, P406
[48]  
Tsoumakas G, 2010, DATA MINING AND KNOWLEDGE DISCOVERY HANDBOOK, SECOND EDITION, P667, DOI 10.1007/978-0-387-09823-4_34
[49]   Feature Selection via Global Redundancy Minimization [J].
Wang, De ;
Nie, Feiping ;
Huang, Heng .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (10) :2743-2755
[50]   Feature Selection by Maximizing Independent Classification Information [J].
Wang, Jun ;
Wei, Jin-Mao ;
Yang, Zhenglu ;
Wang, Shu-Qin .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (04) :828-841