Feature selection techniques in the context of big data: taxonomy and analysis

Cited by: 41
Authors
Abdulwahab, Hudhaifa Mohammed [1]
Ajitha, S. [1]
Saif, Mufeed Ahmed Naji [2]
Affiliations
[1] VTU Univ, Dept Comp Applicat, Ramaiah Inst Technol, Bangalore, Karnataka, India
[2] VTU Univ, Sri Jayachamarajendra Coll Engn, Dept Comp Applicat, Mysore, Karnataka, India
Keywords
Big Data; Dimensionality Reduction; Feature Selection; Streaming Feature; Particle Swarm Optimization; Ant Colony Optimization; Embedded Feature Selection; Streaming Feature Selection; Feature Subset Selection; Label Feature Selection; Online Feature Selection; Floating Search Methods; Text Feature Selection; Gene Selection
DOI
10.1007/s10489-021-03118-3
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recent advances in Information Technology (IT) have driven the rapid production of big data, as enormous volumes of data with high-dimensional features grow exponentially across different fields. Dealing with high-dimensional data therefore creates new challenges in terms of data processing efficiency and effectiveness. To address these challenges, Feature Selection (FS) is among the most widely used dimensionality reduction methods: it reduces the high dimensionality of large-scale data by selecting a small subset of relevant and significant features and eliminating irrelevant and redundant ones in order to construct effective prediction models. This article provides a comprehensive review of the latest FS approaches in the context of big data, along with a structured taxonomy that categorizes the existing methods based on their nature, search strategy, evaluation process, and feature structure. Moreover, it presents a qualitative analysis of FS methods based on their objective, structure, search strategy, schema, learning task, strengths, and weaknesses. A quantitative analysis is also performed to illustrate the number of publications related to FS by timeline, main category, and sub-category. An experimental study compares ten methods from different categories on twelve benchmark datasets from the University of California, Irvine (UCI) Machine Learning Repository and the Arizona State University (ASU) Feature Selection Repository, evaluating their performance in terms of accuracy, precision, recall, F-measure, and the number of selected features. Finally, we highlight the research issues and open challenges related to FS to assist researchers in identifying future research directions.
Pages: 13568-13613
Number of pages: 46