A greedy feature selection algorithm for Big Data of high dimensionality

Authors
Ioannis Tsamardinos
Giorgos Borboudakis
Pavlos Katsogridakis
Polyvios Pratikakis
Vassilis Christophides
Affiliations
[1] Computer Science Department, University of Crete
[2] Gnosis Data Analysis PC
[3] Institute of Computer Science, Foundation for Research and Technology - Hellas
[4] INRIA
Source
Machine Learning | 2019 / Vol. 108
Keywords
Feature selection; Variable selection; Forward selection; Big Data; Data analytics
Abstract
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. By employing p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
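The idea of combining partition-local p-values can be illustrated with a minimal sketch. It assumes Fisher's combined probability test as the meta-analysis step (the abstract does not name the technique) and a simple significance threshold for Early Dropping. The function names `fisher_combine` and `early_drop`, the `alpha` parameter, and the input layout are hypothetical and are not taken from the paper.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the survival function has a closed form,
    so no external library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # half_stat is X / 2; clamp p to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    # sf(X; 2k) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop(local_p, alpha=0.05):
    """Illustrative Early Dropping over row partitions.

    local_p maps each candidate feature to the list of p-values computed
    independently on each row partition. Features whose combined p-value
    exceeds alpha are dropped from later iterations.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p.items()}
    kept = {f: p for f, p in combined.items() if p <= alpha}
    return combined, kept


if __name__ == "__main__":
    # Hypothetical p-values from two row partitions per feature.
    local = {"x1": [0.01, 0.02], "x2": [0.6, 0.7]}
    combined, kept = early_drop(local)
    print(sorted(kept))
```

Only one p-value per feature and partition has to be communicated in this sketch, which is the kind of low-communication scheme the abstract describes; the actual PFBP procedure may differ in its details.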
Pages: 149 - 202
Page count: 53
Related papers
50 in total
  • [31] Feature Selection in Big Data using Filter Based Techniques
    Srinivas, Sumitra K.
    Kancharla, Gangadhara Rao
    2019 4TH MEC INTERNATIONAL CONFERENCE ON BIG DATA AND SMART CITY (ICBDSC), 2019, : 139 - 145
  • [32] Feature selection techniques in the context of big data: taxonomy and analysis
    Abdulwahab, Hudhaifa Mohammed
    Ajitha, S.
    Saif, Mufeed Ahmed Naji
    APPLIED INTELLIGENCE, 2022, 52 (12) : 13568 - 13613
  • [33] Overview Of Feature Subset Selection Algorithm For High Dimensional Data
    Gandhi, Swati S.
    Prabhune, S. S.
    PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON INVENTIVE SYSTEMS AND CONTROL (ICISC 2017), 2017, : 618 - 623
  • [34] A Feature Selection Method for Comparision of Each Concept in Big Data
    Nakanishi, Takafumi
    2015 IEEE/ACIS 14TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS), 2015, : 229 - 234
  • [35] Feature Selection Techniques for Big Data Analytics
    Albattah, Waleed
    Khan, Rehan Ullah
    Alsharekh, Mohammed F.
    Khasawneh, Samer F.
    ELECTRONICS, 2022, 11 (19)
  • [36] Intrusion Feature Selection Using Modified Heuristic Greedy Algorithm of Itemset
    Onpans, Janya
    Rasmequan, Suwanna
    Jantarakongkul, Benchaporn
    Chinnasarn, Krisana
    Rodtook, Annupan
    2013 13TH INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES (ISCIT): COMMUNICATION AND INFORMATION TECHNOLOGY FOR NEW LIFE STYLE BEYOND THE CLOUD, 2013, : 627 - 632
  • [37] Feature selection based on an improved cat swarm optimization algorithm for big data classification
    Kuan-Cheng Lin
    Kai-Yuan Zhang
    Yi-Hung Huang
    Jason C. Hung
    Neil Yen
    The Journal of Supercomputing, 2016, 72 : 3210 - 3221
  • [38] Hybrid Approach of SVM and Feature Selection Based Optimization Algorithm for Big Data Security
    Duhan, Bharti
    Dhankhar, Neetu
    PROCEEDINGS OF ICETIT 2019: EMERGING TRENDS IN INFORMATION TECHNOLOGY, 2020, 605 : 694 - 706
  • [39] Feature selection based on an improved cat swarm optimization algorithm for big data classification
    Lin, Kuan-Cheng
    Zhang, Kai-Yuan
    Huang, Yi-Hung
    Hung, Jason C.
    Yen, Neil
    JOURNAL OF SUPERCOMPUTING, 2016, 72 (08) : 3210 - 3221
  • [40] Cooperative co-evolution for feature selection in Big Data with random feature grouping
    Rashid, A. N. M. Bazlur
    Ahmed, Mohiuddin
    Sikos, Leslie F.
    Haskell-Dowland, Paul
    JOURNAL OF BIG DATA, 2020, 7 (01)