A greedy feature selection algorithm for Big Data of high dimensionality

Authors
Ioannis Tsamardinos
Giorgos Borboudakis
Pavlos Katsogridakis
Polyvios Pratikakis
Vassilis Christophides
Affiliations
[1] University of Crete, Computer Science Department
[2] Gnosis Data Analysis PC, Institute of Computer Science
[3] Foundation for Research and Technology - Hellas
[4] INRIA
Source
Machine Learning | 2019 / Volume 108
Keywords
Feature selection; Variable selection; Forward selection; Big Data; Data analytics
DOI
Not available
Abstract
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix both by rows and by columns. By combining p-values of conditional independence tests with meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, which allows the computations to be massively parallelized. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions: Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and the number of processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
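The idea of combining partition-local p-values and then dropping features early can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says "meta-analysis techniques", so the use of Fisher's combined probability test here is an assumption, as are the function names `fisher_combine` and `early_drop`, the threshold `alpha`, and the made-up p-values.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (assumed here).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; for even degrees of freedom
    the survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Clamp to avoid log(0); half_stat equals X / 2.
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    term = math.exp(-half_stat)
    total = term
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, total)


def early_drop(local_p_values, alpha=0.05):
    """Hypothetical helper: one selection step over partition-local p-values.

    local_p_values maps each candidate feature to a list of p-values,
    one per row partition. Features whose combined p-value exceeds alpha
    are dropped from later iterations; the feature with the smallest
    combined p-value among the rest is the winner of this iteration.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p_values.items()}
    kept = {f: p for f, p in combined.items() if p <= alpha}
    dropped = sorted(f for f in combined if f not in kept)
    best = min(kept, key=kept.get) if kept else None
    return best, kept, dropped


# Illustrative values only: three features, three row partitions each.
local_p = {
    "x1": [0.01, 0.03, 0.02],
    "x2": [0.40, 0.75, 0.55],
    "x3": [0.04, 0.20, 0.10],
}
best, kept, dropped = early_drop(local_p)
print("selected:", best, "| still candidates:", sorted(kept), "| dropped:", dropped)
```

Only the per-partition p-values need to be communicated in such a scheme, which is the property the abstract attributes to PFBP; the actual tests, combination rule, and thresholds used in the paper may differ.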
Pages: 149-202
Number of pages: 53