A greedy feature selection algorithm for Big Data of high dimensionality

Authors
Ioannis Tsamardinos
Giorgos Borboudakis
Pavlos Katsogridakis
Polyvios Pratikakis
Vassilis Christophides
Affiliations
[1] University of Crete, Computer Science Department
[2] Gnosis Data Analysis PC, Institute of Computer Science
[3] Foundation for Research and Technology - Hellas
[4] INRIA
Source
Machine Learning | 2019 / Volume 108
Keywords
Feature selection; Variable selection; Forward selection; Big Data; Data analytics
DOI
Not available
Abstract
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. By employing p-values of conditional independence tests together with meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
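To illustrate how p-values computed locally on each row partition might be merged, the following is a minimal Python sketch. It assumes Fisher's combined probability test as the meta-analysis technique, which the abstract does not name, and a simple significance threshold for Early Dropping. The function names `fisher_combine` and `early_drop`, the input layout, and the threshold `alpha` are hypothetical and are not taken from the paper.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, -2 * sum(log p_i) follows a chi-squared
    distribution with 2k degrees of freedom, where k = len(p_values).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    # Half of the Fisher statistic; clamp to avoid log(0) on underflowed p-values.
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    if half_stat == 0.0:
        return 1.0  # every local p-value was exactly 1
    # For even degrees of freedom 2k the chi-squared survival function is
    # sum_{i<k} exp(-h) * h**i / i!, with h = statistic / 2.
    # Each term is evaluated in log space to avoid overflow.
    log_h = math.log(half_stat)
    total = sum(
        math.exp(i * log_h - half_stat - math.lgamma(i + 1)) for i in range(k)
    )
    return min(1.0, total)


def early_drop(local_p_values, alpha=0.05):
    """Return the features whose combined p-value is at most alpha.

    local_p_values maps a feature name to the list of p-values obtained
    independently on each row partition (assumed layout).
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p_values.items()}
    return {f: p for f, p in combined.items() if p <= alpha}
```

In this sketch each partition would report one p-value per candidate feature, so only a short vector of numbers travels between workers. For example, `early_drop({"x1": [0.01, 0.01], "x2": [0.6, 0.7]})` keeps `x1` and drops `x2`.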
Pages: 149-202
Number of pages: 53