A greedy feature selection algorithm for Big Data of high dimensionality

Authors
Ioannis Tsamardinos
Giorgos Borboudakis
Pavlos Katsogridakis
Polyvios Pratikakis
Vassilis Christophides
Affiliations
[1] University of Crete, Computer Science Department
[2] Gnosis Data Analysis PC, Institute of Computer Science
[3] Foundation for Research and Technology - Hellas
[4] INRIA
Source
Machine Learning | 2019 / Vol. 108
Keywords
Feature selection; Variable selection; Forward selection; Big Data; Data analytics
Abstract
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. By employing p-values of conditional independence tests together with meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
Pages: 149-202
Number of pages: 53
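
The abstract describes combining p-values computed locally on each data partition through meta-analysis, and dropping features early when they look uninformative. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation: it assumes Fisher's combined probability test as the meta-analysis rule (the abstract does not name one), treats the per-partition p-values as independent, and uses hypothetical names (`fisher_combine`, `forward_iteration`, `alpha`). It covers only one forward iteration with Early Dropping, and it takes the local p-values as given rather than running conditional independence tests.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has a closed form:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    tiny = 1e-300  # clamp to avoid log(0)
    half_stat = -sum(math.log(max(p, tiny)) for p in p_values)  # equals X/2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def forward_iteration(local_p, alpha=0.05):
    """One forward step over per-partition p-values.

    local_p maps a feature name to the list of p-values computed
    independently on each row partition (hypothetical input layout).
    Returns the selected feature (or None) and the set of features
    dropped from later iterations.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p.items()}
    # Early Dropping: features whose combined p-value exceeds alpha
    # are removed from consideration in subsequent iterations.
    survivors = {f: p for f, p in combined.items() if p <= alpha}
    winner = min(survivors, key=survivors.get) if survivors else None
    dropped = set(combined) - set(survivors)
    return winner, dropped


if __name__ == "__main__":
    # Made-up p-values for three features on two row partitions.
    local_p = {"x1": [0.01, 0.02], "x2": [0.6, 0.7], "x3": [0.04, 0.2]}
    print(forward_iteration(local_p))
```

Only the short lists of local p-values need to be gathered to decide on a feature, which is the sense in which the work stays local to each partition. Early Stopping and Early Return would need additional logic and are not modelled here.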