A greedy feature selection algorithm for Big Data of high dimensionality

Authors
Ioannis Tsamardinos
Giorgos Borboudakis
Pavlos Katsogridakis
Polyvios Pratikakis
Vassilis Christophides
Affiliations
[1] University of Crete, Computer Science Department
[2] Gnosis Data Analysis PC, Institute of Computer Science
[3] Foundation for Research and Technology - Hellas
[4] INRIA
Source
Machine Learning | 2019 / Vol. 108
Keywords
Feature selection; Variable selection; Forward selection; Big Data; Data analytics
Abstract
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application to simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
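The idea of combining partition-local p-values can be illustrated with a rough sketch. This is not the authors' implementation: it assumes Fisher's combined probability test as the meta-analysis technique (the abstract does not name one), uses a plain Pearson correlation test with an empty conditioning set as the local test, partitions rows only, and shows only the Early Dropping heuristic. All function names and the synthetic data are hypothetical.

```python
import math
import random


def local_p_value(xs, ys):
    """Two-sided p-value for zero Pearson correlation (Fisher z-transform).

    Stands in for a conditional independence test with an empty
    conditioning set; computed on one row partition only.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 1.0
    r = max(-0.999999, min(0.999999, sxy / math.sqrt(sxx * syy)))
    z = abs(math.atanh(r)) * math.sqrt(n - 3)
    return math.erfc(z / math.sqrt(2.0))


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (assumed here)."""
    k = len(p_values)
    half = -sum(math.log(max(p, 1e-300)) for p in p_values)  # statistic / 2
    if half == 0.0:
        return 1.0
    # Chi-square survival function with 2k degrees of freedom, closed form
    # for even degrees of freedom, evaluated in log space for stability.
    sf = sum(
        math.exp(i * math.log(half) - math.lgamma(i + 1) - half)
        for i in range(k)
    )
    return min(1.0, sf)


def forward_step(partitions, target_key, candidates, alpha=0.05):
    """One forward iteration: pick the most significant feature.

    Each partition is a dict mapping column name to a list of values.
    Only the local p-values cross partition boundaries.
    """
    combined = {}
    for name in candidates:
        local = [local_p_value(p[name], p[target_key]) for p in partitions]
        combined[name] = fisher_combine(local)
    # Early Dropping: features that look independent of the target are
    # removed from consideration in later iterations.
    alive = [f for f in candidates if combined[f] <= alpha]
    winner = min(alive, key=combined.get) if alive else None
    return winner, alive, combined


def make_partitions(n_parts=4, rows_per_part=200, seed=0):
    """Hypothetical synthetic data: y depends on x_signal, not on x_noise."""
    rng = random.Random(seed)
    parts = []
    for _ in range(n_parts):
        signal = [rng.gauss(0, 1) for _ in range(rows_per_part)]
        noise = [rng.gauss(0, 1) for _ in range(rows_per_part)]
        y = [s + rng.gauss(0, 1) for s in signal]
        parts.append({"x_signal": signal, "x_noise": noise, "y": y})
    return parts
```

Calling `forward_step(make_partitions(), "y", ["x_signal", "x_noise"])` returns the selected feature, the features surviving Early Dropping, and the combined p-values. A fuller version would condition on already selected features and also partition columns, as the abstract describes.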
Pages: 149 - 202
Page count: 53
Related papers
50 in total
  • [1] A greedy feature selection algorithm for Big Data of high dimensionality
    Tsamardinos, Ioannis
    Borboudakis, Giorgos
    Katsogridakis, Pavlos
    Pratikakis, Polyvios
    Christophides, Vassilis
    MACHINE LEARNING, 2019, 108 (02) : 149 - 202
  • [2] Feature Selection Using Genetic Algorithm for Big Data
    Saidi, Rania
    Ncir, Waad Bouaguel
    Essoussi, Nadia
    INTERNATIONAL CONFERENCE ON ADVANCED MACHINE LEARNING TECHNOLOGIES AND APPLICATIONS (AMLTA2018), 2018, 723 : 352 - 361
  • [3] A STUDY ON FEATURE SELECTION IN BIG DATA
    Manikandan, R. P. S.
    Kalpana, A. M.
    2017 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS (ICCCI), 2017,
  • [4] A New Approach for Wrapper Feature Selection Using Genetic Algorithm for Big Data
    Bouaguel, Waad
    INTELLIGENT AND EVOLUTIONARY SYSTEMS, IES 2015, 2016, 5 : 75 - 83
  • [5] Algorithm for key classification feature selection of big data based on henie theorem
    Wang W.
    International Journal of Circuits, Systems and Signal Processing, 2021, 15 : 1208 - 1213
  • [6] An ACO–ANN based feature selection algorithm for big data
    R. Joseph Manoj
    M. D. Anto Praveena
    K. Vijayakumar
    Cluster Computing, 2019, 22 : 3953 - 3960
  • [7] Challenges of Feature Selection for Big Data Analytics
    Li J.
    Liu H.
    1600, Institute of Electrical and Electronics Engineers Inc., United States (32): : 9 - 15
  • [8] Feature selection algorithm of network attack big data under the interference of fading noise
    Zheng X.
    International Journal of Computers and Applications, 2022, 44 (09) : 807 - 813
  • [9] Feature selection based on a crow search algorithm for big data classification
    Al-Thanoon, Niam Abdulmunim
    Algamal, Zakariya Yahya
    Qasim, Omar Saber
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2021, 212
  • [10] Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems
    Tareq Abed Mohammed
    Oguz Bayat
    Osman N. Uçan
    Shaymaa Alhayali
    Foundations of Science, 2020, 25 : 1009 - 1025