Towards Ultrahigh Dimensional Feature Selection for Big Data

Cited by: 0
Authors
Tan, Mingkui [1]
Tsang, Ivor W. [2]
Wang, Li [3]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore
[2] Univ Technol Sydney, Ctr Quantum Computat & Intelligent Syst, Broadway, NSW 2007, Australia
[3] Univ Calif San Diego, Dept Math, La Jolla, CA 92093 USA
Funding
Australian Research Council
Keywords
big data; ultrahigh dimensionality; feature selection; nonlinear feature selection; multiple kernel learning; feature generation; MULTIPLE; CLASSIFICATION; OPTIMIZATION; CONVERGENCE; ONLINE; CANCER; LASSO;
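DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract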
In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and then reformulate it as a convex semi-infinite programming (SIP) problem. To address the SIP, we propose an efficient feature generating paradigm. Unlike traditional gradient-based approaches that conduct optimization on all input features, the proposed paradigm iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. Owing to this optimization scheme, efficient caching techniques are also developed. The feature generating paradigm is guaranteed to converge globally under mild conditions, and can achieve lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world data sets with tens of millions of data points and O(10^14) features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency.
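The "activate a group of features, then solve a subproblem on the active set" loop described in the abstract can be illustrated with a much simpler stand-in. The sketch below is not the authors' algorithm: it assumes a squared loss, scores features by their correlation with the current residual, and substitutes ridge regression for the MKL subproblem and its accelerated proximal gradient solver. The function name `feature_generating_sketch` and the parameters `group_size`, `ridge`, and `tol` are hypothetical and chosen only for illustration.

```python
import numpy as np

def feature_generating_sketch(X, y, group_size=2, max_iters=5, ridge=1e-3, tol=1e-4):
    """Greedy active-set loop: activate a few features, then refit on them only.

    Illustrative stand-in only; ridge regression replaces the MKL subproblem.
    """
    n, d = X.shape
    active = []                       # indices of activated features
    w = np.zeros(d)
    for _ in range(max_iters):
        residual = y - X @ w
        # Stop once the relative residual is small enough.
        if np.linalg.norm(residual) <= tol * max(1.0, np.linalg.norm(y)):
            break
        inactive = np.setdiff1d(np.arange(d), np.array(active, dtype=int))
        if inactive.size == 0:
            break
        # Score inactive features by absolute correlation with the residual
        # and activate the top `group_size` of them.
        scores = np.abs(X[:, inactive].T @ residual)
        top = inactive[np.argsort(scores)[::-1][:group_size]]
        active.extend(int(j) for j in top)
        # Subproblem restricted to the active features only.
        Xa = X[:, active]
        wa = np.linalg.solve(Xa.T @ Xa + ridge * np.eye(len(active)), Xa.T @ y)
        w = np.zeros(d)
        w[active] = wa
    return sorted(active), w

# Small synthetic check: only features 3 and 17 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[3], w_true[17] = 5.0, -4.0
y_demo = X_demo @ w_true
selected, w_hat = feature_generating_sketch(X_demo, y_demo)
```

The point of the design is that each refit touches only the columns in `active`, so the per-iteration cost depends on the number of activated features rather than on the full dimensionality; the only full pass over the features is the scoring step.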
Pages: 1371 - 1429
Page count: 59
Related papers
50 in total
  • [31] A greedy feature selection algorithm for Big Data of high dimensionality
    Tsamardinos, Ioannis
    Borboudakis, Giorgos
    Katsogridakis, Pavlos
    Pratikakis, Polyvios
    Christophides, Vassilis
    MACHINE LEARNING, 2019, 108 (02) : 149 - 202
  • [32] Feature Selection Techniques for Big Data Analytics
    Albattah, Waleed
    Khan, Rehan Ullah
    Alsharekh, Mohammed F.
    Khasawneh, Samer F.
    ELECTRONICS, 2022, 11 (19)
  • [33] Lightweight Feature Selection Methods Based on Standardized Measure of Dispersion for Mining Big Data
    Fong, Simon
    Biuk-Aghai, Robert P.
    Si, Yain-Whar
    2016 IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (CIT), 2016, : 553 - 559
  • [34] Feature screening and variable selection for partially linear models with ultrahigh-dimensional longitudinal data
    Liu, Jingyuan
    NEUROCOMPUTING, 2016, 195 : 202 - 210
  • [35] A filter feature selection for high-dimensional data
    Janane, Fatima Zahra
    Ouaderhman, Tayeb
    Chamlal, Hasna
    JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY, 2023, 17
  • [36] Ensemble feature selection for high dimensional data: a new method and a comparative study
    Ben Brahim, Afef
    Limam, Mohamed
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2018, 12 (04) : 937 - 952
  • [37] Feature selection in ultrahigh-dimensional additive models with heterogeneous frequency component functions
    Liu, Yuyang
    Luo, Shan
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2023, 225 : 243 - 265
  • [38] Feature selection for high dimensional imbalanced class data using harmony search
    Moayedikia, Alireza
    Ong, Kok-Leong
    Boo, Yee Ling
    Yeoh, William G. S.
    Jensen, Richard
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2017, 57 : 38 - 49
  • [39] A graph based preordonnances theoretic supervised feature selection in high dimensional data
    Chamlal, Hasna
    Ouaderhman, Tayeb
    Aaboub, Fadwa
    KNOWLEDGE-BASED SYSTEMS, 2022, 257
  • [40] Feature Selection for High-Dimensional Data Through Instance Vote Combining
    Chamakura, Lily
    Saha, Goutam
    PROCEEDINGS OF THE 7TH ACM IKDD CODS AND 25TH COMAD (CODS-COMAD 2020), 2020, : 161 - 169