Comparing feature selection methods for high-dimensional imbalanced data: identifying rheumatoid arthritis cohorts from routine data

Cited by: 0
Authors
Fernandez-Gutierrez, Fabiola [1 ]
Kennedy, Jonathan I. [1 ]
Zhou, Shang-Ming [1 ]
Cooksey, Roxanne [1 ]
Atkinson, Mark [1 ]
Brophy, Sinead [1 ]
Affiliations
[1] Swansea Univ, Coll Med, Farr Inst Hlth Informat Res, Swansea, W Glam, Wales
Source
2015 INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND SYSTEMS MANAGEMENT (IESM) | 2015
Keywords
data mining; feature selection; rheumatoid arthritis; high-dimensional data; imbalanced data; DISEASE; LEFLUNOMIDE; EFFICACY; SAFETY; DRUG;
DOI
Not available
Chinese Library Classification
T [Industrial Technology];
Discipline classification code
08;
Abstract
Linking routine and administrative databases from multiple sources offers an advantageous way to understand chronic diseases such as arthropathies. Data mining classification algorithms can be a cost-effective way to identify cohorts of patients with particular disorders within these complex databases. However, selecting good candidate predictors for a given condition from a patient's extensive health record can be challenging, particularly when prevalence is low, because the result is a high-dimensional, imbalanced data space. A feature selection (FS) methodology is proposed to overcome this problem; it provides a fast way to find relevant predictors and can potentially improve classifier performance. This study compared the performance of five FS methods (binomial distribution, chi-square, information gain, GINI and DKM), using as an exemplar a dataset of routine data from the Abertawe Bro Morgannwg University Health Board (UK) linked to a specialized rheumatology database (CELLMA) to identify rheumatoid arthritis (RA) patients. Preliminary results showed that an initial list of 36243 possible predictors could be reduced to fewer than 200 while still achieving a desirable performance in identifying RA patients. Chi-square and GINI selected combinations of predictors with the highest accuracy and positive predictive values earlier than the other methods.
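As a rough illustration of how four of the scoring criteria named in the abstract can rank a binary predictor against a binary RA label, the sketch below uses the textbook definitions of chi-square, information gain (entropy reduction), Gini impurity reduction and DKM impurity reduction. It is not the authors' implementation: the function names, the 2x2 table layout and the toy counts (`code_A`, `code_B`, `code_C`) are assumptions made for this example, and the binomial-distribution method is left out because the abstract does not say how it was applied.

```python
import math

# Impurity of a binary class distribution with positive-class proportion p.
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def dkm(p):
    return 2 * math.sqrt(p * (1 - p))

def impurity_gain(tp, fp, fn, tn, impurity):
    """Impurity reduction from splitting on one binary feature.

    tp/fp: cases/non-cases WITH the feature,
    fn/tn: cases/non-cases WITHOUT it (hypothetical layout).
    """
    n = tp + fp + fn + tn
    parent = impurity((tp + fn) / n)
    children = 0.0
    for positives, size in ((tp, tp + fp), (fn, fn + tn)):
        if size:
            children += (size / n) * impurity(positives / size)
    return parent - children

def chi_square(tp, fp, fn, tn):
    """Pearson chi-square statistic of the 2x2 feature-by-class table."""
    n = tp + fp + fn + tn
    observed = ((tp, fp), (fn, tn))
    rows = (tp + fp, fn + tn)
    cols = (tp + fn, fp + tn)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# One scoring function per criterion, all taking (tp, fp, fn, tn).
SCORES = {
    "chi_square": chi_square,
    "information_gain": lambda *t: impurity_gain(*t, entropy),
    "gini": lambda *t: impurity_gain(*t, gini),
    "dkm": lambda *t: impurity_gain(*t, dkm),
}

def rank_features(tables, score):
    """Return feature names ordered from highest to lowest score."""
    return sorted(tables, key=lambda name: score(*tables[name]), reverse=True)

# Made-up counts: 1000 patients, 50 cases (5% prevalence).
toy_tables = {
    "code_A": (40, 10, 10, 940),   # strongly associated with the label
    "code_B": (25, 475, 25, 475),  # independent of the label
    "code_C": (5, 45, 45, 905),    # weakly associated
}

if __name__ == "__main__":
    for name, score in SCORES.items():
        print(name, rank_features(toy_tables, score))
```

In a setting like the one described, each candidate predictor would be scored this way and only the top-ranked ones passed on to the classifier; where to cut the ranking is a separate choice that this sketch does not address.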
Pages: 236-241
Number of pages: 6