Comparing feature selection methods for high-dimensional imbalanced data: identifying rheumatoid arthritis cohorts from routine data

Authors
Fernandez-Gutierrez, Fabiola [1]
Kennedy, Jonathan I. [1]
Zhou, Shang-Ming [1]
Cooksey, Roxanne [1]
Atkinson, Mark [1]
Brophy, Sinead [1]
Affiliations
[1] Swansea Univ, Coll Med, Farr Inst Hlth Informat Res, Swansea, W Glam, Wales
Source
2015 INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND SYSTEMS MANAGEMENT (IESM) | 2015
Keywords
data mining; feature selection; rheumatoid arthritis; high-dimensional data; imbalanced data; DISEASE; LEFLUNOMIDE; EFFICACY; SAFETY; DRUG;
DOI
Not available
Chinese Library Classification
T [Industrial Technology];
Subject classification code
08 ;
Abstract
Linking routine and administrative databases from multiple sources offers an advantageous way to understand chronic diseases such as arthropathy conditions. Data mining classification algorithms can be a cost-effective approach to identifying patient cohorts with certain disorders within these complex databases. However, selecting good candidate predictors for a given condition from a patient's extensive health records can be challenging, particularly when prevalence is low, which leads to a high-dimensional, imbalanced data space. A Feature Selection (FS) methodology is proposed to overcome this problem, providing a fast way to find relevant predictors and potentially improving classifier performance. This study compared the performance of five FS methods (Binomial distribution, Chi-square, Information Gain, GINI and DKM), using as an exemplar a dataset of routine data from the Abertawe Bro Morgannwg University Health Board (UK) linked to a specialized rheumatology database (CELLMA) to identify Rheumatoid Arthritis (RA) patients. Preliminary results showed that an initial list of 36243 possible predictors could be reduced to fewer than 200 while still achieving a desirable performance in identifying RA patients. Chi-square and GINI selected combinations of predictors with the highest accuracy and positive predictive values earlier than the other methods.
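To make the filter-style scoring named in the abstract more concrete, the following is a minimal Python sketch of four of the five criteria (Chi-square, Information Gain, GINI and DKM) applied to a single binary predictor against a binary class label. It is not the authors' implementation: the function names, the 2x2 count layout (`tp`, `fp`, `fn`, `tn`) and the example counts are assumptions made for illustration, and the Binomial distribution method is omitted because the abstract does not describe how it was applied.

```python
import math

# Impurity functions for a binary class with positive-class proportion p.
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    return 2 * p * (1 - p)

def dkm(p):
    return 2 * math.sqrt(p * (1 - p))

def impurity_gain(tp, fp, fn, tn, impurity):
    """Drop in class impurity after splitting on a binary feature.

    tp: positives with the feature, fp: negatives with the feature,
    fn: positives without it,      tn: negatives without it.
    With `entropy` this is Information Gain; `gini` and `dkm`
    give the corresponding GINI and DKM scores.
    """
    n = tp + fp + fn + tn
    n_with, n_without = tp + fp, fn + tn
    parent = impurity((tp + fn) / n)
    children = 0.0
    if n_with:
        children += n_with / n * impurity(tp / n_with)
    if n_without:
        children += n_without / n * impurity(fn / n_without)
    return parent - children

def chi_square(tp, fp, fn, tn):
    """Pearson chi-square statistic for the 2x2 feature/class table."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:
        return 0.0
    return n * (tp * tn - fp * fn) ** 2 / denom

def rank_features(tables, scorer):
    """Return feature names sorted from highest to lowest score."""
    return sorted(tables, key=lambda name: scorer(*tables[name]), reverse=True)

# Hypothetical counts for an imbalanced cohort (10 cases, 90 non-cases).
tables = {
    "perfect_code": (10, 0, 0, 90),
    "partial_code": (6, 9, 4, 81),
    "noise_code": (5, 45, 5, 45),
}
ranking = rank_features(tables, chi_square)
```

In this sketch, keeping the first k names of `ranking` would correspond to reducing a large predictor list to a short one; swapping `chi_square` for `lambda *t: impurity_gain(*t, gini)` ranks by the GINI criterion instead.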
Pages: 236-241
Number of pages: 6