Comparing feature selection methods for high-dimensional imbalanced data: identifying rheumatoid arthritis cohorts from routine data

Authors
Fernandez-Gutierrez, Fabiola [1]
Kennedy, Jonathan I. [1]
Zhou, Shang-Ming [1]
Cooksey, Roxanne [1]
Atkinson, Mark [1]
Brophy, Sinead [1]
Affiliations
[1] Swansea Univ, Coll Med, Farr Inst Hlth Informat Res, Swansea, W Glam, Wales
Source
2015 INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND SYSTEMS MANAGEMENT (IESM) | 2015
Keywords
data mining; feature selection; rheumatoid arthritis; high-dimensional data; imbalanced data; DISEASE; LEFLUNOMIDE; EFFICACY; SAFETY; DRUG
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Linking routine and administrative databases from multiple sources offers an advantageous way to understand chronic diseases, such as arthropathy conditions. Data mining classification algorithms can be a cost-effective approach to identifying cohorts of patients with certain disorders within these complex databases. However, selecting good candidate predictors for a given condition from a patient's extensive health records can be challenging, particularly when the condition has a low prevalence, which leads to a high-dimensional, imbalanced data space. A Feature Selection (FS) methodology is proposed to overcome this problem, providing a fast way to find relevant predictors and potentially improving classifier performance. This study compared the performance of five FS methods (Binomial distribution, Chi-square, Information Gain, GINI and DKM), using as the exemplar a dataset of routine data from the Abertawe Bro Morgannwg University Health Board (UK) linked to a specialized rheumatology database (CELLMA), to identify patients with Rheumatoid Arthritis (RA). Preliminary results showed that an initial list of 36243 possible predictors could be reduced to fewer than 200 while still achieving a desirable performance in identifying RA patients. Chi-square and GINI selected combinations of predictors with the highest accuracy and positive predictive values earlier than the other methods.
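The abstract names its scoring criteria but does not define them. As a rough illustration only, the sketch below scores one binary predictor against a binary RA label using common textbook forms of four of them: the Pearson chi-square statistic, and impurity reduction under entropy (Information Gain), Gini and DKM. The function names, the 2x2 count layout and the example counts are assumptions made for this sketch, not the authors' implementation; the Binomial distribution method is left out because the abstract gives no detail on how it was applied.

```python
import math


def entropy(p):
    """Binary entropy in bits; defined as 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def gini(p):
    """Gini impurity for a binary class with positive rate p."""
    return 2 * p * (1 - p)


def dkm(p):
    """DKM impurity (assumed form: 2 * sqrt(p * (1 - p)))."""
    return 2 * math.sqrt(p * (1 - p))


def impurity_gain(tp, fp, fn, tn, impurity):
    """Impurity reduction from splitting on one binary predictor.

    tp: cases with the predictor, fp: non-cases with the predictor,
    fn: cases without it, tn: non-cases without it (hypothetical layout).
    """
    n = tp + fp + fn + tn
    with_f, without_f = tp + fp, fn + tn
    parent = impurity((tp + fn) / n)
    children = 0.0
    if with_f:
        children += with_f / n * impurity(tp / with_f)
    if without_f:
        children += without_f / n * impurity(fn / without_f)
    return parent - children


def chi_square(tp, fp, fn, tn):
    """Pearson chi-square statistic for the same 2x2 table."""
    n = tp + fp + fn + tn
    rows = (tp + fp, fn + tn)
    cols = (tp + fn, fp + tn)
    observed = ((tp, fp), (fn, tn))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def score_predictor(tp, fp, fn, tn):
    """Return all four scores for one predictor as a dict."""
    return {
        "chi_square": chi_square(tp, fp, fn, tn),
        "information_gain": impurity_gain(tp, fp, fn, tn, entropy),
        "gini": impurity_gain(tp, fp, fn, tn, gini),
        "dkm": impurity_gain(tp, fp, fn, tn, dkm),
    }


if __name__ == "__main__":
    # Made-up counts: 200 patients, 20 cases (10% prevalence).
    print(score_predictor(tp=15, fp=9, fn=5, tn=171))
```

Under these assumptions, each candidate predictor would be scored once per criterion and the list sorted in descending order, with the top-ranked predictors passed to the classifier. A predictor that occurs equally often in cases and non-cases scores zero on all four criteria.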
Pages: 236-241
Number of pages: 6