Comparing feature selection methods for high-dimensional imbalanced data: identifying rheumatoid arthritis cohorts from routine data

Cited by: 0

Authors:

Fernandez-Gutierrez, Fabiola [1]

Kennedy, Jonathan I. [1]

Zhou, Shang-Ming [1]

Cooksey, Roxanne [1]

Atkinson, Mark [1]

Brophy, Sinead [1]

Affiliations:

[1] Swansea Univ, Coll Med, Farr Inst Hlth Informat Res, Swansea, W Glam, Wales

Source:

2015 INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND SYSTEMS MANAGEMENT (IESM) | 2015

Keywords:

data mining; feature selection; rheumatoid arthritis; high-dimensional data; imbalanced data; DISEASE; LEFLUNOMIDE; EFFICACY; SAFETY; DRUG;

DOI:

Not available

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

Linking routine and administrative databases from multiple sources offers an advantageous way to understand chronic diseases such as arthropathies. Data mining classification algorithms can be a cost-effective approach to identifying cohorts of patients with particular disorders within these complex databases. However, selecting good candidate predictors of a given condition from a patient's extensive health records can be challenging, particularly when prevalence is low, which leads to a high-dimensional, imbalanced data space. A Feature Selection (FS) methodology is proposed to overcome this problem, providing a fast way to find relevant predictors and potentially improving classifier performance. This study compared the performance of five FS methods (Binomial distribution, Chi-square, Information Gain, GINI and DKM), using as the exemplar a dataset of routine data from the Abertawe Bro Morgannwg University Health Board (UK) linked to a specialized rheumatology database (CELLMA) to identify Rheumatoid Arthritis (RA) patients. Preliminary results showed that an initial list of 36243 possible predictors could be reduced to fewer than 200 while still obtaining a desirable performance in identifying RA patients. Chi-square and GINI selected combinations of predictors with the highest accuracy and positive predictive values earlier than the other methods.
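As a rough illustration of how four of the named scores could rank a single binary predictor (for example, whether a given code appears in a patient's record), the sketch below uses common textbook formulations: the chi-square statistic of a 2x2 table, and the drop in entropy, Gini or DKM impurity after splitting on the feature. The abstract does not give the paper's exact definitions, so these formulas, the function names and the example counts are all assumptions. The Binomial-distribution method is left out because its definition cannot be inferred from the abstract.

```python
import math

# Impurity functions for a binary class with positive-class proportion p.
# These are common textbook forms, assumed here; the paper may define them differently.
def entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def dkm(p):
    return 2 * math.sqrt(p * (1 - p))

def impurity_gain(impurity, tp, fp, fn, tn):
    """Parent impurity minus the weighted impurity after splitting on the feature.

    tp: cases with the feature,     fp: non-cases with the feature,
    fn: cases without the feature,  tn: non-cases without the feature.
    """
    n = tp + fp + fn + tn
    with_f, without_f = tp + fp, fn + tn
    parent = impurity((tp + fn) / n)
    children = 0.0
    if with_f:
        children += with_f / n * impurity(tp / with_f)
    if without_f:
        children += without_f / n * impurity(fn / without_f)
    return parent - children

def chi_square(tp, fp, fn, tn):
    """Pearson chi-square statistic of the 2x2 feature-by-class table."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:
        return 0.0
    return n * (tp * tn - fp * fn) ** 2 / denom

def score_feature(tp, fp, fn, tn):
    """Return all four scores for one feature (hypothetical helper)."""
    return {
        "chi_square": chi_square(tp, fp, fn, tn),
        "info_gain": impurity_gain(entropy, tp, fp, fn, tn),
        "gini": impurity_gain(gini, tp, fp, fn, tn),
        "dkm": impurity_gain(dkm, tp, fp, fn, tn),
    }

if __name__ == "__main__":
    # Made-up counts for an imbalanced cohort: 100 cases among 10000 patients.
    print(score_feature(tp=60, fp=200, fn=40, tn=9700))
```

In a setting like the one described, each candidate predictor would be scored this way, the list sorted in descending order, and only the top-ranked predictors passed to the classifier.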


Pages: 236-241

Page count: 6
