Comparing feature selection methods for high-dimensional imbalanced data: identifying rheumatoid arthritis cohorts from routine data

Cited by: 0

Authors:

Fernandez-Gutierrez, Fabiola [1]

Kennedy, Jonathan I. [1]

Zhou, Shang-Ming [1]

Cooksey, Roxanne [1]

Atkinson, Mark [1]

Brophy, Sinead [1]

Affiliations:

[1] Swansea Univ, Coll Med, Farr Inst Hlth Informat Res, Swansea, W Glam, Wales

Source:

2015 INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND SYSTEMS MANAGEMENT (IESM) | 2015

Keywords:

data mining; feature selection; rheumatoid arthritis; high-dimensional data; imbalanced data; DISEASE; LEFLUNOMIDE; EFFICACY; SAFETY; DRUG

DOI:

Not available

Chinese Library Classification (CLC) number:

T [Industrial Technology]

Discipline classification code:

08

Abstract:

Linking routine and administrative databases from multiple sources offers an advantageous way to understand chronic diseases, such as arthropathy conditions. Data mining classification algorithms can be a cost-effective approach to identifying cohorts of patients with certain disorders within these complex databases. However, selecting good potential predictors of a given condition from a patient's history across very large health records can be challenging, particularly when prevalence is low, which leads to a high-dimensional, imbalanced data space. A Feature Selection (FS) methodology is proposed to overcome this problem, providing a fast way to find relevant predictors and potentially improving the performance of the classifiers. This study compared the performance of five FS methods (Binomial distribution, Chi-square, Information Gain, GINI and DKM), using as the exemplar a dataset of routine data from the Abertawe Bro Morgannwg University Health Board (UK) linked to a specialist rheumatology database (CELLMA) to identify Rheumatoid Arthritis (RA) patients. Preliminary results showed that an initial list of 36243 possible predictors could be reduced to fewer than 200 while obtaining a desirable performance in identifying RA patients. Chi-square and GINI selected combinations of predictors with the highest accuracy and positive predictive values earlier than the other methods.
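
To make the filter-style scoring named in the abstract more concrete, the following is a minimal sketch of how Chi-square, Information Gain, GINI and DKM scores might be computed for a single binary predictor against a binary RA label. It assumes the common textbook formulations (a 2x2 contingency table, and impurity reduction with entropy, Gini and DKM impurities); the abstract does not give the authors' exact definitions, the Binomial distribution method is omitted because it is not described, and the names `impurity_reduction`, `rank_features`, `code_a` and `code_b` are hypothetical.

```python
import math


def impurity(p, kind):
    """Impurity of a binary node whose positive-class proportion is p."""
    q = 1.0 - p
    if kind == "entropy":  # used for Information Gain
        return -sum(x * math.log2(x) for x in (p, q) if x > 0)
    if kind == "gini":
        return 2.0 * p * q
    if kind == "dkm":
        return 2.0 * math.sqrt(p * q)
    raise ValueError(f"unknown impurity: {kind}")


def impurity_reduction(tp, fp, fn, tn, kind):
    """Drop in impurity when the cohort is split on a binary predictor.

    tp: cases with the predictor      fp: non-cases with the predictor
    fn: cases without the predictor   tn: non-cases without the predictor
    """
    n = tp + fp + fn + tn
    n_with, n_without = tp + fp, fn + tn
    parent = impurity((tp + fn) / n, kind)
    children = 0.0
    if n_with:
        children += n_with / n * impurity(tp / n_with, kind)
    if n_without:
        children += n_without / n * impurity(fn / n_without, kind)
    return parent - children


def chi_square(tp, fp, fn, tn):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0


def rank_features(tables, score):
    """Order predictor names from highest to lowest score."""
    return sorted(tables, key=lambda name: score(*tables[name]), reverse=True)


# Hypothetical counts (tp, fp, fn, tn) for an imbalanced cohort of 1000
# patients with 50 cases; these numbers are illustrative, not from the study.
tables = {
    "code_a": (40, 10, 10, 940),   # concentrated among cases
    "code_b": (25, 475, 25, 475),  # same rate in cases and non-cases
}

ranking = rank_features(tables, chi_square)
```

In this sketch a predictor whose rate is identical in cases and non-cases scores zero under all four criteria, so it would be ranked last and dropped first when the candidate list is cut down.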

Pages: 236-241

Page count: 6
