Random Forest with 200 Selected Features: An Optimal Model for Bioinformatics Research

Cited by: 9
Authors
Wald, Randall [1]
Khoshgoftaar, Taghi [1]
Dittman, David J. [1]
Napolitano, Amri [1]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1, 2013
Keywords
Bioinformatics; Random Forest; feature selection; gene expression
DOI
10.1109/ICMLA.2013.34
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many problems in bioinformatics involve high-dimensional, difficult-to-process collections of data. For example, gene microarrays can record the expression levels of thousands of genes, many of which have no relevance to the underlying medical or biological question. Building classification models on such datasets can thus take excessive computational time and still give poor results. Many strategies exist to combat these problems, including feature selection (which chooses only the most relevant genes for building models) and ensemble learners (which combine multiple weak classifiers into a single ensemble that gives a broader view of the data). However, these techniques present a new challenge: choosing which combination of strategies is most appropriate for a given collection of data. This is especially difficult for health informatics and bioinformatics practitioners who do not have an extensive machine learning background. An ideal model should be easy to use and apply, helping the practitioner either by making these choices in advance or by being insensitive to them. In this work we demonstrate that the Random Forest learner, when using 100 trees and 200 features (selected by any reasonable feature ranking technique, as the specific choice does not matter), is such a model. To show this, we use 25 bioinformatics datasets from a number of different cancer diagnosis and identification problems, and we compare Random Forest with 5 other learners. We also test 25 feature ranking techniques and 12 feature subset sizes to optimize the feature selection step. Our results show that Random Forest with 100 trees and 200 selected features is statistically significantly better than any of the alternatives (or, in the case of the 200-feature subset size, statistically equivalent to the top choices), and that the specific choice of ranking technique has no statistically significant effect.
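
The recommended configuration can be illustrated with standard tools. The following is a minimal sketch, not the authors' code: it assumes scikit-learn, a synthetic stand-in for a high-dimensional microarray dataset, and an ANOVA F-score ranker as just one example of the "reasonable" feature ranking techniques the abstract refers to. Feature selection and the 100-tree Random Forest are wrapped in a single pipeline so the top 200 features are re-selected inside each cross-validation fold, avoiding selection bias.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a microarray dataset: few samples, thousands of
# features, most of them uninformative (not one of the paper's 25 datasets).
X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=50, random_state=0)

model = Pipeline([
    # Rank features and keep the top 200; the paper reports that the specific
    # ranking technique matters little, so the ANOVA F-score is one example.
    ("select", SelectKBest(score_func=f_classif, k=200)),
    # Random Forest with 100 trees, the configuration recommended in the paper.
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Because selection happens inside the pipeline, each fold ranks features
# using only its own training data.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC across folds: {scores.mean():.3f}")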
Pages: 154 - 160
Number of pages: 7