Random Forest with 200 Selected Features: An Optimal Model for Bioinformatics Research

Cited by: 9
Authors
Wald, Randall [1]
Khoshgoftaar, Taghi [1]
Dittman, David J. [1]
Napolitano, Amri [1]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1 | 2013
Keywords
Bioinformatics; Random Forest; feature selection; GENE-EXPRESSION
DOI
10.1109/ICMLA.2013.34
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many problems in bioinformatics involve high-dimensional, difficult-to-process collections of data. For example, gene microarrays can record the expression levels of thousands of genes, many of which have no relevance to the underlying medical or biological question. Building classification models on such datasets can thus take excessive computational time and still give poor results. Many strategies exist to combat these problems, including feature selection (which chooses only the most relevant genes for building models) and ensemble learners (which combine multiple weak classification learners into one collection which should give a broader view of the data). However, these techniques present a new challenge: choosing which combination of strategies is most appropriate for a given collection of data. This is especially difficult for health informatics and bioinformatics practitioners who do not have an extensive machine learning background. An ideal model should be easy to use and apply, helping the practitioner by either making these choices in advance or by being insensitive to these choices. In this work we demonstrate that the Random Forest learner, when using 100 trees and 200 features (selected by any reasonable feature ranking technique, as the specific choice does not matter), is such a model. To show this, we use 25 bioinformatics datasets from a number of different cancer diagnosis and identification problems, and we compare Random Forest with 5 other learners. We also tested 25 feature ranking techniques and 12 feature subset sizes, to optimize the feature selection step. Our results show that Random Forest with 100 trees and 200 selected features is statistically significantly better than any of the alternatives (or in the case of using 200 features, is statistically equivalent with the top choices), and that the specific choice of ranking technique is statistically insignificant.
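The feature selection step described in the abstract (rank every gene, keep the top 200, then train the learner on the reduced data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the signal-to-noise score is only one example of a ranking technique and is not claimed to be among the 25 the paper tested, the synthetic data stands in for a microarray dataset, and the names `signal_to_noise`, `select_top_k` and `reduce_features` are hypothetical.

```python
import random
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """Score one feature as |mean_1 - mean_0| / (sd_1 + sd_0).

    Assumption: a simple filter-style ranker chosen for illustration only.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    if spread == 0:
        return 0.0  # constant feature: carries no class information
    return abs(mean(pos) - mean(neg)) / spread


def select_top_k(X, y, k=200):
    """Return the column indices of the k highest-scoring features, best first."""
    n_features = len(X[0])
    scores = [signal_to_noise([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def reduce_features(X, selected):
    """Keep only the selected columns of X, in ranked order."""
    return [[row[j] for j in selected] for row in X]


# Synthetic stand-in for a microarray: 60 samples, 1000 "genes",
# of which only the first 10 differ between the two classes.
rng = random.Random(0)
n_samples, n_features, n_informative = 60, 1000, 10
y = [i % 2 for i in range(n_samples)]
X = [
    [rng.gauss(3.0 * label if j < n_informative else 0.0, 1.0)
     for j in range(n_features)]
    for label in y
]

selected = select_top_k(X, y, k=200)     # 200 features, as in the abstract
X_reduced = reduce_features(X, selected)  # input for the downstream learner
```

The reduced matrix would then be passed to a Random Forest with 100 trees, for example scikit-learn's `RandomForestClassifier(n_estimators=100)`, which is an assumption about tooling rather than something the record states. In a real evaluation the ranking should be computed on the training folds only, so that the held-out samples do not influence which features are kept.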
Pages: 154-160
Page count: 7