Random Forest with 200 Selected Features: An Optimal Model for Bioinformatics Research

Cited by: 9
Authors
Wald, Randall [1]
Khoshgoftaar, Taghi [1]
Dittman, David J. [1]
Napolitano, Amri [1]
Affiliation
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1 | 2013
Keywords
Bioinformatics; Random Forest; feature selection; GENE-EXPRESSION;
DOI
10.1109/ICMLA.2013.34
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many problems in bioinformatics involve high-dimensional, difficult-to-process collections of data. For example, gene microarrays can record the expression levels of thousands of genes, many of which have no relevance to the underlying medical or biological question. Building classification models on such datasets can thus take excessive computational time and still give poor results. Many strategies exist to combat these problems, including feature selection (which chooses only the most relevant genes for building models) and ensemble learners (which combine multiple weak classification learners into one collection which should give a broader view of the data). However, these techniques present a new challenge: choosing which combination of strategies is most appropriate for a given collection of data. This is especially difficult for health informatics and bioinformatics practitioners who do not have an extensive machine learning background. An ideal model should be easy to use and apply, helping the practitioner by either making these choices in advance or by being insensitive to these choices. In this work we demonstrate that the Random Forest learner, when using 100 trees and 200 features (selected by any reasonable feature ranking technique, as the specific choice does not matter), is such a model. To show this, we use 25 bioinformatics datasets from a number of different cancer diagnosis and identification problems, and we compare Random Forest with 5 other learners. We also tested 25 feature ranking techniques and 12 feature subset sizes, to optimize the feature selection step. Our results show that Random Forest with 100 trees and 200 selected features is statistically significantly better than any of the alternatives (or in the case of using 200 features, is statistically equivalent with the top choices), and that the specific choice of ranking technique is statistically insignificant.
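The configuration named in the abstract (rank the features, keep the top 200, then train a Random Forest with 100 trees) can be sketched roughly as follows. This is a minimal illustration and not the authors' experimental setup. It assumes scikit-learn, uses a synthetic dataset in place of real microarray data, and picks the ANOVA F-score (`f_classif`) as one arbitrary ranking technique, since the abstract reports that the specific ranker does not matter. The sample count, feature count, and 5-fold cross-validation are likewise assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a microarray dataset (assumption):
# few samples, thousands of features, only a handful informative.
X, y = make_classification(
    n_samples=80,
    n_features=2000,
    n_informative=20,
    n_redundant=0,
    random_state=0,
)

# Rank features and keep the top 200, then fit a 100-tree Random Forest.
# f_classif is an illustrative ranker, not one taken from the paper.
model = Pipeline([
    ("rank", SelectKBest(score_func=f_classif, k=200)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Keeping selection inside the pipeline means features are re-ranked on
# each training fold, so held-out samples never influence the ranking.
scores = cross_val_score(model, X, y, cv=5)

# Fit once on all data to inspect how many features were retained.
model.fit(X, y)
n_selected = int(model.named_steps["rank"].get_support().sum())

print(f"selected features: {n_selected}")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Placing the ranking step inside the pipeline, rather than selecting features once on the full dataset, is a design choice made here to avoid selection bias in the cross-validated estimate. The abstract does not say how the original study handled this.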
Pages: 154-160
Page count: 7
Related papers
50 in total
  • [41] Features processing for random forest optimization in lung nodule localization
    El-Askary, Nada S.
    Salem, Mohammed A. -M.
    Roushdy, Mohamed I.
    EXPERT SYSTEMS WITH APPLICATIONS, 2022, 193
  • [42] A method for modulation recognition based on entropy features and random forest
    Zhang, Zhen
    Li, Yibing
    Zhu, Xiaolei
    Lin, Yun
    2017 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C), 2017, : 243 - 246
  • [43] Prediction of Alzheimer's Using Random Forest with Radiomic Features
    Singh, Anuj
    Kumar, Raman
    Tiwari, Arvind Kumar
    COMPUTER SYSTEMS SCIENCE AND ENGINEERING, 2023, 45 (01): : 513 - 530
  • [44] Predicting optimal parameters with random forest for quantum key distribution
    Ding, Hua-Jian
    Liu, Jing-Yang
    Zhang, Chun-Mei
    Wang, Qin
    QUANTUM INFORMATION PROCESSING, 2020, 19 (02)
  • [45] Predicting optimal parameters with random forest for quantum key distribution
    Hua-Jian Ding
    Jing-Yang Liu
    Chun-Mei Zhang
    Qin Wang
    Quantum Information Processing, 2020, 19
  • [46] ECG Delineation with Randomly Selected Wavelet Feature and Random Forest Classifier
    Fu, Dapeng
    Xia, Zhourui
    Gao, Pengfei
    Wang, Haiqing
    Lin, Jianping
    Sun, Li
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2018, E101D (08): : 2082 - 2091
  • [47] Forest-ORE: Mining an optimal rule ensemble to interpret random forest models
    Haddouchi, Maissae
    Berrado, Abdelaziz
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 143
  • [48] Optimal Feature Selection for Partial Discharge Recognition of Cable Systems Based on the Random Forest Method
    Peng, Xiaosheng
    Yang, Guangyao
    Zheng, Shijie
    Xiong, Lei
    Bai, Junyang
    2016 CHINA INTERNATIONAL CONFERENCE ON ELECTRICITY DISTRIBUTION (CICED), 2016,
  • [49] Random Forest Based Optimal Feature Selection for Partial Discharge Pattern Recognition in HV Cables
    Peng, Xiaosheng
    Li, Jinshu
    Wang, Ganjun
    Wu, Yijiang
    Li, Lee
    Li, Zhaohui
    Bhatti, Ashfaque Ahmed
    Zhou, Chengke
    Hepburn, Donald M.
    Reid, Alistair J.
    Judd, Martin D.
    Siew, Wan Hoon
    IEEE TRANSACTIONS ON POWER DELIVERY, 2019, 34 (04) : 1715 - 1724
  • [50] A Bankruptcy Prediction Model Using Random Forest
    Joshi, Shreya
    Ramesh, Rachana
    Tahsildar, Shagufta
    PROCEEDINGS OF THE 2018 SECOND INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND CONTROL SYSTEMS (ICICCS), 2018, : 1722 - 1727