Random Forest with 200 Selected Features: An Optimal Model for Bioinformatics Research

Cited: 9
Authors
Wald, Randall [1 ]
Khoshgoftaar, Taghi [1 ]
Dittman, David J. [1 ]
Napolitano, Amri [1 ]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1, 2013
Keywords
Bioinformatics; Random Forest; feature selection; GENE-EXPRESSION;
DOI
10.1109/ICMLA.2013.34
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Many problems in bioinformatics involve high-dimensional, difficult-to-process collections of data. For example, gene microarrays can record the expression levels of thousands of genes, many of which have no relevance to the underlying medical or biological question. Building classification models on such datasets can thus take excessive computational time and still give poor results. Many strategies exist to combat these problems, including feature selection (which chooses only the most relevant genes for building models) and ensemble learners (which combine multiple weak classifiers into one collection that should give a broader view of the data). However, these techniques present a new challenge: choosing which combination of strategies is most appropriate for a given collection of data. This is especially difficult for health informatics and bioinformatics practitioners who do not have an extensive machine learning background. An ideal model should be easy to use and apply, helping the practitioner either by making these choices in advance or by being insensitive to them. In this work we demonstrate that the Random Forest learner, when using 100 trees and 200 features (selected by any reasonable feature ranking technique, as the specific choice does not matter), is such a model. To show this, we used 25 bioinformatics datasets drawn from a number of different cancer diagnosis and identification problems, and we compared Random Forest with 5 other learners. We also tested 25 feature ranking techniques and 12 feature subset sizes to optimize the feature selection step. Our results show that Random Forest with 100 trees and 200 selected features is statistically significantly better than the alternatives (or, in the case of the 200-feature subset size, statistically equivalent to the top choices), and that the specific choice of ranking technique is not statistically significant.
Pages: 154-160
Page count: 7
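
To make the recommended configuration concrete, the sketch below shows one way to assemble it with scikit-learn: rank the features with a univariate filter, keep the top 200, and train a 100-tree Random Forest. This is a minimal illustration, not the authors' experimental pipeline; the ANOVA F-score ranker (f_classif), the synthetic stand-in dataset, and the cross-validated AUC metric are assumptions made here for the example, while the paper's finding is that any reasonable ranking technique performs equivalently at this subset size.

```python
# Minimal sketch (not the paper's exact setup): top-200 feature ranking
# followed by a 100-tree Random Forest, evaluated with cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a high-dimensional microarray dataset:
# few samples, thousands of features, most of them uninformative.
X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=50, random_state=0)

model = Pipeline([
    # Rank features with a univariate ANOVA F-score filter and keep the top 200
    # (an illustrative choice of ranker; the paper reports the ranker does not matter).
    ("rank", SelectKBest(score_func=f_classif, k=200)),
    # Random Forest with 100 trees, the configuration recommended in the paper.
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Cross-validated AUC; because selection sits inside the pipeline, the
# ranking is refit on each training fold and never sees the held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Mean AUC: %.3f" % scores.mean())
```

Putting the selector inside the Pipeline is deliberate: it keeps the feature ranking within each cross-validation training fold, avoiding the selection bias that arises when features are chosen on the full dataset before evaluation.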