A Frequency-Based Gene Selection Method with Random Forests for Gene Data Analysis

Cited by: 0
Authors
Thanh Trinh [1]
Wu, DingMing [1]
Salloum, Salman [1]
Tung Nguyen [2]
Huang, Joshua Zhexue [1]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
[2] Thuyloi Univ, Fac Comp Sci & Engn, Hanoi, Vietnam
Source
2016 IEEE RIVF INTERNATIONAL CONFERENCE ON COMPUTING & COMMUNICATION TECHNOLOGIES, RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF) | 2016
Keywords
Classification; gene selection; random forest; symmetrical uncertainty; CLASSIFICATION; PREDICTION; PATTERNS; CANCER; TUMOR;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Gene selection is an important step in the analysis of gene data sets in which the number of genes greatly exceeds the number of samples. In this paper, we propose a new method that uses a random forest model to select genes from high-dimensional gene data sets. In this method, Breiman's random forest algorithm is first used to generate a random forest model from a high-dimensional data set. Then, the features appearing in the component tree models of the random forest are analyzed using measures of feature correlation. The features are divided into two sets: those appearing in the roots of component trees and those appearing in other nodes of the trees. The frequency of each feature is calculated, and the features whose frequency is greater than given thresholds are selected as candidates. Finally, the correlation of each candidate feature with the class feature is measured with symmetrical uncertainty, and the top features (those with the highest symmetrical uncertainty values) are selected. Nineteen gene data sets were used to evaluate the new gene selection method. The comparison results show that models built with the gene features selected by the new method outperformed other random forest models in classification accuracy.
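The selection steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: each tree is given in a hypothetical format, a pair of the root's feature index and a list of the feature indices used at its other nodes; frequency is a raw count across trees; candidates from the two sets are pooled; and feature values are already discretized so that entropies can be estimated by counting. Symmetrical uncertainty is computed with its standard definition, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The names `select_genes`, `root_threshold`, `node_threshold`, and `top_k` are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); returns 0.0 if both are constant."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    # Mutual information via I(X; Y) = H(X) + H(Y) - H(X, Y).
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def select_genes(trees, X, y, root_threshold, node_threshold, top_k):
    """Frequency filter followed by symmetrical-uncertainty ranking.

    trees: list of (root_feature, other_node_features) pairs (assumed format).
    X: list of samples, each a list of discretized feature values.
    y: list of class labels, one per sample.
    """
    # Count how often each feature appears at a root and at other nodes.
    root_freq = Counter(root for root, _ in trees)
    node_freq = Counter(f for _, others in trees for f in others)

    # Assumption: candidates from the two sets are pooled into one set.
    candidates = {f for f, c in root_freq.items() if c > root_threshold}
    candidates |= {f for f, c in node_freq.items() if c > node_threshold}

    # Rank candidates by their correlation with the class feature.
    scores = {
        f: symmetrical_uncertainty([row[f] for row in X], y)
        for f in candidates
    }
    # Ties are broken by feature index to keep the result deterministic.
    return sorted(candidates, key=lambda f: (-scores[f], f))[:top_k]
```

With a fitted forest from a real library, the `trees` list would have to be extracted from the fitted trees first. Real gene expression values are continuous, so a discretization step would be needed before the symmetrical uncertainty calculation.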
Pages: 193-198
Page count: 6