Boruta - A System for Feature Selection

Cited by: 522
Authors
Kursa, Miron B. [1]
Jankowski, Aleksander [1]
Rudnicki, Witold R. [1]
Affiliations
[1] Univ Warsaw, ICM, Warsaw, Poland
Keywords
RANDOM FOREST; CLASSIFICATION
DOI
10.3233/FI-2010-288
Chinese Library Classification
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Machine learning methods are often used to classify objects described by hundreds of attributes; in many such applications a large fraction of the attributes may be entirely irrelevant to the classification problem. Moreover, one usually cannot decide a priori which attributes are relevant. In this paper we present an improved version of an algorithm for identifying the full set of truly important variables in an information system. It is an extension of the random forest method that utilises the importance measure generated by the original algorithm. It iteratively compares the importances of the original attributes with the importances of their randomised copies. We analyse the performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem: the identification of sequence motifs that are important for the aptameric activity of short RNA sequences.
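The comparison of original attributes against their randomised copies can be illustrated with a minimal sketch. This is not the authors' implementation: to stay dependency-free it swaps the random forest importance for the absolute Pearson correlation with the class label, and it only counts how often each attribute beats the best randomised copy, with no statistical test. The names `abs_correlation`, `count_shadow_hits`, and `make_demo_data`, as well as the max-of-copies threshold, are assumptions made for illustration.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance
    measure (assumption: the paper uses random forest importance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def count_shadow_hits(columns, labels, n_iter=50, seed=0):
    """Count, per attribute, the iterations in which its importance exceeds
    the highest importance among the shuffled (randomised) copies."""
    rng = random.Random(seed)
    # The stand-in measure is deterministic, so real scores are computed once;
    # a random forest would be refitted in every iteration instead.
    real_scores = {name: abs_correlation(values, labels)
                   for name, values in columns.items()}
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)  # shuffling breaks any link to the labels
            shadow_scores.append(abs_correlation(shadow, labels))
        threshold = max(shadow_scores)
        for name, score in real_scores.items():
            if score > threshold:
                hits[name] += 1
    return hits


def make_demo_data(n_samples=200, seed=1):
    """Hypothetical synthetic data: one informative and two irrelevant attributes."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    columns = {
        "signal": [y + rng.gauss(0.0, 0.5) for y in labels],
        "noise_a": [rng.gauss(0.0, 1.0) for _ in range(n_samples)],
        "noise_b": [rng.gauss(0.0, 1.0) for _ in range(n_samples)],
    }
    return columns, labels


if __name__ == "__main__":
    columns, labels = make_demo_data()
    print(count_shadow_hits(columns, labels))
```

An attribute that beats the randomised copies in nearly every iteration is a candidate for being relevant, while one that wins only occasionally behaves like noise. Turning these counts into accept or reject decisions would need a significance test, which this sketch leaves out.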
Pages: 271-286
Number of pages: 16