Combining clustering of variables and feature selection using random forests

Cited by: 29
Authors
Chavent, Marie [1, 2]
Genuer, Robin [3, 4]
Saracco, Jerome [1, 2]
Affiliations
[1] Univ Bordeaux, CNRS, UMR 5251, Inst Math Bordeaux, Talence, France
[2] INRIA Bordeaux Sud Ouest, CQFD Team, Talence, France
[3] Univ Bordeaux, INSERM, U1219, ISPED, 146 Rue Leo Saignat, F-33076 Bordeaux, France
[4] INRIA Bordeaux Sud Ouest, SISTM Team, Talence, France
Keywords
clustering of variables; random forests; supervised classification; variable selection; R package; regression
DOI
10.1080/03610918.2018.1563145
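Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714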
Abstract
Standard approaches to high-dimensional supervised classification often include variable selection and dimension reduction. The proposed methodology combines clustering of variables with feature selection. Hierarchical clustering of variables builds groups of correlated variables and summarizes each group by a synthetic variable. The originality of the approach is that the groups of variables are not known a priori. Moreover, the clustering approach handles both numerical and categorical variables. Among all possible partitions, the most relevant synthetic variables are selected by a procedure based on random forests. Numerical performance is illustrated on simulated and real datasets. Selecting groups of variables makes the results easier to interpret.
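To make the first step of the abstract concrete, the following is a minimal sketch of clustering variables into correlated groups and summarizing each group by one synthetic variable. It is not the authors' implementation. It makes several simplifying assumptions: only numerical variables are handled, the dissimilarity is 1 - r² with average linkage, and the synthetic variable is the mean of sign-aligned standardized variables (standing in for whatever group summary the paper uses). The function names `cluster_variables` and `synthetic_variable` and the toy data are hypothetical. The random-forest selection step is not shown; in a full pipeline the synthetic variables would be passed to a random-forest-based selection procedure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardize(x):
    """Center to mean 0 and scale to unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / sd for a in x]


def cluster_variables(columns, n_groups):
    """Agglomerative clustering of variables (assumed criterion:
    average linkage on the dissimilarity 1 - r^2)."""
    names = list(columns)
    dissim = {(a, b): 1.0 - pearson(columns[a], columns[b]) ** 2
              for a in names for b in names if a != b}

    def linkage(g, h):
        return sum(dissim[a, b] for a in g for b in h) / (len(g) * len(h))

    groups = [[name] for name in names]
    while len(groups) > n_groups:
        pairs = [(i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups))]
        # Merge the two closest groups; j > i, so deleting j keeps i valid.
        i, j = min(pairs, key=lambda p: linkage(groups[p[0]], groups[p[1]]))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


def synthetic_variable(columns, group):
    """Assumed group summary: mean of standardized variables, each
    sign-aligned with the first variable of the group."""
    ref = standardize(columns[group[0]])
    aligned = []
    for name in group:
        col = standardize(columns[name])
        sign = 1.0 if pearson(ref, col) >= 0 else -1.0
        aligned.append([sign * v for v in col])
    return [sum(vals) / len(group) for vals in zip(*aligned)]


# Toy data (hypothetical): two latent factors drive five observed variables.
rng = random.Random(0)
n = 200
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [rng.gauss(0, 1) for _ in range(n)]
columns = {
    "x1": [v + rng.gauss(0, 0.3) for v in f1],
    "x2": [v + rng.gauss(0, 0.3) for v in f1],
    "x3": [v + rng.gauss(0, 0.3) for v in f1],
    "x4": [v + rng.gauss(0, 0.3) for v in f2],
    "x5": [-v + rng.gauss(0, 0.3) for v in f2],  # negatively correlated with x4
}

groups = cluster_variables(columns, n_groups=2)
synthetic = {tuple(g): synthetic_variable(columns, g) for g in groups}
print(groups)
```

Using 1 - r² rather than 1 - r places strongly negatively correlated variables (such as `x4` and `x5` here) in the same group, which is why the summary step aligns signs before averaging.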
Pages: 426-445
Number of pages: 20