Identifying representative trees from ensembles

Cited by: 49
Authors
Banerjee, Mousumi [1]
Ding, Ying [1]
Noone, Anne-Michelle [2]
Affiliations
[1] Univ Michigan, Dept Biostat, Ann Arbor, MI 48109 USA
[2] Georgetown Univ, Dept Biostat Bioinformat & Biomath, Lombardi Comprehens Canc Ctr, Washington, DC 20057 USA
Keywords
bagging; random forest; tree similarity metric; representative trees; out-of-bag error; CLASSIFICATION; RISK; RECURRENCE; MODEL;
DOI
10.1002/sim.4492
Chinese Library Classification
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Tree-based methods have become popular for analyzing complex data structures where the primary goal is risk stratification of patients. Ensemble techniques improve predictive accuracy and address the instability of a single tree by growing an ensemble of trees and aggregating their predictions. However, in the process, the individual trees get lost. In this paper, we propose a methodology for identifying the most representative trees in an ensemble on the basis of several tree distance metrics. Although our focus is on binary outcomes, the methods are applicable to censored data as well. For any two trees, the distance metrics are chosen to (1) measure similarity of the covariates used to split the trees; (2) reflect similar clustering of patients in the terminal nodes of the trees; and (3) measure similarity in predictions from the two trees. Whereas the third metric focuses on prediction, the first two focus on the architectural similarity between two trees. The most representative trees in the ensemble are chosen on the basis of the average distance between a tree and all other trees in the ensemble. An out-of-bag estimate of the error rate is obtained using neighborhoods of representative trees. Simulations and data examples show gains in predictive accuracy when averaging over such neighborhoods. We illustrate our methods using a dataset of kidney cancer treatment receipt (binary outcome) and a second dataset of breast cancer survival (censored outcome). Copyright © 2012 John Wiley & Sons, Ltd.
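The selection rule described in the abstract (pick the tree with the smallest average distance to all other trees) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the three distance functions below are assumptions chosen to match its verbal descriptions: the fraction of covariates used by only one of the two trees, the fraction of patient pairs that share a terminal node in only one of the two trees, and the mean squared difference in predictions. All function names and the toy data are hypothetical, and the out-of-bag step is not covered.

```python
from itertools import combinations


def covariate_distance(vars_a, vars_b, n_covariates):
    """Assumed metric (1): fraction of covariates used to split one tree but not the other."""
    return len(set(vars_a) ^ set(vars_b)) / n_covariates


def clustering_distance(leaves_a, leaves_b):
    """Assumed metric (2): fraction of patient pairs sharing a terminal node in one tree only."""
    pairs = list(combinations(range(len(leaves_a)), 2))
    disagreements = sum(
        (leaves_a[i] == leaves_a[j]) != (leaves_b[i] == leaves_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


def prediction_distance(preds_a, preds_b):
    """Assumed metric (3): mean squared difference between the two trees' predictions."""
    return sum((a - b) ** 2 for a, b in zip(preds_a, preds_b)) / len(preds_a)


def most_representative(items, distance):
    """Return the index of the tree with the smallest average distance to all others."""
    n = len(items)
    scores = [
        sum(distance(items[i], items[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = min(range(n), key=scores.__getitem__)
    return best, scores


# Toy ensemble (made-up numbers): predicted probabilities from three
# hypothetical trees for the same four patients.
toy_predictions = [
    [0.9, 0.8, 0.2, 0.1],
    [0.8, 0.8, 0.2, 0.2],
    [0.1, 0.9, 0.9, 0.1],
]
best_index, avg_distances = most_representative(toy_predictions, prediction_distance)
```

Because `most_representative` takes the distance function as an argument, the same selection step can be reused with any of the three metrics, provided each tree is represented by the matching summary (covariate set, terminal-node labels, or predictions).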
Pages: 1601-1616
Number of pages: 16