Automatic feature subset selection for decision tree-based ensemble methods in the prediction of bioactivity

Cited by: 40
Authors
Cao, Dong-Sheng [1]
Xu, Qing-Song [2]
Liang, Yi-Zeng [1]
Chen, Xian [1]
Li, Hong-Dong [1]
Affiliations
[1] Cent South Univ, Res Ctr Modernizat Tradit Chinese Med, Changsha 410083, Peoples R China
[2] Cent South Univ, Sch Math Sci & Comp Technol, Changsha 410083, Peoples R China
Keywords
Feature selection; Bagging; Boosting; Random Forest (RF); Classification and Regression Tree (CART); Ensemble learning; QSAR MODELS; COMPOUND CLASSIFICATION; RANDOM FOREST; REGRESSION; INHIBITORS; QSPR; TOOL
DOI
10.1016/j.chemolab.2010.06.008
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
In structure-activity relationship (SAR) studies, a learning algorithm usually faces the problem of selecting a compact subset of descriptors related to the property of interest while ignoring the rest. This paper presents a new method of molecular descriptor selection that couples three commonly used decision tree (DT)-based ensemble methods with a backward elimination strategy (BES). The proposed method eliminates descriptor redundancy automatically and searches for a more compact descriptor subset tailored to DT-based ensemble methods. Six real SAR datasets related to different categorical bioactivities of compounds are used to evaluate the proposed method. The results indicate that DT-based ensemble methods coupled with BES, especially the boosting tree model, yield better classification performance for compounds related to ADMET. (C) 2010 Elsevier B.V. All rights reserved.
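The abstract does not spell out the elimination procedure, so the following is only a minimal sketch of a generic greedy backward elimination loop, not the authors' implementation. The function `backward_elimination`, the callables `score` and `importance`, and the descriptor names `logP`, `TPSA`, `noise1`, `noise2` are all hypothetical. In practice `score` might be the cross-validated accuracy of a bagging, boosting, or random forest model and `importance` that model's variable importance; here both are replaced by toy stand-ins so the example runs with the standard library alone.

```python
from typing import Callable, List, Sequence, Tuple


def backward_elimination(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    importance: Callable[[List[str]], List[float]],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Greedy backward elimination (hypothetical helper, not the paper's code).

    Starting from all features, repeatedly drop the least important one and
    remember the subset with the best score. Ties go to the smaller subset.
    """
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > min_features:
        weights = importance(current)
        # Remove the feature that is currently ranked lowest.
        drop = min(range(len(current)), key=weights.__getitem__)
        del current[drop]
        s = score(current)
        if s >= best_score:  # ">=" prefers the more compact subset on ties
            best_subset, best_score = list(current), s
    return best_subset, best_score


# Toy stand-ins for a cross-validated ensemble model (illustrative only).
USEFUL = {"logP": 0.5, "TPSA": 0.3}


def toy_importance(subset: List[str]) -> List[float]:
    # Assumed importance: informative descriptors get a fixed weight, others 0.
    return [USEFUL.get(name, 0.0) for name in subset]


def toy_score(subset: List[str]) -> float:
    # Assumed score: reward informative descriptors, penalize redundant ones.
    signal = sum(USEFUL.get(name, 0.0) for name in subset)
    noise_penalty = 0.01 * sum(name not in USEFUL for name in subset)
    return signal - noise_penalty


selected, cv_score = backward_elimination(
    ["logP", "noise1", "TPSA", "noise2"], toy_score, toy_importance
)
print(selected, round(cv_score, 2))
```

Passing the model in as two callables keeps the loop independent of any particular ensemble method, so the same sketch could be reused with each of the three DT-based learners the abstract mentions.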
Pages: 129-136
Page count: 8