Efficient variable selection batch pruning algorithm for artificial neural networks

Cited by: 8
Authors
Kovalishyn, Vasyl [1]
Poda, Gennady [2, 3]
Affiliations
[1] Inst Bioorgan Chem & Petrochem, Dept Med & Biol Res, UA-02660 Kiev, Ukraine
[2] MaRS Ctr, Ontario Inst Canc Res, Drug Discovery Program, Toronto, ON M5G 0A3, Canada
[3] Univ Toronto, Leslie Dan Fac Pharm, Toronto, ON M5S 3M2, Canada
Keywords
Artificial neural networks (ANN); Associative neural network (ASNN); Batch pruning algorithm (BPA); Chemometrics; k-Nearest neighbors (k-NN); Self-organizing map (SOM) of Kohonen; Machine learning; Variable selection; Volume learning algorithm; Design
DOI
10.1016/j.chemolab.2015.10.005
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Here we report a novel, fast and efficient algorithm for variable selection, the batch pruning algorithm (BPA). The method combines artificial neural network (ANN) ensemble learning with a Kohonen self-organizing map (SOM) that clusters the descriptors, and then selects an optimal smaller subset of descriptors from each cluster based on the calculated sensitivity of the input neurons. BPA was validated on two publicly available, structurally diverse datasets: 584 inhibitors of M. tuberculosis (MTB) growth and 1015 phosphodiesterase type 4 (PDE4) inhibitors. BPA identified a smaller subset of 5% of the molecular descriptors (out of about 1200 calculated with Talete Dragon) 50-100 times faster than conventional stepwise pruning methods (SPM), and yielded QSAR models of similar or slightly better accuracy as measured by Q² (0.73-0.77), RMSE (0.50-0.72) and MAE (0.36-0.57); 97% of compounds were predicted within 1 log unit. Finding the best set of descriptors took only 1.47 h with BPA compared to 119 h with ANN SPM for the MTB dataset, and 3.0 h compared to 237 h for the PDE4 set. Owing to its high predictive accuracy and speed, BPA may find wide applicability in building better machine learning models that predict activity, selectivity, physical and ADMET properties for large datasets with a large number of descriptors within a reasonable time. © 2015 Elsevier B.V. All rights reserved.
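The per-cluster selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the SOM clustering and the ANN ensemble are not reproduced here, so the cluster labels and sensitivity scores are assumed to be given as inputs, and the function name `select_per_cluster`, the `keep_fraction` parameter, and the toy descriptor names are all hypothetical.

```python
import math
from collections import defaultdict


def select_per_cluster(cluster_of, sensitivity, keep_fraction=0.05):
    """Keep the most sensitive descriptors from each cluster.

    cluster_of:    dict mapping descriptor name -> cluster id
                   (assumed to come from a SOM; not computed here)
    sensitivity:   dict mapping descriptor name -> sensitivity score
                   (assumed to come from an ANN ensemble; not computed here)
    keep_fraction: hypothetical share of each cluster to retain;
                   at least one descriptor is kept per cluster
    """
    # Group descriptor names by their cluster id.
    clusters = defaultdict(list)
    for name, cluster_id in cluster_of.items():
        clusters[cluster_id].append(name)

    selected = []
    for cluster_id in sorted(clusters):
        # Rank the cluster's members from most to least sensitive.
        members = sorted(clusters[cluster_id],
                         key=lambda n: sensitivity[n], reverse=True)
        n_keep = max(1, math.ceil(keep_fraction * len(members)))
        selected.extend(members[:n_keep])
    return selected


# Toy data (invented for illustration only).
cluster_of = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1, "d6": 2}
sensitivity = {"d1": 0.10, "d2": 0.80, "d3": 0.30,
               "d4": 0.05, "d5": 0.60, "d6": 0.20}

selected = select_per_cluster(cluster_of, sensitivity, keep_fraction=0.5)
print(selected)
```

Because a whole batch of low-sensitivity descriptors is discarded from every cluster in one pass, far fewer retraining rounds are needed than when descriptors are removed one at a time, which is the intuition behind the speed-up reported in the abstract.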
Pages: 10-16
Number of pages: 7