Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data

Cited by: 71
Authors
Li, Hongjian [1, 2]
Peng, Jiangjun [3]
Sidorov, Pavel [4]
Leung, Yee [5]
Leung, Kwong-Sak [5, 6]
Wong, Man-Hon [6]
Lu, Gang [2]
Ballester, Pedro J. [4]
Affiliations
[1] SDIVF R&D Ctr, Sha Tin, Hong Kong Sci Pk, Hong Kong, Peoples R China
[2] Chinese Univ Hong Kong, Genet Sch Biomed Sci, CUHK SDU Joint Reprod Genet, Sha Tin, Hong Kong, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Math & Stat, Xian, Shaanxi, Peoples R China
[4] Aix Marseille Univ, CNRS, Inst Paoli Calmettes, INSERM,Canc Res Ctr Marseille, F-13009 Marseille, France
[5] Chinese Univ Hong Kong, Inst Future Cities, Sha Tin, Hong Kong, Peoples R China
[6] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Sha Tin, Hong Kong, Peoples R China
Keywords
BINDING-AFFINITY PREDICTION; RANDOM FOREST; PROTEIN; ACCURACY; VALIDATION; NNSCORE;
DOI
10.1093/bioinformatics/btz183
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Motivation: Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes.

Results: We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes that are dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of the SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing.
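
The abstract does not spell out how the similarity-dependent training sets were built, so the following is only a minimal sketch of one plausible reading: each training complex is scored by its highest similarity to any test complex, and nested training sets are formed by thresholding that score. The function names, the cutoff values, and the toy similarity table are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence


def max_similarity_to_test(
    train_ids: Sequence[str],
    test_ids: Sequence[str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Map each training complex to its highest similarity to any test complex."""
    return {t: max(similarity(t, s) for s in test_ids) for t in train_ids}


def nested_training_sets(
    max_sim: Dict[str, float],
    cutoffs: Sequence[float],
    dissimilar_first: bool = True,
) -> Dict[float, List[str]]:
    """Build one training set per cutoff (assumed protocol, not the paper's).

    dissimilar_first=True keeps complexes with max similarity <= cutoff, so
    raising the cutoff adds progressively more similar complexes.
    dissimilar_first=False keeps complexes with max similarity > cutoff, so
    lowering the cutoff adds progressively more dissimilar complexes.
    """
    sets: Dict[float, List[str]] = {}
    for c in cutoffs:
        if dissimilar_first:
            chosen = [t for t, s in max_sim.items() if s <= c]
        else:
            chosen = [t for t, s in max_sim.items() if s > c]
        sets[c] = sorted(chosen)
    return sets


# Hypothetical toy data: four training complexes, two test complexes.
TOY_SIM = {
    ("a", "x"): 0.10, ("a", "y"): 0.20,
    ("b", "x"): 0.50, ("b", "y"): 0.30,
    ("c", "x"): 0.90, ("c", "y"): 0.40,
    ("d", "x"): 0.95, ("d", "y"): 1.00,
}


def toy_similarity(train_id: str, test_id: str) -> float:
    """Look up a precomputed similarity (stand-in for a real metric)."""
    return TOY_SIM[(train_id, test_id)]


max_sim = max_similarity_to_test(["a", "b", "c", "d"], ["x", "y"], toy_similarity)
dissimilar_sets = nested_training_sets(max_sim, [0.3, 0.6, 1.0])
similar_sets = nested_training_sets(max_sim, [0.3, 0.6, 1.0], dissimilar_first=False)
```

In a study of this kind, each scoring function would then be refit on every set in `dissimilar_sets` (or `similar_sets`) and evaluated on the fixed test set, so that accuracy can be plotted against the cutoff. Any of the three similarity types mentioned in the abstract (protein structure, protein sequence, ligand structure) could stand in for `toy_similarity`.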
Pages: 3989-3995
Page count: 7