A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: reliability-density neighbourhood

Cited by: 37
Authors
Aniceto, Natalia [1, 2]
Freitas, Alex A. [3 ]
Bender, Andreas [4 ]
Ghafourian, Taravat [5 ]
Affiliations
[1] Univ Kent, Medway Sch Pharm, Anson Bldg,Cent Ave, Chatham ME4 4TB, Kent, England
[2] Univ Greenwich, Medway Sch Pharm, Anson Bldg,Cent Ave, Chatham ME4 4TB, Kent, England
[3] Univ Kent, Sch Comp, Canterbury CT2 7NF, Kent, England
[4] Univ Cambridge, Dept Chem, Ctr Mol Sci Informat, Lensfield Rd, Cambridge CB2 1EW, England
[5] Univ Sussex, Sch Life Sci, JMS Bldg, Brighton BN1 9QG, E Sussex, England
Keywords
QSAR; Applicability domain; P-gp; Prediction reliability; k-Nearest neighbour; dk-NN; Kernel density estimation; P-glycoprotein; ATTRIBUTE SELECTION; TRAINING SET; UNCERTAINTY; MODELS; CLASSIFICATION; ACCURACY;
DOI
10.1186/s13321-016-0182-y
Chinese Library Classification number
O6 [Chemistry];
Subject classification code
0703;
Abstract
The ability to define the regions of chemical space where a predictive model can be safely used is a necessary condition for assuring the reliability of new predictions. This implies that reliability must be determined across chemical space in order to localize "safe" and "unsafe" regions for prediction. To this end, we devised an applicability domain technique that addresses the data locally instead of handling it as a whole: the reliability-density neighbourhood (RDN). The main novel aspect of this method is that it characterizes each individual training instance according to the density of its neighbourhood in the training set, as well as its individual bias and precision. By scanning through chemical space (by iteratively increasing the applicability domain area), it was observed that new test compounds are successively included in the applicability domain region in an order that correlates strongly with their predictive performance. This allows local reliability to be mapped across different locations in the training set space, and thus allows regions where the model has low reliability to be identified. The method also showed matching profiles between two external sets, which indicates that it performs robustly with new data. Another novel aspect of this technique is that it is paired with a specific feature selection algorithm. Accordingly, the impact of the feature set used was studied: the top 20 features selected by ReliefF yielded the best results, as opposed to using the model's features or the entire feature set, as is commonly done. As the third novel aspect, we propose a new scoring function to help evaluate the quality of an applicability domain profile (i.e., the curve of accuracy vs the applicability domain measure in question). Overall, the RDN proved to be a promising method that can correctly sort new instances according to predictive performance. As a result, this technique can be received by an end-user as proof of concept for the performance of a QSAR model on new data, thus promoting the user's trust in the QSAR output.
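The idea of giving each training instance its own coverage and then widening the domain step by step can be illustrated with a minimal sketch. This is not the authors' RDN implementation. It assumes a simplified rule in which each training instance gets a radius equal to the mean Euclidean distance to its k nearest training neighbours, optionally shrunk by a reliability weight in [0, 1]. The function names, the weighting scheme, and the toy data are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_radii(train, k, reliability=None):
    """Per-instance coverage radius (illustrative simplification).

    The radius of a training instance is the mean distance to its k nearest
    training neighbours. If `reliability` is given (hypothetical weights in
    [0, 1], e.g. derived from local bias and precision), each radius is
    multiplied by the corresponding weight.
    """
    radii = []
    for i, xi in enumerate(train):
        dists = sorted(euclidean(xi, xj) for j, xj in enumerate(train) if j != i)
        radius = sum(dists[:k]) / k
        if reliability is not None:
            radius *= reliability[i]
        radii.append(radius)
    return radii


def inside_domain(query, train, radii):
    """A query is covered if it lies within the radius of any training instance."""
    return any(euclidean(query, xi) <= r for xi, r in zip(train, radii))


def first_covering_k(query, train, max_k, reliability=None):
    """Scan k = 1..max_k and return the first k at which the query is covered.

    Returns None if the query is still outside the domain at max_k.
    """
    for k in range(1, max_k + 1):
        if inside_domain(query, train, local_radii(train, k, reliability)):
            return k
    return None


# Toy 2-D "descriptor space": four training compounds on a unit square.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(first_covering_k((0.5, 0.5), train, max_k=3))  # 1 (central query)
print(first_covering_k((2.1, 1.0), train, max_k=3))  # 3 (peripheral query)
print(first_covering_k((4.0, 4.0), train, max_k=3))  # None (never covered)
```

Under these assumptions, queries that enter the domain at a small k sit in well-covered regions, while queries that enter late or never are the ones a model would be expected to predict less reliably. Lowering the hypothetical reliability weight of the instance at (1.0, 1.0) to 0.5 keeps the peripheral query (2.1, 1.0) outside the domain for every k up to 3.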
Pages: 20