Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition

Cited by: 8
Authors
de Azevedo, Luis Cesar [1 ]
Pinheiro, Gabriel A. [2 ]
Quiles, Marcos G. [2 ]
Da Silva, Juarez L. F. [3 ]
Prati, Ronaldo C. [1 ]
Affiliations
[1] Fed Univ ABC, Ctr Math Computat & Cognit, BR-5001 Santo Andre, SP, Brazil
[2] Fed Univ Sao Paulo Unifesp, Inst Sci & Technol, BR-12247014 Sao Jose Dos Campos, SP, Brazil
[3] Univ Sao Paulo, Sao Carlos Inst Chem, BR-13560970 Sao Carlos, SP, Brazil
Funding
Sao Paulo Research Foundation, Brazil;
Keywords
PREDICTION; MOLECULES;
DOI
10.1021/acs.jcim.1c00503
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701 ;
Abstract
Most machine learning applications to quantum-chemistry (QC) data sets rely on a single statistical error parameter, such as the mean square error (MSE), to evaluate their performance. However, this approach has limitations and can even yield incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) many-body tensor representation (MBTR, with two- and three-body terms), and (iii) smooth overlap of atomic positions (SOAP), to evaluate prediction performance for different numbers of molecules in the training samples and the effect of bias and variance on the final MSE. Overall, small sample sizes are associated with higher MSE. Moreover, the bias component strongly influences the larger MSEs. Furthermore, there is little agreement among the molecules with higher errors (outliers) across different descriptors. However, there is a strong association between the intersection set of outliers and the convex hull volume of geometric coordinates (VCH). Based on the distribution of the MSE (and its components, bias and variance) and the appearance of outliers, we suggest using ensembles of low-bias models to minimize the MSE, particularly when the training set contains a small number of molecules.
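The decomposition the abstract refers to can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes synthetic one-dimensional data in place of QM9, a polynomial fit in place of the SF, MBTR, and SOAP descriptors, and noise-free test targets, so that the per-point MSE over repeated training samples equals squared bias plus variance. The names `bias_variance_decomposition` and `run_experiment` are hypothetical.

```python
import numpy as np


def bias_variance_decomposition(predictions, targets):
    """Split the per-point MSE into squared bias and variance.

    predictions: shape (n_models, n_test), one row per model trained
                 on a different training sample.
    targets:     shape (n_test,), assumed noise-free reference values.
    """
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mean_pred = predictions.mean(axis=0)
    bias_sq = (mean_pred - targets) ** 2   # squared bias per test point
    variance = predictions.var(axis=0)     # population variance (ddof=0)
    mse = ((predictions - targets) ** 2).mean(axis=0)
    return mse, bias_sq, variance


def run_experiment(n_train, n_models=200, degree=3, seed=0):
    """Train n_models polynomial fits on independent synthetic samples."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 3.0, 50)
    y_test = np.sin(x_test)                # assumed ground truth (illustrative)
    preds = np.empty((n_models, x_test.size))
    for m in range(n_models):
        x_tr = rng.uniform(0.0, 3.0, n_train)
        y_tr = np.sin(x_tr) + rng.normal(0.0, 0.1, n_train)
        coeffs = np.polyfit(x_tr, y_tr, degree)
        preds[m] = np.polyval(coeffs, x_test)
    return bias_variance_decomposition(preds, y_test)


for n_train in (10, 100):
    mse, bias_sq, variance = run_experiment(n_train)
    print(f"n_train={n_train:4d}  MSE={mse.mean():.5f}  "
          f"bias^2={bias_sq.mean():.5f}  variance={variance.mean():.5f}")
```

Averaging the three arrays over the test points gives one MSE, squared-bias, and variance value per training-set size, which is the kind of comparison the abstract describes across sample sizes.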
Pages: 4210-4223
Number of pages: 14