Support vector machine regression (LS-SVM)-an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data?

Cited by: 171
Authors
Balabin, Roman M. [1]
Lomakina, Ekaterina I. [2]
Affiliations
[1] ETH, Dept Chem & Appl Biosci, CH-8093 Zurich, Switzerland
[2] ETH, Dept Comp Sci, CH-8093 Zurich, Switzerland
Keywords
NEAR-INFRARED SPECTROSCOPY; COMBINED 1ST-PRINCIPLES CALCULATION; ALKANES RAMAN-SPECTROSCOPY; POTENTIAL-ENERGY SURFACES; DENSITY-FUNCTIONAL THEORY; NIR SPECTROSCOPY; N-PENTANE; GASOLINE CLASSIFICATION; ENTHALPY DIFFERENCE; BASE STOCK;
DOI
10.1039/c1cp00051a
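Chinese Library Classification (CLC) number
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Discipline classification code
070304; 081704;
Abstract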
A multilayer feed-forward artificial neural network (MLP-ANN) with a single hidden layer that contains a finite number of neurons can be regarded as a universal non-linear approximator. Today, the ANN method and the multiple linear regression (MLR) model are widely used in quantum chemistry (QC) data analysis (e.g., thermochemistry) to improve the accuracy of theoretical methods (e.g., Gaussian G2-G4, B3LYP/B3-LYP, X1, or W1). In this study, an alternative approach based on support vector machines (SVMs) is used: least squares support vector machine (LS-SVM) regression. It has been applied to ab initio (first-principles) and density functional theory (DFT) quantum chemistry data, so the QC + SVM methodology is an alternative to the QC + ANN one. The task of the study was to estimate Møller-Plesset (MPn) or DFT (B3LYP, BLYP, BMK) energies calculated with large basis sets (e.g., 6-311G(3df,3pd)) using smaller ones (6-311G, 6-311G*, 6-311G**) plus molecular descriptors. A molecular set (BRM-208) containing a total of 208 organic molecules was constructed and used for LS-SVM training, cross-validation, and testing. The MP2, MP3, MP4(DQ), MP4(SDQ), and MP4/MP4(SDTQ) ab initio methods were tested. Hartree-Fock (HF/SCF) results were also reported for comparison. Furthermore, constitutional (CD: total number of atoms and mole fractions of different atoms) and quantum-chemical (QD: HOMO-LUMO gap, dipole moment, average polarizability, and quadrupole moment) molecular descriptors were used to build the LS-SVM calibration model. Prediction accuracies (MADs) of 1.62 ± 0.51 and 0.85 ± 0.24 kcal mol⁻¹ (1 kcal mol⁻¹ = 4.184 kJ mol⁻¹) were reached for SVM-based approximations of ab initio and DFT energies, respectively. The LS-SVM model was more accurate than the MLR model. A comparison with the artificial neural network approach shows that the accuracy of the LS-SVM method is similar to that of the ANN. The extrapolation and interpolation results show that LS-SVM is superior by almost an order of magnitude over the ANN method in terms of the stability, generality, and robustness of the final model. The LS-SVM model needs a much smaller number of samples (a much smaller sample set) to make accurate predictions. Potential energy surface (PES) approximations for molecular dynamics (MD) studies are discussed as a promising application of the LS-SVM calibration approach.
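As an illustration of the LS-SVM regression the abstract refers to, the following is a minimal sketch of the standard LS-SVM formulation with an RBF kernel, in which training reduces to solving one linear system. It is not the authors' code and does not use the BRM-208 set: the data are synthetic, and the feature meanings, the function names (`rbf_kernel`, `lssvm_fit`, `lssvm_predict`), and the hyperparameter values `gamma` and `sigma` are assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Solve the LS-SVM system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i * k(x, x_i) + b for each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Synthetic stand-in data (hypothetical): column 0 plays the role of a scaled
# small-basis energy, column 1 the role of a scaled molecular descriptor.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 2))
y = X[:, 0] + 0.3 * np.sin(3.0 * X[:, 1]) + 0.01 * rng.normal(size=60)

gamma, sigma = 100.0, 1.0  # assumed values; in practice tuned by cross-validation
b, alpha = lssvm_fit(X, y, gamma=gamma, sigma=sigma)
pred_train = lssvm_predict(X, b, alpha, X, sigma=sigma)
mad_train = np.mean(np.abs(pred_train - y))  # mean absolute deviation on training data
```

Unlike a standard SVM, every training sample receives a nonzero dual weight here, and the regularization parameter `gamma` directly ties each training residual to its weight (`y_i - f(x_i) = alpha_i / gamma`).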
Pages: 11710-11718
Page count: 9