Feature Extraction Methods in Quantitative Structure-Activity Relationship Modeling: A Comparative Study

Cited by: 23
Authors
Alsenan, Shrooq A. [1]
Al-Turaiki, Isra M. [2]
Hafez, Alaaeldin M. [3]
Affiliations
[1] Princess Nourah Bint Abdulrahman Univ, Coll Comp & Informat Sci, Res Ctr, Riyadh 11671, Saudi Arabia
[2] King Saud Univ, Dept Informat Technol, Coll Comp & Informat Sci, Riyadh 11451, Saudi Arabia
[3] King Saud Univ, Dept Informat Syst, Coll Comp & Informat Sci, Riyadh 11451, Saudi Arabia
Keywords
Autoencoder; blood-brain barrier (BBB) permeability; deep generalized autoencoder (dGAE); dimensionality reduction; feature extraction; Gaussian random projection; principal component analysis; quantitative structure-activity relation (QSAR); sparse random projection; NONLINEAR DIMENSIONALITY REDUCTION; FEATURE-SELECTION; VARIABLE SELECTION; PLS ANALYSIS; DISCRIMINANT; PREDICTION; QSAR; CLASSIFICATION; AUTOENCODER; ACCURACY;
DOI
10.1109/ACCESS.2020.2990375
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Computational approaches for synthesizing new chemical compounds have resulted in a major explosion of chemical data in the field of drug discovery. The quantitative structure-activity relationship (QSAR) is a widely used classification and regression method for representing the relationship between a chemical structure and its activities. This research focuses on the effect of dimensionality-reduction techniques on a high-dimensional QSAR dataset. Because of the multi-dimensional nature of QSAR, dimensionality-reduction techniques have become an integral part of its modeling process. Principal component analysis (PCA) is a feature-extraction technique with several applications in exploratory data analysis, visualization, and dimensionality reduction. However, linear PCA is inadequate to handle the complex structure of QSAR data. In light of the wide array of current feature-extraction techniques, we perform a comparative empirical study to investigate five feature-extraction techniques: PCA, kernel PCA, deep generalized autoencoder (dGAE), Gaussian random projection (GRP), and sparse random projection (SRP). The experiments are performed on a high-dimensional QSAR dataset, which comprises 6394 features. The transformed low-dimensional dataset is fed into a deep learning classification model to predict a QSAR biological activity. Three approaches are adopted to validate and measure the proposed techniques: (i) comparing the performance of the classification models, (ii) visualizing the relationship (correlation) between features in the low-dimensional Euclidean space, and (iii) validating the proposed techniques using an external dataset. To the best of our knowledge, this study is the first to investigate and compare the aforementioned feature-extraction techniques in a QSAR modeling context. The results obtained provide invaluable insights regarding the behavior of different techniques with both negative and positive classes. With linear PCA as a baseline, we prove that the investigated techniques substantially outperform the baseline in multiple accuracy measures and demonstrate useful ways of extracting significant features.
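Two of the five techniques named in the abstract, Gaussian and sparse random projection, are simple enough to sketch directly. The following is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy dimensions (500 features reduced to 200 components rather than the 6394-feature dataset), and the default density of 1/sqrt(n_features) are all assumptions made for the example.

```python
import math
import random


def gaussian_projection_matrix(n_features, n_components, rng):
    """Dense projection matrix with entries drawn from N(0, 1 / n_components)."""
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, scale) for _ in range(n_features)]
            for _ in range(n_components)]


def sparse_projection_matrix(n_features, n_components, rng, density=None):
    """Sparse projection matrix with entries in {-v, 0, +v}.

    Each entry is nonzero with probability `density`, with equal chance of
    either sign, and v = sqrt(1 / (density * n_components)) so that every
    entry has variance 1 / n_components, as in the Gaussian case.
    """
    if density is None:
        density = 1.0 / math.sqrt(n_features)  # assumed default for this sketch
    v = math.sqrt(1.0 / (density * n_components))
    rows = []
    for _ in range(n_components):
        row = []
        for _ in range(n_features):
            u = rng.random()
            if u < density / 2:
                row.append(-v)
            elif u < density:
                row.append(v)
            else:
                row.append(0.0)
        rows.append(row)
    return rows


def project(matrix, x):
    """Map a feature vector x to the low-dimensional space (matrix times x)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


if __name__ == "__main__":
    rng = random.Random(0)
    n_features, n_components = 500, 200  # illustrative sizes only
    x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    original = distance(x, y)
    for name, build in [("GRP", gaussian_projection_matrix),
                        ("SRP", sparse_projection_matrix)]:
        m = build(n_features, n_components, rng)
        reduced = distance(project(m, x), project(m, y))
        print(name, "distance ratio (projected / original):", reduced / original)
```

Both matrices are built so that pairwise Euclidean distances are preserved in expectation, which is why the printed ratios should fall near 1. The sparse variant trades a little extra variance for a matrix that is mostly zeros and therefore cheaper to apply to wide descriptor tables.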
Pages: 78737-78752
Page count: 16