Descriptor generation from Morgan fingerprint using persistent homology

Cited by: 7
Authors
Ehiro, T. [1]
Affiliations
[1] Osaka Res Inst Ind Sci & Technol, Res Div Polymer Funct Mat, Izumi, Osaka, Japan
Keywords
Cheminformatics; molecular descriptor; Morgan fingerprint; persistent homology; topological data analysis; regression
DOI
10.1080/1062936X.2023.2301327
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
In cheminformatics, molecular fingerprints (FPs) are used in tasks such as regression and classification. However, predictive models often underutilize the Morgan FP in regression and related machine learning tasks. This study introduced descriptors derived from reshaped Morgan FPs using persistent homology to improve predictive accuracy. On the solvation free energy (FreeSolv) and water solubility (ESOL) datasets, adding persistent homology descriptors improved predictive accuracy over using Morgan FPs alone. Notably, generating descriptors from the first-order persistence diagram (PD1) gave larger improvements than generating them from the zeroth-order persistence diagram (PD0). Combining 4096-bit Morgan FPs with PD1-generated descriptors increased the average coefficient of determination in Gaussian process regression from 0.597 to 0.667 for FreeSolv and from 0.629 to 0.654 for ESOL. Adjusting the grid size parameter during PD-based descriptor generation is crucial: finer grids, especially with PD0, generate more descriptors but reduce predictive accuracy. Coarsening the grid or applying principal component analysis (PCA) mitigates overfitting and improves accuracy. When descriptors were generated from Morgan FPs with randomly shuffled bit positions, coarsening the grid and/or applying PCA achieved accuracy improvements similar to those obtained with the persistent homology of the original Morgan FPs.
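The abstract's point about grid size can be illustrated with a minimal sketch. It assumes a square reshape of the bit vector and a plain count-per-cell binning of (birth, death) pairs; the abstract does not specify the reshape, the persistent homology computation, or the exact descriptor scheme, so the function names `reshape_bits` and `grid_descriptors`, the value range, and the example diagram points are all hypothetical and are not results from the paper.

```python
from math import isqrt


def reshape_bits(bits, n_cols=None):
    """Reshape a flat bit vector into a list of rows (square by default)."""
    n = len(bits)
    if n_cols is None:
        n_cols = isqrt(n)  # assumption: square layout, e.g. 4096 -> 64 x 64
    if n_cols <= 0 or n % n_cols != 0:
        raise ValueError("bit length must be divisible by n_cols")
    return [bits[i:i + n_cols] for i in range(0, n, n_cols)]


def grid_descriptors(pairs, lo, hi, n_bins):
    """Count (birth, death) pairs per cell of an n_bins x n_bins grid.

    The grid covers [lo, hi] on both axes; the result is flattened
    row-major (row = death bin, column = birth bin). Pairs outside the
    range are ignored.
    """
    width = (hi - lo) / n_bins
    counts = [0] * (n_bins * n_bins)
    for birth, death in pairs:
        if not (lo <= birth <= hi and lo <= death <= hi):
            continue
        col = min(int((birth - lo) / width), n_bins - 1)
        row = min(int((death - lo) / width), n_bins - 1)
        counts[row * n_bins + col] += 1
    return counts


# Placeholder 4096-bit vector (not a real fingerprint).
bits = [0, 1] * 2048
image = reshape_bits(bits)

# Hypothetical persistence diagram points, for illustration only.
pairs = [(-2.0, -1.0), (-1.0, 1.0), (0.0, 2.0), (-1.5, -0.5)]

coarse = grid_descriptors(pairs, -3.0, 3.0, 2)  # 4 descriptors
fine = grid_descriptors(pairs, -3.0, 3.0, 6)    # 36 descriptors
print(len(image), len(image[0]), coarse, len(fine))
```

With the same four diagram points, the 2 x 2 grid yields 4 descriptors while the 6 x 6 grid yields 36, most of them zero. This is the sense in which a finer grid produces more, sparser descriptors, which is consistent with the overfitting the abstract reports and with the benefit of coarsening or PCA.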
Pages: 31-51
Number of pages: 21