Img2Mol - accurate SMILES recognition from molecular graphical depictions

Cited by: 42
Authors
Clevert, Djork-Arne [1]
Le, Tuan [1]
Winter, Robin [1]
Montanari, Floriane [1]
Affiliations
[1] Bayer AG, Machine Learning Res, Berlin, Germany
DOI
10.1039/d1sc01839f
Chinese Library Classification
O6 [Chemistry]
Subject classification code
0703
Abstract
The automatic recognition of the molecular content of a molecule's graphical depiction is an extremely challenging problem that remains largely unsolved despite decades of research. Recent advances in neural machine translation enable the auto-encoding of molecular structures in a continuous vector space of fixed size (latent representation) with low reconstruction errors. In this paper, we present a fast and accurate model combining deep convolutional neural network learning from molecule depictions and a pre-trained decoder that translates the latent representation into the SMILES representation of the molecules. This combination allows us to precisely infer a molecular structure from an image. Our rigorous evaluation shows that Img2Mol is able to correctly translate up to 88% of the molecular depictions into their SMILES representation. A pretrained version of Img2Mol is made publicly available on GitHub for non-commercial users.
Pages: 14174-14181
Number of pages: 8
Related papers
28 in total
[1]   ChemSchematicResolver: A Toolkit to Decode 2D Chemical Diagrams with Labels and R-Groups into Annotated Chemical Named Entities [J].
Beard, Edward J. ;
Cole, Jacqueline M. .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2020, 60 (04) :2059-2072
[2]   Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution [J].
Filippov, Igor V. ;
Nicklaus, Marc C. .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2009, 49 (03) :740-743
[3]   Markov Logic Networks for Optical Chemical Structure Recognition [J].
Frasconi, Paolo ;
Gabbrielli, Francesco ;
Lippi, Marco ;
Marinai, Simone .
JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2014, 54 (08) :2380-2390
[4]   The ChEMBL database in 2017 [J].
Gaulton, Anna ;
Hersey, Anne ;
Nowotka, Michal ;
Bento, A. Patricia ;
Chambers, Jon ;
Mendez, David ;
Mutowo, Prudence ;
Atkinson, Francis ;
Bellis, Louisa J. ;
Cibrian-Uhalte, Elena ;
Davies, Mark ;
Dedman, Nathan ;
Karlsson, Anneli ;
Magarinos, Maria Paula ;
Overington, John P. ;
Papadatos, George ;
Smit, Ines ;
Leach, Andrew R. .
NUCLEIC ACIDS RESEARCH, 2017, 45 (D1) :D945-D954
[5]   ChEMBL: a large-scale bioactivity database for drug discovery [J].
Gaulton, Anna ;
Bellis, Louisa J. ;
Bento, A. Patricia ;
Chambers, Jon ;
Davies, Mark ;
Hersey, Anne ;
Light, Yvonne ;
McGlinchey, Shaun ;
Michalovich, David ;
Al-Lazikani, Bissan ;
Overington, John P. .
NUCLEIC ACIDS RESEARCH, 2012, 40 (D1) :D1100-D1107
[6]   Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules [J].
Gomez-Bombarelli, Rafael ;
Wei, Jennifer N. ;
Duvenaud, David ;
Hernandez-Lobato, Jose Miguel ;
Sanchez-Lengeling, Benjamin ;
Sheberla, Dennis ;
Aguilera-Iparraguirre, Jorge ;
Hirzel, Timothy D. ;
Adams, Ryan P. ;
Aspuru-Guzik, Alan .
ACS CENTRAL SCIENCE, 2018, 4 (02) :268-276
[7]   patRoon: open source software platform for environmental mass spectrometry based non-target screening [J].
Helmus, Rick ;
ter Laak, Thomas L. ;
van Wezel, Annemarie P. ;
de Voogt, Pim ;
Schymanski, Emma L. .
JOURNAL OF CHEMINFORMATICS, 2021, 13 (01)
[8]  
Jungkap Park, 2010, 2010 IEEE International Conference on Automation Science and Engineering (CASE 2010), P168, DOI 10.1109/COASE.2010.5584695
[9]   PubChem in 2021: new data content and improved web interfaces [J].
Kim, Sunghwan ;
Chen, Jie ;
Cheng, Tiejun ;
Gindulyte, Asta ;
He, Jia ;
He, Siqian ;
Li, Qingliang ;
Shoemaker, Benjamin A. ;
Thiessen, Paul A. ;
Yu, Bo ;
Zaslavsky, Leonid ;
Zhang, Jian ;
Bolton, Evan E. .
NUCLEIC ACIDS RESEARCH, 2021, 49 (D1) :D1388-D1395
[10]   PubChem 2019 update: improved access to chemical data [J].
Kim, Sunghwan ;
Chen, Jie ;
Cheng, Tiejun ;
Gindulyte, Asta ;
He, Jia ;
He, Siqian ;
Li, Qingliang ;
Shoemaker, Benjamin A. ;
Thiessen, Paul A. ;
Yu, Bo ;
Zaslavsky, Leonid ;
Zhang, Jian ;
Bolton, Evan E. .
NUCLEIC ACIDS RESEARCH, 2019, 47 (D1) :D1102-D1109