Image2SMILES: Transformer-Based Molecular Optical Recognition Engine

Cited by: 24
Authors
Khokhlov, Ivan [1]
Krasnov, Lev [1, 2]
Fedorov, Maxim V. [1, 3, 4]
Sosnin, Sergey [1, 4]
Affiliations
[1] Syntelly LLC, Bolshoy Blvd 30,Bld 1, Moscow 121205, Russia
[2] Lomonosov Moscow State Univ, Dept Chem, 1 Leninskiye Gory, Moscow 119991, Russia
[3] Sirius Univ Sci & Technol, Olimpiysky Ave B-1, Sochi 354000, Russia
[4] Skolkovo Inst Sci & Technol, Bolshoy Blvd 30,Bld 1, Moscow 121205, Russia
Source
CHEMISTRY-METHODS | 2022 / Vol. 2 / Issue 01
Keywords
molecular OCR; machine learning; deep neural networks; Transformer; image captioning; image recognition; DECIMER; TOOL;
DOI
10.1002/cmtd.202100069
Chinese Library Classification (CLC) number
O6 [Chemistry];
Subject classification code
0703 ;
Abstract
The rise of deep learning in various scientific and technology areas promotes the development of AI-based tools for information retrieval. Optical recognition of organic structures is a key part of the automated extraction of chemical information. However, this is a challenging task because there is a large variety of representation styles. In this research, we present a Transformer-based artificial neural network to convert images of organic structures to molecular structures. To train the model, we created a comprehensive data generator that stochastically simulates various drawing styles, functional groups, functional group placeholders (R-groups), and visual contamination. We demonstrate that the Transformer-based architecture can gather chemical insights from our generator with almost absolute confidence. That means that, with a Transformer, one can fully concentrate on data simulation to build a good recognition model. A web demo of our optical recognition engine is available online on the Syntelly platform, and the code for dataset generation is available on GitHub.
Pages: 13