Back translation for molecule generation

Cited by: 7
Authors
Fan, Yang [1]
Xia, Yingce [2]
Zhu, Jinhua [1]
Wu, Lijun [2]
Xie, Shufang [2]
Qin, Tao [2]
Affiliations
[1] Univ Sci & Technol China, Hefei 230027, Anhui, Peoples R China
[2] Microsoft Res, Beijing 100080, Peoples R China
DOI
10.1093/bioinformatics/btab817
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Molecule generation, the task of generating new molecules, is an important problem in bioinformatics. Typical tasks include generating molecules with given properties, molecular property improvement (i.e. improving specific properties of an input molecule) and retrosynthesis (i.e. predicting the molecules that can be used to synthesize a target molecule). Deep-learning-based methods have recently received growing attention for molecule generation. Labeled data in bioinformatics is usually costly to obtain, but millions of unlabeled molecules are available. Inspired by the success of using unlabeled data for sequence generation in natural language processing, we explore an effective way of using unlabeled molecules for molecule generation.
Results: We propose back translation for molecule generation, a simple yet effective semi-supervised method. Let X be the source domain, which is the collection of properties, molecules to be optimized, etc., and let Y be the target domain, which is the collection of molecules. Given a main task of learning a mapping from the source domain X to the target domain Y, we first train a reversed model g for the Y-to-X mapping. We then use g to back translate the unlabeled data in Y to X, which yields additional synthetic data. Finally, we combine the synthetic data with the labeled data and train a model for the main task. We conduct experiments on molecular property improvement and retrosynthesis, and we achieve state-of-the-art results on four molecule generation tasks and one retrosynthesis benchmark, USPTO-50k.
Availability and implementation: Our code and data are available at https://github.com/fyabc/BT4MolGen.
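The three steps described in the abstract (train a reversed model g, back translate unlabeled targets, retrain on the combined data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the nearest-neighbor "model" is a toy stand-in for a neural sequence model, the function names `train_nearest_neighbor` and `back_translate` are hypothetical, and the SMILES pairs are made up for the example rather than taken from the paper's data.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (source x, target y)
Model = Callable[[str], str]      # maps an input string to an output string
Trainer = Callable[[List[Pair]], Model]


def train_nearest_neighbor(pairs: List[Pair]) -> Model:
    """Toy stand-in for a trained model (assumption, not the paper's model).

    Predicts the output paired with the most similar training input,
    where similarity is the Jaccard overlap of the character sets.
    """
    def similarity(a: str, b: str) -> float:
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)

    def predict(query: str) -> str:
        _, best_out = max(pairs, key=lambda p: similarity(query, p[0]))
        return best_out

    return predict


def back_translate(
    labeled: List[Pair],
    unlabeled_targets: List[str],
    train: Trainer,
) -> Tuple[Model, List[Pair]]:
    # Step 1: train the reversed model g (Y -> X) on swapped labeled pairs.
    g = train([(y, x) for x, y in labeled])
    # Step 2: back translate each unlabeled target y into a synthetic source g(y).
    synthetic = [(g(y), y) for y in unlabeled_targets]
    # Step 3: train the main model f (X -> Y) on labeled plus synthetic pairs.
    f = train(labeled + synthetic)
    return f, synthetic


# Illustrative data only: (input molecule, target molecule) SMILES pairs.
labeled = [("CCO", "CCN"), ("c1ccccc1", "c1ccccc1O")]
unlabeled = ["CCCN", "c1ccccc1N"]

f, synthetic = back_translate(labeled, unlabeled, train_nearest_neighbor)
for x, y in synthetic:
    print(f"synthetic pair: {x} -> {y}")
```

The `train` argument is passed in so that the same three-step procedure applies to any trainer with this interface; in practice it would be replaced by a real sequence-to-sequence or graph-based training routine.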
Pages: 1244-1251
Number of pages: 8