Regression Transformer enables concurrent sequence regression and generation for molecular language modelling

Cited by: 65
Authors
Born, Jannis [1 ,2 ]
Manica, Matteo [1 ]
Affiliations
[1] IBM Res Europe, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Dept Biosyst Sci & Engn, Basel, Switzerland
Keywords
DESIGN;
DOI
10.1038/s42256-023-00639-z
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Transformer models are gaining popularity in natural language modelling because they can produce human-sounding text by iteratively predicting the next word in a sentence. Born and Manica apply the idea of Transformer-based text completion to property prediction of chemical compounds: the model is given the context of a problem and asked to complete the missing information.

Despite the tremendous progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modelling problem. This introduces a new direction for multitask language models, seamlessly bridging sequence regression and conditional sequence generation. We demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by an alternating training scheme that enables the model to decorate seed sequences on the basis of desired property constraints, for example, to optimize reaction yield. We expect that the RT's capability to jointly tackle predictive and generative tasks in biochemistry can find applications in property-driven, local exploration of the chemical or protein space. Such multitask approaches will pave the way towards foundation models in materials design.
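The abstract's core idea, casting regression as a conditional sequence modelling problem, amounts to serializing a numeric property into tokens and placing them alongside the molecule's textual representation, so a single masked language model can either predict masked property tokens (regression) or masked molecule tokens (conditional generation). The following is a minimal sketch of such a serialization; the token scheme and helper names are hypothetical illustrations, not the authors' actual tokenizer:

```python
def encode_property(name: str, value: float, decimals: int = 3) -> list[str]:
    """Serialize a float into per-digit tokens whose suffix records the
    decimal place, so numeric magnitude survives tokenization.
    (Hypothetical scheme, loosely inspired by the paper's setup.)"""
    text = f"{value:.{decimals}f}"
    tokens = [f"<{name}>"]
    place = text.index(".") - 1  # decimal place of the leading digit
    for ch in text:
        if ch == ".":
            tokens.append("_._")
        else:
            tokens.append(f"_{ch}_{place}")
            place -= 1
    return tokens

def make_example(prop: str, value: float, smiles: str,
                 mask_property: bool) -> list[str]:
    """Build one training sequence. Masking the property digits yields a
    regression example; masking the molecule yields a generation example."""
    prop_tokens = encode_property(prop, value)
    mol_tokens = list(smiles)  # character-level stand-in for a SMILES tokenizer
    if mask_property:
        prop_tokens = [f"<{prop}>"] + ["[MASK]"] * (len(prop_tokens) - 1)
    else:
        mol_tokens = ["[MASK]"] * len(mol_tokens)
    return prop_tokens + ["|"] + mol_tokens

# Regression-style example: predict the masked QED digits from the molecule.
print(make_example("qed", 0.853, "CCO", mask_property=True))
# Generation-style example: generate a molecule conditioned on the property.
print(make_example("qed", 0.853, "CCO", mask_property=False))
```

Alternating between the two masking modes during training is what lets one set of weights serve both as a property predictor and as a property-conditioned generator.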
Pages: 432+
Page count: 21
References
90 in total
[21] Ertl, Peter. An algorithm to identify functional groups in organic molecules. JOURNAL OF CHEMINFORMATICS, 2017, 9.
[22] Estrada, Jesus G.; Ahneman, Derek T.; Sheridan, Robert P.; Dreher, Spencer D.; Doyle, Abigail G. Response to Comment on "Predicting reaction performance in C-N cross-coupling using machine learning". SCIENCE, 2018, 362 (6416).
[23] Fabian B., 2020, arXiv, DOI 10.48550/arXiv.2011.13230.
[24] Fan, Yang; Xia, Yingce; Zhu, Jinhua; Wu, Lijun; Xie, Shufang; Qin, Tao. Back translation for molecule generation. BIOINFORMATICS, 2022, 38 (05): 1244-1251.
[25] Fried D., 2022, CoRR, abs/2204.05999, DOI 10.48550/arXiv.2204.05999.
[26] Fu T., 2022, 10th International Conference on Learning Representations.
[27] Gilmer J., 2017, Proceedings of Machine Learning Research, V70.
[28] Gomez-Bombarelli, Rafael; Wei, Jennifer N.; Duvenaud, David; Hernandez-Lobato, Jose Miguel; Sanchez-Lengeling, Benjamin; Sheberla, Dennis; Aguilera-Iparraguirre, Jorge; Hirzel, Timothy D.; Adams, Ryan P.; Aspuru-Guzik, Alan. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS CENTRAL SCIENCE, 2018, 4 (02): 268-276.
[29] He Pengcheng, 2021, International Conference on Learning Representations.
[30] Irwin, Ross; Dimitriadis, Spyridon; He, Jiazhen; Bjerrum, Esben Jannik. Chemformer: a pre-trained transformer for computational chemistry. MACHINE LEARNING-SCIENCE AND TECHNOLOGY, 2022, 3 (01).