Regression Transformer enables concurrent sequence regression and generation for molecular language modelling

被引:52
作者
Born, Jannis [1 ,2 ]
Manica, Matteo [1 ]
机构
[1] IBM Res Europe, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Dept Biosyst Sci & Engn, Basel, Switzerland
关键词
DESIGN;
D O I
10.1038/s42256-023-00639-z
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Transformer models are gaining increasing popularity in modelling natural language as they can produce human-sounding text by iteratively predicting the next word in a sentence. Born and Manica apply the idea of Transformer-based text completion to property prediction of chemical compounds by providing the context of a problem and having the model complete the missing information. Despite tremendous progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modelling problem. This introduces a new direction for multitask language models, seamlessly bridging sequence regression and conditional sequence generation. We demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by an alternating training scheme that enables the model to decorate seed sequences on the basis of desired property constraints, for example, to optimize reaction yield. We expect that the RT's capability to jointly tackle predictive and generative tasks in biochemistry can find applications in property-driven, local exploration of the chemical or protein space. Such multitask approaches will pave the road towards foundation models in materials design.
引用
收藏
页码:432 / +
页数:21
相关论文
共 90 条
  • [1] Abid Abubakar, 2019, ARXIV, DOI DOI 10.48550/ARXIV.1906.02569
  • [2] Unified rational protein engineering with sequence-based deep representation learning
    Alley, Ethan C.
    Khimulya, Grigory
    Biswas, Surojit
    AlQuraishi, Mohammed
    Church, George M.
    [J]. NATURE METHODS, 2019, 16 (12) : 1315 - +
  • [3] MolGPT: Molecular Generation Using a Transformer-Decoder Model
    Bagal, Viraj
    Aggarwal, Rishal
    Vinod, P. K.
    Priyakumar, U. Deva
    [J]. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2022, 62 (09) : 2064 - 2076
  • [4] Bai H, 2021, AAAI CONF ARTIF INTE, V35, P12526
  • [5] UniProt: the universal protein knowledgebase in 2021
    Bateman, Alex
    Martin, Maria-Jesus
    Orchard, Sandra
    Magrane, Michele
    Agivetova, Rahat
    Ahmad, Shadab
    Alpi, Emanuele
    Bowler-Barnett, Emily H.
    Britto, Ramona
    Bursteinas, Borisas
    Bye-A-Jee, Hema
    Coetzee, Ray
    Cukura, Austra
    Da Silva, Alan
    Denny, Paul
    Dogan, Tunca
    Ebenezer, ThankGod
    Fan, Jun
    Castro, Leyla Garcia
    Garmiri, Penelope
    Georghiou, George
    Gonzales, Leonardo
    Hatton-Ellis, Emma
    Hussein, Abdulrahman
    Ignatchenko, Alexandr
    Insana, Giuseppe
    Ishtiaq, Rizwan
    Jokinen, Petteri
    Joshi, Vishal
    Jyothi, Dushyanth
    Lock, Antonia
    Lopez, Rodrigo
    Luciani, Aurelien
    Luo, Jie
    Lussi, Yvonne
    Mac-Dougall, Alistair
    Madeira, Fabio
    Mahmoudy, Mahdi
    Menchi, Manuela
    Mishra, Alok
    Moulang, Katie
    Nightingale, Andrew
    Oliveira, Carla Susana
    Pundir, Sangya
    Qi, Guoying
    Raj, Shriya
    Rice, Daniel
    Lopez, Milagros Rodriguez
    Saidi, Rabie
    Sampson, Joseph
    [J]. NUCLEIC ACIDS RESEARCH, 2021, 49 (D1) : D480 - D489
  • [6] Bavarian M., 2022, PREPRINT, DOI DOI 10.48550/ARXIV.2207.14255
  • [7] The properties of known drugs .1. Molecular frameworks
    Bemis, GW
    Murcko, MA
    [J]. JOURNAL OF MEDICINAL CHEMISTRY, 1996, 39 (15) : 2887 - 2893
  • [8] Bickerton GR, 2012, NAT CHEM, V4, P90, DOI [10.1038/NCHEM.1243, 10.1038/nchem.1243]
  • [9] Bjerrum EJ, 2017, PREPRINT, DOI DOI 10.48550/ARXIV.1703.07076
  • [10] Antibacterial peptides: basic facts and emerging concepts
    Boman, HG
    [J]. JOURNAL OF INTERNAL MEDICINE, 2003, 254 (03) : 197 - 215