Large-Scale Distributed Training of Transformers for Chemical Fingerprinting

Cited by: 19
Authors
Abdel-Aty, Hisham [1, 2]
Gould, Ian R. [1, 2]
Affiliations
[1] Imperial Coll London, Dept Chem, Mol Sci Res Hub, London W12 0BZ, England
[2] Imperial Coll London, Inst Chem Biol, Mol Sci Res Hub, London W12 0BZ, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
SMALL MOLECULES;
DOI
10.1021/acs.jcim.2c00715
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
Transformer models have become a popular choice for various machine learning tasks due to their often outstanding performance. Recently, transformers have been used in chemistry for reaction classification, reaction prediction, physicochemical property prediction, and more. These models require huge amounts of data and localized compute to train effectively. In this work, we demonstrate that these models can successfully be trained for chemical problems in a distributed manner across many computers, a more common scenario for chemistry institutions. We introduce MFBERT: Molecular Fingerprints through Bidirectional Encoder Representations from Transformers. We use distributed computing to pre-train a transformer model on one of the largest aggregate datasets in chemical literature and achieve state-of-the-art scores on a virtual screening benchmark for molecular fingerprints. We then fine-tune our model on smaller, more specific datasets to generate more targeted fingerprints and assess their quality. We utilize a SentencePiece tokenization model, so that the whole procedure from raw molecular representation to molecular fingerprints becomes data-driven, with no explicit tokenization rules.
Pages: 4852-4862
Page count: 11