Investigating the determinants of performance in machine learning for protein fitness prediction

Cited by: 0
Authors
Sandhu, Mahakaran [1]
Mater, Adam C. [1]
Matthews, Dana S. [1,2]
Spence, Matthew A. [1,2]
Lenskiy, Artem A. [3]
Jackson, Colin [1,2,4]
Affiliations
[1] Australian Natl Univ, Res Sch Chem, Canberra, ACT 2601, Australia
[2] Australian Natl Univ, ARC Ctr Excellence Innovat Peptide & Prot Sci, Res Sch Chem, Canberra, ACT, Australia
[3] Univ New South Wales, Sch Engn & Technol, Canberra, ACT, Australia
[4] Australian Natl Univ, ARC Ctr Excellence Synthet Biol, Res Sch Biol, Canberra, ACT, Australia
Keywords
epistasis; machine learning; mutational effect prediction; performance determinants; protein fitness landscapes; synthetic fitness landscapes; EPISTASIS; LANDSCAPES; EVOLUTION; APPROXIMATION; BINDING; ENERGY; MODEL
DOI
10.1002/pro.70235
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Machine learning (ML) has revolutionized protein biology, solving long-standing problems in protein folding, scaffold generation, and function design tasks. A range of architectures have shown success on supervised protein fitness prediction tasks. Nevertheless, in the absence of rational approaches for evaluating which architectures are optimal for specific datasets and engineering tasks, architecture choice remains challenging. Here, we propose a framework for investigating the determinants of success for a range of ML architectures. Using simulated (the NK model) and empirical fitness landscapes, we measure sequence-fitness prediction along six key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to increasing epistasis/ruggedness, ability to perform positional extrapolation, robustness to sparse training data, and sensitivity to sequence length. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness emerges as a primary determinant of the accuracy of sequence-fitness prediction. Our methodology and results provide a rational strategy for experimental data sampling, model selection, and evaluation rooted in fitness landscape theory, one that we hope will advance sequence-fitness prediction accuracy, with implications for protein engineering and variant functional prediction.
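The abstract names the NK model as its source of simulated landscapes and ruggedness as a primary determinant of accuracy. As a rough illustration of how K tunes ruggedness, the following is a minimal sketch of a generic NK landscape, not the authors' implementation. The function names (`make_nk_landscape`, `count_local_optima`), the circular adjacent-neighbour interaction scheme, the two-letter alphabet, and the uniform contribution values are all assumptions made for this example.

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AC", seed=0):
    """Return a fitness function for a random NK landscape.

    Assumptions (not taken from the paper): each site interacts with its
    K right-hand neighbours on a circular sequence, and each contribution
    is drawn uniformly from [0, 1).
    """
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    rng = random.Random(seed)
    # One lookup table per site: (own state + K neighbour states) -> contribution.
    tables = [
        {combo: rng.random() for combo in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        if len(seq) != n:
            raise ValueError("sequence length must equal n")
        total = 0.0
        for i in range(n):
            key = tuple(seq[(i + j) % n] for j in range(k + 1))
            total += tables[i][key]
        # Fitness is the mean of the per-site contributions.
        return total / n

    return fitness


def count_local_optima(fitness, n, alphabet="AC"):
    """Count sequences with no fitter single-mutant neighbour.

    The number of local optima is used here as a simple ruggedness proxy;
    exhaustive enumeration is only feasible for small n.
    """
    count = 0
    for seq in itertools.product(alphabet, repeat=n):
        f = fitness(seq)
        is_peak = True
        for i in range(n):
            for a in alphabet:
                if a != seq[i] and fitness(seq[:i] + (a,) + seq[i + 1:]) > f:
                    is_peak = False
        if is_peak:
            count += 1
    return count


if __name__ == "__main__":
    n = 8
    for k in (0, 2, 4, 7):
        landscape = make_nk_landscape(n, k, seed=42)
        print(f"K={k}: {count_local_optima(landscape, n)} local optima")
```

With K = 0 every site contributes independently, so the landscape is additive and has a single peak; raising K couples sites together (epistasis) and typically produces more local optima, which is the sense of ruggedness the abstract refers to.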
Pages: 18