Progress and opportunities of foundation models in bioinformatics

Cited by: 6
Authors
Li, Qing [1]
Hu, Zhihang [1]
Wang, Yixuan [1]
Li, Lei [1]
Fan, Yimin [1]
King, Irwin [1]
Jia, Gengjie [2]
Wang, Sheng [3, 4]
Song, Le [5]
Li, Yu [1]
Affiliations
[1] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong 999077, Peoples R China
[2] Chinese Acad Agr Sci, Agr Genom Inst Shenzhen, Guangdong Lab Lingnan Modern Agr, Shenzhen Branch,Genome Anal Lab,Minist Agr & Rural, Shenzhen 518120, Guangdong, Peoples R China
[3] Shanghai Zelixir Biotech Co Ltd, Shanghai 200030, Peoples R China
[4] Shenzhen Inst Adv Technol, Xueyuan Ave, Shenzhen 518055, Guangdong, Peoples R China
[5] BioMap, Zhongguancun Life Sci Pk, Beijing 100085, Peoples R China
Keywords
foundation models; large language models; bioinformatics; artificial intelligence; cell RNA-seq; protein structure; language; identification; prediction
DOI
10.1093/bib/bbae548
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification code
071010; 081704
Abstract
Bioinformatics has undergone a paradigm shift driven by artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination serves not only as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.
Pages: 20