Genomic language models: opportunities and challenges

Cited by: 7
Authors
Benegas, Gonzalo [1]
Ye, Chengzhong [2]
Albors, Carlos [1]
Li, Jianan Canal [1]
Song, Yun S. [1,2,3]
Affiliations
[1] Univ Calif Berkeley, Comp Sci Div, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Ctr Computat Biol, Berkeley, CA 94720 USA
Funding
US National Institutes of Health;
Keywords
PROTEIN-STRUCTURE; GENES; PREDICTION; EVOLUTION; VARIANTS; DATABASE; DNA;
DOI
10.1016/j.tig.2024.11.013
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Pages: 286-302
Page count: 17