DNA language models are powerful predictors of genome-wide variant effects

Cited by: 38
Authors
Benegas, Gonzalo [1]
Batra, Sanjit Singh [2]
Song, Yun S. [2, 3, 4]
Affiliations
[1] Univ Calif Berkeley, Grad Grp Computat Biol, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Comp Sci Div, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[4] Univ Calif Berkeley, Ctr Computat Biol, Berkeley, CA 94720 USA
Keywords
machine learning; language models; variant effect prediction; genome-wide association study; Arabidopsis thaliana; ARABIDOPSIS; IMPACT
DOI
10.1073/pnas.2311219120
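Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract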
The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
Pages: 9