A DNA language model based on multispecies alignment predicts the effects of genome-wide variants

Cited by: 8
Authors
Benegas, Gonzalo [1, 2]
Albors, Carlos [2]
Aw, Alan J. [3]
Ye, Chengzhong [3]
Song, Yun S. [2, 3, 4]
Affiliations
[1] Univ Calif Berkeley, Grad Grp Computat Biol, Berkeley, CA USA
[2] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[4] Univ Calif Berkeley, Ctr Computat Biol, Berkeley, CA 94720 USA
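Funding
US National Institutes of Health
Keywords
IDENTIFICATION; ASSOCIATION
DOI
10.1038/s41587-024-02511-w
Chinese Library Classification
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract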
Protein language models have demonstrated remarkable performance in predicting the effects of missense variants, but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions, which comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome. We anticipate that our advances in genome-wide variant effect prediction will enable more accurate rare disease diagnosis and improve rare variant burden testing.
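The figure of roughly 9 billion single-nucleotide variants can be checked with simple arithmetic: each reference position admits three alternative bases. The sketch below illustrates that count only. The genome length of about 3.1 billion positions is an assumed round number, not a value stated in the abstract, and the function names are hypothetical.

```python
# Illustrative arithmetic only; not taken from the GPN-MSA code base.
NUCLEOTIDES = "ACGT"

# Assumption: about 3.1 billion reference positions (approximate round figure).
ASSUMED_GENOME_LENGTH = 3_100_000_000


def possible_snvs(ref: str) -> list[str]:
    """Return the alternative bases that can replace a reference base."""
    return [alt for alt in NUCLEOTIDES if alt != ref]


def total_snvs(genome_length: int) -> int:
    """Count all single-nucleotide variants: three alternatives per position."""
    return genome_length * (len(NUCLEOTIDES) - 1)


if __name__ == "__main__":
    print(possible_snvs("A"))
    print(f"{total_snvs(ASSUMED_GENOME_LENGTH):,}")
```

Under this assumption the count comes out slightly above 9 billion, which is consistent with the approximate figure quoted in the abstract.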
Pages: 22