AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature

Cited by: 18
Authors
Birgmeier, Johannes [1]
Deisseroth, Cole A. [1]
Hayward, Laura E. [2]
Galhardo, Luisa M. T. [1]
Tierno, Andrew P. [1]
Jagadeesh, Karthik A. [1]
Stenson, Peter D. [3]
Cooper, David N. [3]
Bernstein, Jonathan A. [4]
Haeussler, Maximilian [5]
Bejerano, Gill [1, 4, 6, 7]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[2] Stanford Univ, Sch Med, Dept Genet, Stanford, CA 94305 USA
[3] Cardiff Univ, Sch Med, Inst Med Genet, Heath Pk, Cardiff, Wales
[4] Stanford Sch Med, Dept Pediat, Stanford, CA 94305 USA
[5] Univ Calif Santa Cruz, MS CBSE, Santa Cruz Genom Inst, Santa Cruz, CA 95064 USA
[6] Stanford Univ, Dept Dev Biol, Stanford, CA 94305 USA
[7] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94305 USA
Funding
Wellcome Trust (UK)
Keywords
automatic variant retrieval; machine learning; natural language processing; full-text extraction; variants database; SEQUENCE VARIANTS; DATABASE; CLINVAR; ARCHIVE; TMVAR; DBSNP;
DOI
10.1038/s41436-019-0643-6
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Purpose: Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval followed by expert evidence integration in search of diagnostic variants and genes. Here, we try to accelerate pathogenic variant evidence retrieval by an automatic approach.
Methods: Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates.
Results: AVADA automatically retrieved almost 60% of likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation, on 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar's 21, versus only 2 using the best current automated approach.
Conclusion: AVADA advances automated retrieval of pathogenic monogenic variant evidence from full-text literature. Far from perfect, but much faster than PubMed/Google Scholar search, careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.
Pages: 362-370
Page count: 9
Related papers
32 in total
[1]   A map of human genome variation from population-scale sequencing [J].
Altshuler, David ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Collins, Francis S. ;
De la Vega, Francisco M. ;
Donnelly, Peter ;
Egholm, Michael ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Knoppers, Bartha M. ;
Lander, Eric S. ;
Lehrach, Hans ;
Mardis, Elaine R. ;
McVean, Gil A. ;
Nickerson, Debbie A. ;
Peltonen, Leena ;
Schafer, Alan J. ;
Sherry, Stephen T. ;
Wang, Jun ;
Wilson, Richard K. ;
Gibbs, Richard A. ;
Deiros, David ;
Metzker, Mike ;
Muzny, Donna ;
Reid, Jeff ;
Wheeler, David ;
Wang, Jun ;
Li, Jingxiang ;
Jian, Min ;
Li, Guoqing ;
Li, Ruiqiang ;
Liang, Huiqing ;
Tian, Geng ;
Wang, Bo ;
Wang, Jian ;
Wang, Wei ;
Yang, Huanming ;
Zhang, Xiuqing ;
Zheng, Huisong ;
Lander, Eric S. ;
Altshuler, David L. ;
Ambrogio, Lauren ;
Bloom, Toby ;
Cibulskis, Kristian ;
Fennell, Tim J. ;
Gabriel, Stacey B. .
NATURE, 2010, 467 (7319) :1061-1073
[2]   OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders [J].
Amberger, Joanna S. ;
Bocchini, Carol A. ;
Schiettecatte, Francois ;
Scott, Alan F. ;
Hamosh, Ada .
NUCLEIC ACIDS RESEARCH, 2015, 43 (D1) :D789-D798
[3]  
[Anonymous], CLIN GEN CLINGEN RES
[4]   The variant call format and VCFtools [J].
Danecek, Petr ;
Auton, Adam ;
Abecasis, Goncalo ;
Albers, Cornelis A. ;
Banks, Eric ;
DePristo, Mark A. ;
Handsaker, Robert E. ;
Lunter, Gerton ;
Marth, Gabor T. ;
Sherry, Stephen T. ;
McVean, Gilean ;
Durbin, Richard .
BIOINFORMATICS, 2011, 27 (15) :2156-2158
[5]   ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis [J].
Deisseroth, Cole A. ;
Birgmeier, Johannes ;
Bodle, Ethan E. ;
Kohler, Jennefer N. ;
Matalon, Dena R. ;
Nazarenko, Yelena ;
Genetti, Casie A. ;
Brownstein, Catherine A. ;
Schmitz-Abe, Klaus ;
Schoch, Kelly ;
Cope, Heidi ;
Signer, Rebecca ;
Undiagnosed Diseases Network ;
Martinez-Agosto, Julian A. ;
Shashi, Vandana ;
Beggs, Alan H. ;
Wheeler, Matthew T. ;
Bernstein, Jonathan A. ;
Bejerano, Gill .
GENETICS IN MEDICINE, 2019, 21 (07) :1585-1593
[6]   Clinical Interpretation and Implications of Whole-Genome Sequencing [J].
Dewey, Frederick E. ;
Grove, Megan E. ;
Pan, Cuiping ;
Goldstein, Benjamin A. ;
Bernstein, Jonathan A. ;
Chaib, Hassan ;
Merker, Jason D. ;
Goldfeder, Rachel L. ;
Enns, Gregory M. ;
David, Sean P. ;
Pakdaman, Neda ;
Ormond, Kelly E. ;
Caleshu, Colleen ;
Kingham, Kerry ;
Klein, Teri E. ;
Whirl-Carrillo, Michelle ;
Sakamoto, Kenneth ;
Wheeler, Matthew T. ;
Butte, Atul J. ;
Ford, James M. ;
Boxer, Linda ;
Ioannidis, John P. A. ;
Yeung, Alan C. ;
Altman, Russ B. ;
Assimes, Themistocles L. ;
Snyder, Michael ;
Ashley, Euan A. ;
Quertermous, Thomas .
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, 2014, 311 (10) :1035-1044
[7]   Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature [J].
Doughty, Emily ;
Kertesz-Farkas, Attila ;
Bodenreider, Olivier ;
Thompson, Gary ;
Adadey, Asa ;
Peterson, Thomas ;
Kann, Maricel G. .
BIOINFORMATICS, 2011, 27 (03) :408-415
[8]   Large-scale discovery of novel genetic causes of developmental disorders [J].
Fitzgerald, T. W. ;
Gerety, S. S. ;
Jones, W. D. ;
van Kogelenberg, M. ;
King, D. A. ;
McRae, J. ;
Morley, K. I. ;
Parthiban, V. ;
Al-Turki, S. ;
Ambridge, K. ;
Barrett, D. M. ;
Bayzetinova, T. ;
Clayton, S. ;
Coomber, E. L. ;
Gribble, S. ;
Jones, P. ;
Krishnappa, N. ;
Mason, L. E. ;
Middleton, A. ;
Miller, R. ;
Prigmore, E. ;
Rajan, D. ;
Sifrim, A. ;
Tivey, A. R. ;
Ahmed, M. ;
Akawi, N. ;
Andrews, R. ;
Anjum, U. ;
Archer, H. ;
Armstrong, R. ;
Balasubramanian, M. ;
Banerjee, R. ;
Baralle, D. ;
Batstone, P. ;
Baty, D. ;
Bennett, C. ;
Berg, J. ;
Bernhard, B. ;
Bevan, A. P. ;
Blair, E. ;
Blyth, M. ;
Bohanna, D. ;
Bourdon, L. ;
Bourn, D. ;
Brady, A. ;
Bragin, E. ;
Brewer, C. ;
Brueton, L. ;
Brunstrom, K. ;
Bumpstead, S. J. .
NATURE, 2015, 519 (7542) :223-+
[9]   Genenames.org: the HGNC resources in 2015 [J].
Gray, Kristian A. ;
Yates, Bethan ;
Seal, Ruth L. ;
Wright, Mathew W. ;
Bruford, Elspeth A. .
NUCLEIC ACIDS RESEARCH, 2015, 43 (D1) :D1079-D1085
[10]   Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization [J].
Jagadeesh, Karthik A. ;
Birgmeier, Johannes ;
Guturu, Harendra ;
Deisseroth, Cole A. ;
Wenger, Aaron M. ;
Bernstein, Jonathan A. ;
Bejerano, Gill .
GENETICS IN MEDICINE, 2019, 21 (02) :464-470