Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model

Cited by: 4
Authors
Han, Seong Kyu [1 ,5 ]
Muto, Yoshiharu [2 ]
Wilson, Parker C. [3 ]
Humphreys, Benjamin D. [2 ,4 ]
Sampson, Matthew G. [1 ,5 ]
Chakravarti, Aravinda [6 ]
Lee, Dongwon [1 ,7 ]
Affiliations
[1] Boston & Harvard Med Sch, Boston Childrens Hosp, Dept Pediat, Div Nephrol, Boston, MA 02115 USA
[2] Washington Univ St Louis, Dept Med, Div Nephrol, St Louis, MO 63130 USA
[3] Washington Univ St Louis, Dept Pathol & Immunol, St Louis, MO 63130 USA
[4] Washington Univ St Louis, Dept Dev Biol, St Louis, MO 63130 USA
[5] Broad Inst & Harvard, Kidney Dis Initiat, Cambridge, MA 02142 USA
[6] New York Univ, Ctr Human Genet & Genom, Grossman Sch Med, New York, NY 10016 USA
[7] Boston Childrens Hosp, Manton Ctr Orphan Res, Boston, MA 02115 USA
Keywords
quality control; chromatin accessibility; sequence-based model; gkmQC; GENOME-WIDE ASSOCIATION; BINDING PROTEINS; DNA; VISUALIZATION; ENHANCERS; VARIANTS; ENCODE; LMX1B; CHIP;
DOI
10.1073/pnas.2212810119
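Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes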
07 ; 0710 ; 09 ;
Abstract
Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the quality of the resulting data varies widely because of several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify "high-quality" (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
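The abstract states that gkmQC derives a sample's quality metric from the prediction accuracy of sequence-based models, but it does not give the exact formula. As a minimal sketch, assuming accuracy is summarized as the area under the ROC curve (AUC) for separating peak sequences from background sequences, the calculation could look like the following. The function name `auc` and the example scores are hypothetical and are not taken from the gkmQC software; the scores stand in for the outputs of any trained classifier.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    pos_scores: classifier scores for sequences under called peaks (assumed input).
    neg_scores: classifier scores for background sequences (assumed input).
    Returns the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: a cleaner sample separates peaks from background better.
clean_sample = auc([0.9, 0.8, 0.7], [0.2, 0.1, 0.3])
noisy_sample = auc([0.9, 0.3], [0.5, 0.1])
```

Under this assumption, a sample whose peaks are easier to predict from sequence alone receives a higher score, which matches the abstract's idea that read depth by itself need not determine quality.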
Pages: 11