NeuronMotif: Deciphering cis-regulatory codes by layer-wise demixing of deep neural networks

Cited by: 9
Authors
Wei, Zheng [1, 2]
Hua, Kui [1]
Wei, Lei [1]
Ma, Shining [3, 4]
Jiang, Rui [1]
Zhang, Xuegong [1]
Li, Yanda [1]
Wong, Wing H. [3, 4, 5]
Wang, Xiaowo [1]
Affiliations
[1] Tsinghua Univ, Dept Automat, Beijing Natl Res Ctr Informat Sci & Technol, Ctr Synthet & Syst Biol,Minist Educ,Key Lab Bioinf, Beijing 100084, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China
[3] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94305 USA
[5] Stanford Univ, Ctr Personal Dynam Regulomes, BioX Program, Stanford, CA 94305 USA
Funding
National Natural Science Foundation of China;
Keywords
cis-regulatory grammar; motif combination; deep neural network; model interpretation; multifaceted neuron; BINDING PROTEINS; NOVO DISCOVERY; OPEN CHROMATIN; DNA; SEQUENCE; GENOME; SPECIFICITIES; ACCESSIBILITY; MODULES;
DOI
10.1073/pnas.2216698120
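Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification codes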
07 ; 0710 ; 09 ;
Abstract
Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Although deep convolutional neural networks (CNNs) have achieved great success in predicting cis-regulatory elements, the discovery of motifs and their combinatorial patterns from these CNN models has remained difficult. We show that the main difficulty is due to the problem of multifaceted neurons which respond to multiple types of sequence patterns. Since existing interpretation methods were mainly designed to visualize the class of sequences that can activate the neuron, the resulting visualization will correspond to a mixture of patterns. Such a mixture is usually difficult to interpret without resolving the mixed patterns. We propose the NeuronMotif algorithm to interpret such neurons. Given any convolutional neuron (CN) in the network, NeuronMotif first generates a large sample of sequences capable of activating the CN, which typically consists of a mixture of patterns. Then, the sequences are "demixed" in a layer-wise manner by backward clustering of the feature maps of the involved convolutional layers. NeuronMotif can output the sequence motifs, and the syntax rules governing their combinations are depicted by position weight matrices organized in tree structures. Compared to existing methods, the motifs found by NeuronMotif have more matches to known motifs in the JASPAR database. The higher-order patterns uncovered for deep CNs are supported by the literature and ATAC-seq footprinting. Overall, NeuronMotif enables the deciphering of cis-regulatory codes from deep CNs and enhances the utility of CNN in genome interpretation.
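The abstract's central point, that a neuron responding to several sequence patterns yields an uninformative mixed visualization until the patterns are separated, can be illustrated with a toy sketch. This is not the NeuronMotif algorithm: NeuronMotif demixes by backward clustering of convolutional feature maps, whereas the sketch below clusters the sequences directly by Hamming distance. The sequences, the two-cluster assumption, and every function name are hypothetical and chosen only for illustration.

```python
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(seqs):
    """Per-position base frequencies for equal-length sequences."""
    pfm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pfm.append({b: counts[b] / len(seqs) for b in BASES})
    return pfm

def consensus(pfm):
    """Most frequent base at each position (ties broken in BASES order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in pfm)

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demix_two(seqs, n_iter=5):
    """Toy two-cluster split (an assumption, not the paper's method):
    repeatedly assign each sequence to the nearer cluster consensus."""
    # Deterministic seeding: the first sequence and the one farthest from it.
    c0 = seqs[0]
    c1 = max(seqs, key=lambda s: hamming(s, c0))
    g0, g1 = list(seqs), []
    for _ in range(n_iter):
        g0 = [s for s in seqs if hamming(s, c0) <= hamming(s, c1)]
        g1 = [s for s in seqs if hamming(s, c0) > hamming(s, c1)]
        if not g0 or not g1:
            break
        c0 = consensus(position_frequency_matrix(g0))
        c1 = consensus(position_frequency_matrix(g1))
    return g0, g1

# Hypothetical sequences that all "activate" one multifaceted neuron.
mixed = [
    "TGACTCA", "CACGTGG", "TGAGTCA", "CACGTGA",
    "TGACTCA", "CACGTGC", "TGACTCT", "CACGTGG",
]

pfm_mixed = position_frequency_matrix(mixed)
group_a, group_b = demix_two(mixed)
motif_a = consensus(position_frequency_matrix(group_a))
motif_b = consensus(position_frequency_matrix(group_b))

print("mixed first column:", pfm_mixed[0])
print("demixed motifs:", motif_a, motif_b)
```

In the mixed matrix the first position is split evenly between two bases, so no single motif is readable from it; after the split, each group gives a clean consensus.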
Pages: 12