LARGE-SCALE MULTIPLE INFERENCE OF COLLECTIVE DEPENDENCE WITH APPLICATIONS TO PROTEIN FUNCTION

Cited by: 0
Authors
Jernigan, Robert [1]
Jia, Kejue [1]
Ren, Zhao [2]
Zhou, Wen [3]
Affiliations
[1] Iowa State Univ, Program Bioinformat & Computat Biol, Dept Biochem Biophys & Mol Biol, Ames, IA 50011 USA
[2] Univ Pittsburgh, Dept Stat, Pittsburgh, PA 15260 USA
[3] Colorado State Univ, Dept Stat, Ft Collins, CO 80523 USA
Keywords
Collective dependence; false discovery rate; information theoretic measure; multiple testing; protein coevolution; structural biology; GAUSSIAN GRAPHICAL MODEL; FALSE DISCOVERY RATE; GENE-EXPRESSION; INFORMATION; ENTROPY; COEVOLUTION; VARIABILITY; NETWORK
DOI
10.1214/20-AOAS1431
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Measuring the dependence of k >= 3 random variables and drawing inference from such higher-order dependences are scientifically important yet challenging. Motivated here by protein coevolution with multivariate categorical features, we consider an information theoretic measure of higher-order dependence. The proposed collective dependence is a symmetrization of differential interaction information which generalizes the mutual information of a pair of random variables. We show that the collective dependence can be easily estimated and facilitates a test on the dependence of k >= 3 random variables. Upon carefully exploring the null space of collective dependence, we devise a Classification-Assisted Large scaLe inference procedure to DEtect significant k-COllective DEpendence among d >= k random variables, with the false discovery rate controlled. Finite sample performance of our method is examined via simulations. We apply this method to the multiple protein sequence alignment data to study the residue or position coevolution for two protein families, the elongation factor P family and the zinc knuckle family. We identify novel functional triplets of amino acid residues, whose contributions to the protein function are further investigated. These confirm that the collective dependence does yield additional information important for understanding the protein coevolution compared to the pairwise measures.
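To make the idea of higher-order dependence concrete, the following is a minimal sketch of a plug-in (empirical frequency) estimate of the classical three-variable interaction information, which the abstract names as the quantity underlying collective dependence. It is not the authors' collective dependence measure, whose exact symmetrized definition is not given in this record, and it is not their testing procedure. The function names, the sign convention I(X;Y|Z) - I(X;Y), and the XOR toy data are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable outcomes."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """Plug-in mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_information(x, y, z):
    """Plug-in interaction information, using the convention I(X;Y|Z) - I(X;Y).

    Expanded by inclusion-exclusion over entropies:
    H(XY) + H(XZ) + H(YZ) - H(X) - H(Y) - H(Z) - H(XYZ).
    """
    pairs = (
        entropy(list(zip(x, y)))
        + entropy(list(zip(x, z)))
        + entropy(list(zip(y, z)))
    )
    singles = entropy(x) + entropy(y) + entropy(z)
    return pairs - singles - entropy(list(zip(x, y, z)))


# Toy data (hypothetical): X and Y are independent fair bits, Z = X XOR Y.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]

pairwise = [mutual_information(x, y), mutual_information(x, z), mutual_information(y, z)]
triple = interaction_information(x, y, z)
print(pairwise, triple)
```

In this toy case every pairwise mutual information is zero while the three-way quantity is one bit, which illustrates the kind of dependence that pairwise measures cannot detect. Plug-in estimates of this sort are biased for small samples, so real alignment data would call for more care than this sketch takes.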
Pages: 902-924
Number of pages: 23