Learning Predictive Interactions Using Information Gain and Bayesian Network Scoring

Cited by: 16
Authors
Jiang, Xia [1]
Jao, Jeremy [1]
Neapolitan, Richard [2]
Affiliations
[1] Univ Pittsburgh, Dept Biomed Informat, Pittsburgh, PA 15213 USA
[2] Northwestern Univ, Feinberg Sch Med, Dept Prevent Med, Chicago, IL 60611 USA
Keywords
GENOME-WIDE ASSOCIATION; EPISTATIC INTERACTIONS; DISEASE; INFERENCE; ALGORITHM; RISK; GENE
DOI
10.1371/journal.pone.0143247
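Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code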
07 ; 0710 ; 09 ;
Abstract
Background
The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address these problems. We are now in the era of high-dimensional data, which can involve billions of variables. These data present new challenges. In particular, it is difficult to discover predictive variables when each variable has little marginal effect. An example concerns Genome-wide Association Studies (GWAS) datasets, which involve millions of single nucleotide polymorphisms (SNPs), where some of the SNPs interact epistatically to affect disease status. To determine these interacting SNPs, researchers developed techniques that addressed this specific problem. However, the problem is more general, and so these techniques are applicable to other problems concerning interactions. A difficulty with many of these techniques is that they do not distinguish whether a learned interaction is actually an interaction or whether it involves several variables with strong marginal effects.

Methodology/Findings
We address this problem using information gain and Bayesian network scoring. First, we identify candidate interactions by determining whether variables together provide more information than they do separately. Then we use Bayesian network scoring to see whether a candidate interaction really is a likely model. Our strategy is called MBS-IGain. Using 100 simulated datasets and a real GWAS Alzheimer's dataset, we investigated the performance of MBS-IGain.

Conclusions/Significance
When analyzing the simulated datasets, MBS-IGain substantially outperformed nine previous methods at locating interacting predictors and at identifying interactions exactly. When analyzing the real Alzheimer's dataset, we obtained new results and results that substantiated previous findings. We conclude that MBS-IGain is highly effective at finding interactions in high-dimensional datasets. This result is significant because we have increasingly abundant high-dimensional data in many domains, and to learn causes and perform prediction/classification using these data, we often must first identify interactions.
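
The idea of variables providing "more information together than separately" can be illustrated with a small, generic sketch. This is not the authors' MBS-IGain implementation: the record does not give the paper's exact information gain definition, its search procedure, or its Bayesian network score, so the function names (`entropy`, `info_gain`, `interaction_gain`) and the XOR toy data below are assumptions made purely for illustration. The sketch computes the empirical information gain of two discrete predictors jointly and subtracts their individual gains.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(target, *features):
    """IG(target; features) = H(target) - H(target | features)."""
    n = len(target)
    groups = {}
    # Group target values by the joint configuration of the features.
    for key, t in zip(zip(*features), target):
        groups.setdefault(key, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def interaction_gain(target, x, y):
    """Hypothetical helper: joint gain minus the sum of the marginal gains."""
    return info_gain(target, x, y) - info_gain(target, x) - info_gain(target, y)


# Toy data (assumed, not from the paper): z is the XOR of x and y, so neither
# predictor is informative alone, but together they determine z completely.
x = [0, 0, 1, 1] * 25
y = [0, 1, 0, 1] * 25
z = [a ^ b for a, b in zip(x, y)]

xor_gain = interaction_gain(z, x, y)       # pure interaction
additive_gain = interaction_gain(x, x, y)  # target depends on x alone
```

On this toy data the XOR target yields an interaction gain of about 1 bit, while a target that depends on `x` alone yields about 0, which is the distinction between a genuine interaction and a strong marginal effect that the abstract describes.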
Pages: 23