Classification of RNA-Seq data via Gaussian copulas

Cited by: 1
Authors
Zhang, Qingyang [1]
Affiliations
[1] Univ Arkansas, Dept Math Sci, Fayetteville, AR 72701 USA
Source
STAT | 2017 / Vol. 6 / Issue 01
Keywords
correlated counts; Gaussian copula; negative binomial distribution; RNA-Seq; sample classification; DISPERSION;
DOI
10.1002/sta4.144
Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification code
020208 ; 070103 ; 0714 ;
Abstract
RNA-sequencing (RNA-Seq) has become a preferred option to quantify gene expression, because it is more accurate and reliable than microarrays. In RNA-Seq experiments, the expression level of a gene is measured by the count of short reads that are mapped to the gene region. Although some normal-based statistical methods may also be applied to log-transformed read counts, they are not ideal for directly modelling RNA-Seq data. Two discrete distributions, Poisson distribution and negative binomial distribution, have been commonly used in the literature to model RNA-Seq data, where the latter is a natural extension of the former with allowance of overdispersion. Because of the technical difficulty in modelling correlated counts, most existing classifiers based on discrete distributions assume that genes are independent of each other. However, as we show in this paper, the independence assumption may cause non-ignorable bias in estimating the discriminant score, making the classification inaccurate. To this end, we drop the independence assumption and explicitly model the dependence between genes using a Gaussian copula. We apply a Bayesian approach to estimate the covariance matrix and the overdispersion parameter in negative binomial distribution. Both synthetic data and real data are used to demonstrate the advantages of our model. Copyright (C) 2017 John Wiley & Sons, Ltd.
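The modelling idea in the abstract, negative binomial margins tied together by a Gaussian copula, can be illustrated with a minimal sketch for two genes. This is not the paper's Bayesian estimation procedure or its classifier; it only shows how correlated counts arise from such a model. The function names, the two-gene restriction, and the parameterization (mean `mu`, dispersion `phi`, variance `mu + phi * mu**2`) are assumptions made for illustration.

```python
import math
import random


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi.

    Assumed parameterization: variance = mu + phi * mu**2.
    """
    r = 1.0 / phi  # "size" parameter
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)


def nb_quantile(u, mu, phi, k_max=10000):
    """Smallest count k whose cumulative probability reaches u."""
    cdf = 0.0
    for k in range(k_max):
        cdf += nb_pmf(k, mu, phi)
        if cdf >= u:
            return k
    return k_max  # fallback for u extremely close to 1


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_correlated_counts(n, mu, phi, rho, seed=0):
    """Draw n pairs of counts for two genes (hypothetical helper).

    Steps: draw a bivariate normal with correlation rho, map each
    coordinate to a uniform via the normal CDF, then map each uniform
    to a count via the negative binomial quantile function.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        u1, u2 = normal_cdf(z1), normal_cdf(z2)
        samples.append((nb_quantile(u1, mu[0], phi[0]),
                        nb_quantile(u2, mu[1], phi[1])))
    return samples


if __name__ == "__main__":
    pairs = sample_correlated_counts(5, mu=(20.0, 50.0), phi=(0.2, 0.1), rho=0.8)
    for gene_a, gene_b in pairs:
        print(gene_a, gene_b)
```

Setting `rho=0.0` recovers the independent-genes model that the abstract argues against; a nonzero `rho` leaves each gene's marginal distribution unchanged while inducing dependence between the two counts.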
Pages: 171-183
Page count: 13