Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions

Cited by: 20
Authors
Barton, Sheila J. [1]
Crozier, Sarah R. [1]
Lillycrop, Karen A. [4,5]
Godfrey, Keith M. [1,2,3,4]
Inskip, Hazel M. [1]
Affiliations
[1] Univ Southampton, MRC Lifecourse Epidemiol Unit, Southampton, Hants, England
[2] Univ Southampton, NIHR Southampton Biomed Res Ctr, Southampton, Hants, England
[3] Univ Hosp Southampton NHS Fdn Trust, Southampton, Hants, England
[4] Univ Southampton, Human Dev & Hlth Acad Unit, Southampton, Hants, England
[5] Univ Southampton, Sch Biol Sci, Southampton, Hants, England
Funding
UK Medical Research Council
Keywords
P values; Distributions; Statistical analysis; Statistical assumptions; Whole genome methylation promoter arrays; Epigenome; HETEROSCEDASTICITY; EXPRESSION
DOI
10.1186/1471-2164-14-161
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Background: Statistical analysis of genome-wide microarrays can result in many thousands of identical statistical tests being performed as each probe is tested for an association with a phenotype of interest. If there were no association between any of the probes and the phenotype, the distribution of P values obtained from statistical tests would resemble a uniform distribution. If a selection of probes were significantly associated with the phenotype, we would expect to observe P values for these probes of less than the designated significance level, alpha, resulting in more P values of less than alpha than expected by chance.
Results: In data from a whole genome methylation promoter array we unexpectedly observed P value distributions where there were fewer P values less than alpha than would be expected by chance. Our data suggest that a possible reason for this is a violation of the statistical assumptions required for these tests arising from heteroskedasticity. A simple but statistically sound remedy (a heteroskedasticity-consistent covariance matrix estimator to calculate standard errors of regression coefficients that are robust to heteroskedasticity) rectified this violation and resulted in meaningful P value distributions.
Conclusions: The statistical analysis of 'omics data requires careful handling, especially in the choice of statistical test. To obtain meaningful results it is essential that the assumptions behind these tests are carefully examined and any violations rectified where possible, or a more appropriate statistical test chosen.
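The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' analysis: the data are simulated, the HC3 variant of the heteroskedasticity-consistent estimator is an assumed choice (the abstract does not say which variant was used), and P values use a normal approximation rather than a t distribution. Every "probe" is generated with no true association, but the error variance is largest near the mean of the phenotype, so classical standard errors are too large and too few P values fall below alpha. The function names `slope_p_values` and `simulate` are hypothetical.

```python
import math
import random


def slope_p_values(x, y):
    """Return (classical_p, robust_p) for the slope in the model y ~ 1 + x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    resid = [(yi - y_bar) - slope * d for d, yi in zip(dx, y)]

    # Classical variance: assumes one common error variance for all samples.
    var_classical = sum(e * e for e in resid) / (n - 2) / sxx

    # HC3 sandwich variance for the slope (assumed variant): each residual
    # is rescaled by (1 - h_i), where h_i is the leverage of sample i.
    var_hc3 = 0.0
    for d, e in zip(dx, resid):
        h = 1.0 / n + d * d / sxx
        var_hc3 += d * d * (e / (1.0 - h)) ** 2
    var_hc3 /= sxx * sxx

    def two_sided_p(variance):
        # Normal approximation to the two-sided P value (an assumption;
        # reasonable here because n is fairly large).
        z = abs(slope) / math.sqrt(variance)
        return math.erfc(z / math.sqrt(2.0))

    return two_sided_p(var_classical), two_sided_p(var_hc3)


def simulate(n_probes=2000, n_samples=100, alpha=0.05, seed=1):
    """Fraction of null probes with P < alpha, classical versus HC3."""
    rng = random.Random(seed)
    # Hypothetical phenotype values, shared by all probes.
    x = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    # Heteroskedastic errors: standard deviation shrinks as |x| grows.
    sd = [1.0 - 0.9 * abs(xi) for xi in x]
    hits_classical = 0
    hits_robust = 0
    for _ in range(n_probes):
        y = [rng.gauss(0.0, s) for s in sd]  # no true association with x
        p_classical, p_robust = slope_p_values(x, y)
        hits_classical += p_classical < alpha
        hits_robust += p_robust < alpha
    return hits_classical / n_probes, hits_robust / n_probes


if __name__ == "__main__":
    frac_classical, frac_robust = simulate()
    print("fraction of P < 0.05, classical SE:", frac_classical)
    print("fraction of P < 0.05, HC3 SE:      ", frac_robust)
```

Under these assumptions the classical fraction should fall well below alpha, while the HC3 fraction should sit close to it, which is the pattern the abstract reports. With the opposite variance pattern (larger variance at extreme phenotype values) the classical test would instead produce too many small P values.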
Pages: 9