Size matters: how sample size affects the reproducibility and specificity of gene set analysis

Times cited: 32
Authors
Maleki, Farhad [1]
Ovens, Katie [1]
McQuillan, Ian [1]
Kusalik, Anthony J. [1]
Affiliations
[1] Univ Saskatchewan, Dept Comp Sci, 110 Sci Pl, Saskatoon, SK, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Gene expression; Gene set analysis; Enrichment analysis; Sample size; Specificity; Expression; Enrichment
DOI
10.1186/s40246-019-0226-2
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Background: Gene set analysis is a well-established approach for interpreting data from high-throughput gene expression studies. Achieving reproducible results is an essential requirement in such studies. One factor of a gene expression experiment that can affect reproducibility is the choice of sample size. However, choosing an appropriate sample size can be difficult, especially because the choice may be method-dependent. Further, sample size choice can have unexpected effects on specificity.
Results: In this paper, we report on a systematic, quantitative approach to study the effect of sample size on the reproducibility of the results from 13 gene set analysis methods. We also investigate the impact of sample size on the specificity of these methods. Rather than relying on synthetic data, the proposed approach uses real expression datasets to offer an accurate and reliable evaluation.
Conclusion: Our findings show that, as a general pattern, the results of gene set analysis become more reproducible as sample size increases. However, the extent of reproducibility and the rate at which it increases vary from method to method. In addition, even in the absence of differential expression, some gene set analysis methods report a large number of false positives, and increasing the sample size does not reduce these false positives. The results of this research can be used when selecting a gene set analysis method from those available.
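The abstract does not spell out how reproducibility is quantified, so the following is only an illustrative sketch of one way such a study could be set up: draw replicate subsamples of a fixed size from a real dataset, run a gene set analysis method on each, and measure how much the reported gene sets overlap. The function names (`jaccard`, `reproducibility`), the `run_gsa` callback, and the choice of mean pairwise Jaccard overlap are assumptions made for illustration, not the authors' procedure.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def reproducibility(cases, controls, n, run_gsa, replicates=10, seed=0):
    """Mean pairwise Jaccard overlap of significant gene sets across replicates.

    cases, controls: lists of sample identifiers from a real dataset.
    n: number of cases and of controls drawn (without replacement) per replicate.
    run_gsa: hypothetical callback (sub_cases, sub_controls) -> iterable of
             gene set names reported as significant by the method under study.
    """
    if replicates < 2:
        raise ValueError("at least two replicates are needed to compare results")
    rng = random.Random(seed)
    results = []
    for _ in range(replicates):
        sub_cases = rng.sample(cases, n)
        sub_controls = rng.sample(controls, n)
        results.append(set(run_gsa(sub_cases, sub_controls)))
    pairs = list(combinations(results, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under the same assumptions, a specificity check could reuse this setup by drawing both groups from control samples only, so that any gene set reported as significant counts as a false positive.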
Pages: 12