Data-driven human transcriptomic modules determined by independent component analysis

Cited by: 17
Authors
Zhou, Weizhuang [1]
Altman, Russ B. [1, 2]
Affiliations
[1] Stanford Univ, Dept Bioengn, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Genet, Stanford, CA 94305 USA
Source
BMC BIOINFORMATICS | 2018 / Vol. 19
Keywords
Independent component analysis; Gene expression; Functional modules; Transcriptome; HORNS PARALLEL ANALYSIS; GENE-EXPRESSION DATA; SIGNATURES; NUMBER; DIMENSIONALITY; EXTRACTION; PROGRAMS; MAP
DOI
10.1186/s12859-018-2338-4
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Analyzing the human transcriptome is crucial to advancing precision medicine, and the more than half a million human microarray samples in the Gene Expression Omnibus (GEO) have enabled us to better characterize biological processes at the molecular level. However, transcriptomic analysis is challenging because the data are inherently noisy and high-dimensional. Gene set analysis is widely used to alleviate the issue of high dimensionality, but the user-defined choice of gene sets can introduce bias into the results. In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their use as features for sample classification and their robustness when training sets are small, (2) their regularization of data when clustering samples, and (3) the biological relevance of differentially expressed features.
Results: We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data are clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.
Conclusions: The 139 FCs provide a biologically relevant summarization of transcriptomic data, and their performance in low-sample settings suggests that they should be employed in such studies in order to harness the data efficiently.
Pages: 25
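The abstract describes using a fixed set of transcriptomic modules as features in place of individual genes. As a rough illustration only, the sketch below maps gene-space expression profiles into a module space by least squares. The record does not say how the authors project samples onto their FCs, so the projection method, the function name `project_to_module_space`, and the synthetic data are all assumptions made for this example, not the paper's procedure.

```python
import numpy as np

def project_to_module_space(expression, modules):
    """Hypothetical helper (not from the paper).

    expression: (n_samples, n_genes) gene-space expression matrix.
    modules:    (n_modules, n_genes) fixed module-by-gene weight matrix.
    Returns an (n_samples, n_modules) matrix of module activations, chosen
    so that activations @ modules approximates expression in the
    least-squares sense.
    """
    # Solve modules.T @ activations.T ~= expression.T for the activations.
    activations_t, *_ = np.linalg.lstsq(modules.T, expression.T, rcond=None)
    return activations_t.T

# Synthetic stand-in data; real modules would come from ICA on a large compendium.
rng = np.random.default_rng(0)
n_samples, n_genes, n_modules = 20, 500, 5
modules = rng.laplace(size=(n_modules, n_genes))            # heavy-tailed gene weights
true_activations = rng.normal(size=(n_samples, n_modules))  # hidden module activity
noise = 0.01 * rng.normal(size=(n_samples, n_genes))
expression = true_activations @ modules + noise

features = project_to_module_space(expression, modules)
print(features.shape)  # (20, 5)
```

Each sample is then described by 5 values instead of 500, which is the kind of dimensionality reduction the abstract argues helps when training sets are small.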