The effect of non-linear signal in classification problems using gene expression

Cited by: 4
Authors
Heil, Benjamin [1]
Crawford, Jake [1]
Greene, Casey [2, 3]
Affiliations
[1] Univ Penn, Perelman Sch Med, Genom & Computat Biol Grad Grp, Philadelphia, PA USA
[2] Univ Colorado, Dept Pharmacol, Sch Med, Boulder, CO 80309 USA
[3] Univ Colorado, Dept Biochem & Mol Genet, Sch Med, Boulder, CO 80309 USA
Funding
National Institutes of Health (NIH);
Keywords
Gene expression - High dimensionality - Linear signals - Logistic regression - Multi-layer neural networks - Neural networks - Non-linear - Non-linear modelling - Predictive models - Transcriptomics;
DOI
10.1371/journal.pcbi.1010984
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Author summary: If we could consistently predict biological conditions from mRNA levels, it could help discover biomarkers for disease diagnosis. Deep learning has become widely used for many tasks, including biomarker discovery, but it is unclear whether the complexity of these models is helpful. We evaluate whether more complex non-linear models have an advantage over simpler linear ones for a set of prediction tasks. We find that, at least for predicting tissue and metadata-derived sex, linear models perform just as well as non-linear ones. However, we also demonstrate the presence of a predictive signal in the data that only the non-linear models can use. Our results suggest that the non-linear signals may be redundant with linear ones or that current deep neural networks are not able to successfully use the signal when linear signals are present.
Those building predictive models from transcriptomic data are faced with two conflicting perspectives. The first, based on the inherent high dimensionality of biological systems, supposes that complex non-linear models such as neural networks will better match complex biological systems. The second, imagining that complex systems will still be well predicted by simple dividing lines, prefers linear models that are easier to interpret. We compare multi-layer neural networks and logistic regression across multiple prediction tasks on GTEx and Recount3 datasets and find evidence in favor of both possibilities. We verified the presence of non-linear signal when predicting tissue and metadata sex labels from expression data by removing the predictive linear signal with Limma, and showed that this removal ablated the performance of linear methods but not non-linear ones. However, we also found that the presence of non-linear signal was not necessarily sufficient for neural networks to outperform logistic regression. Our results demonstrate that while multi-layer neural networks may be useful for making predictions from gene expression data, including a linear baseline model is critical because, while biological systems are high-dimensional, effective dividing lines for predictive models may not be.
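The signal-removal experiment described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' pipeline: the three synthetic "genes", the per-class mean subtraction (a simplified stand-in for the Limma step), and the k-nearest-neighbour classifier (a stand-in for a multi-layer neural network) are all hypothetical choices made to keep the example dependency-free.

```python
import math
import random

def make_data(n, rng):
    """Simulate n samples with 3 synthetic genes (hypothetical toy data)."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Gene 0: class means differ, so it carries linear signal.
        g0 = (1.5 if label else -1.5) + rng.gauss(0, 1)
        # Genes 1 and 2: same sign in class 1, opposite sign in class 0.
        # Each gene alone has the same mean in both classes (XOR-style signal).
        s = rng.choice([-1, 1])
        g1 = s * (0.5 + abs(rng.gauss(0, 1)))
        g2 = (s if label else -s) * (0.5 + abs(rng.gauss(0, 1)))
        X.append([g0, g1, g2])
        y.append(label)
    return X, y

def remove_linear_signal(X, y):
    """Subtract each gene's per-class mean (assumed stand-in for Limma)."""
    n_genes = len(X[0])
    means = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_genes)]
    return [[x[j] - means[label][j] for j in range(n_genes)]
            for x, label in zip(X, y)]

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Logistic regression trained with full-batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1 / (1 + math.exp(-z)) - label
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training samples (non-linear model)."""
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - c) ** 2 for a, c in zip(X_train[i], x)))
    return 1 if 2 * sum(y_train[i] for i in order[:k]) > k else 0

def evaluate(X, y, split=800):
    """Return (linear accuracy, k-NN accuracy) on the held-out samples."""
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    w, b = fit_logistic(X_tr, y_tr)
    lin = [1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in X_te]
    knn = [knn_predict(X_tr, y_tr, x) for x in X_te]
    acc = lambda pred: sum(p == t for p, t in zip(pred, y_te)) / len(y_te)
    return acc(lin), acc(knn)

rng = random.Random(0)
X, y = make_data(1200, rng)
lin_before, knn_before = evaluate(X, y)
lin_after, knn_after = evaluate(remove_linear_signal(X, y), y)
print(f"before removal: linear={lin_before:.2f}, non-linear={knn_before:.2f}")
print(f"after removal:  linear={lin_after:.2f}, non-linear={knn_after:.2f}")
```

In this toy setting, the linear model should fall to roughly chance once the class means are removed, while the non-linear model can still use the sign interaction between genes 1 and 2. Real expression data has thousands of correlated genes, so the sketch only conveys the logic of the experiment, not its scale or results.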
Pages: 12