Biomarker discovery from high-throughput data by connected network-constrained support vector machine

Cited by: 4
Authors
Li, Lingyu [1]
Liu, Zhi-Ping [1]
Affiliation
[1] Shandong University, School of Control Science and Engineering, Jinan 250061, Shandong, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Network-constrained support vector machine; Biomarker discovery; Connectivity; Feature selection; High-throughput data; Breast cancer; NONCONCAVE PENALIZED LIKELIHOOD; VARIABLE SELECTION; GENE-EXPRESSION; R-PACKAGE; CLASSIFICATION; REGRESSION; NUMBER; LASSO;
DOI
10.1016/j.eswa.2023.120179
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
From a systems biology perspective, genes usually work collaboratively in the form of a network; for example, cancer-related genes participate in an integrative dysfunctional pathway. Feature gene selection that accounts for the graph or network structure therefore plays a crucial role in cancer biomarker discovery from high-throughput omics data. The network-based paradigm demonstrates that integrating gene expression data with gene networks can improve classification performance and generate more interpretable feature subsets. In this paper, we propose an embedded connected network-constrained support vector machine (CNet-SVM) method that keeps the selected features in an inherent graph structure when discovering biomarker genes. First, we mathematically formulate the CNet-SVM model as a convex optimization problem constrained by network connectivity inequalities and theoretically investigate the behavior of all tuning parameters to provide search guidance on the regularization path. Second, to check whether the genes selected by CNet-SVM can be studied as network-structured biomarkers, we conduct experiments on several simulation datasets and real-world breast cancer (BRCA) datasets to validate its classification and prediction capabilities. The results show that CNet-SVM not only maintains sparsity and smoothness, but also respects the connectivity constraints between genes when selecting features on a prior gene-gene interaction network from omics data. In particular, CNet-SVM identifies 32 BRCA biomarker genes, which form a connected network component and can potentially be used for BRCA diagnosis. Furthermore, comparisons with eight feature selection-empowered SVM methods demonstrate that the easily interpretable networked feature genes discovered by CNet-SVM are more closely related to BRCA dysfunctions. Finally, we validate that the identified biomarkers achieve high prediction accuracy on external independent cohorts. Together, these results indicate that the proposed CNet-SVM method is effective in selecting connected-network-structured features and offers an improved alternative to current SVM models for biomarker identification from high-throughput data. The data and code are available at https://github.com/zpliulab/CNet-SVM.
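The abstract does not give the CNet-SVM formulation, so the following is only an illustrative sketch of the three ingredients it names, under stated assumptions: sparsity is modeled as an L1 penalty, smoothness as a graph-Laplacian quadratic penalty on a linear SVM with hinge loss, and connectivity is merely checked after the fact on the selected genes rather than enforced through the inequality constraints the paper describes. All function names and the parameters `lam1` and `lam2` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def graph_laplacian(adj):
    """Unnormalized Laplacian L = D - A of an undirected gene network."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def penalized_objective(w, b, X, y, lap, lam1, lam2):
    """Mean hinge loss + lam1 * ||w||_1 + lam2 * w^T L w (assumed penalty form)."""
    margins = y * (X @ w + b)                      # labels y are in {-1, +1}
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    sparsity = lam1 * np.abs(w).sum()              # encourages few selected genes
    smoothness = lam2 * (w @ lap @ w)              # neighbors get similar weights
    return hinge + sparsity + smoothness

def is_connected(adj, selected):
    """Return True if the selected nodes induce a connected subgraph (DFS)."""
    adj = np.asarray(adj)
    selected = [int(i) for i in selected]
    if not selected:
        return False
    allowed = set(selected)
    seen = {selected[0]}
    stack = [selected[0]]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(adj[u]):
            v = int(v)
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == allowed
```

For a fitted weight vector `w`, the selected genes would be `np.flatnonzero(w)`, and `is_connected(adj, np.flatnonzero(w))` reports whether they form one connected component of the prior network, which is the property the abstract attributes to the 32 identified BRCA genes.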
Pages: 12