Identifying Individual-Cancer-Related Genes by Rebalancing the Training Samples

Cited by: 15
Authors
Chen, Bolin [1]
Shang, Xuequn [1]
Li, Min [2]
Wang, Jianxin [2]
Wu, Fang-Xiang [3, 4]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[2] Cent South Univ, Sch Informat Sci & Engn, Changsha, Hunan, Peoples R China
[3] Nankai Univ, Sch Math Sci, Tianjin 300071, Peoples R China
[4] Univ Saskatchewan, Coll Engn, Saskatoon, SK S7N 5A9, Canada
Funding
National Natural Science Foundation of China; Natural Sciences and Engineering Research Council of Canada;
Keywords
Cancer-related gene; imbalanced classification; logistic regression; resampling method; DISEASE GENES; IDENTIFICATION; INTERACTOME; KNOWLEDGE; INFERENCE; NETWORK;
DOI
10.1109/TNB.2016.2553119
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
The identification of individual-cancer-related genes is typically an imbalanced classification problem. The number of known cancer-related genes is far smaller than the number of unknown genes, which makes it very hard to detect novel predictions from such imbalanced training samples. A regular machine learning method can either only detect genes related to all cancers or must add clinical knowledge to circumvent this issue. In this study, we introduce a training sample rebalancing strategy that overcomes this issue by using a two-step logistic regression and a random resampling method. The two-step logistic regression selects a set of genes that are related to all cancers, while the random resampling method further classifies those genes as associated with individual cancers. The imbalance is circumvented by first randomly adding positive instances related to other cancers, and then excluding unrelated predictions according to the overall performance in the following step. Numerical experiments show that the proposed resampling method is able to identify genes related to an individual cancer even when the number of known genes related to that cancer is small. The final predictions for all individual cancers achieve AUC values around 0.93 under leave-one-out cross validation, which is very promising compared with existing methods.
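The rebalancing idea in the abstract, padding a small positive set with randomly drawn genes known for other cancers, might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `rebalance_positives`, its parameters, and the gene lists are hypothetical, and the later step that excludes unrelated predictions by overall performance is not shown.

```python
import random


def rebalance_positives(target_genes, other_cancer_genes, n_positives, rng):
    """Pad the known positives of one cancer with genes borrowed from other cancers.

    Hypothetical helper: returns (known, borrowed), where `borrowed` is a random
    sample of other-cancer genes that brings the positive set up to
    `n_positives` items (or as close as the pool allows).
    """
    known = sorted(set(target_genes))
    # Candidate pool: genes linked to other cancers, excluding known positives.
    pool = sorted(set(other_cancer_genes) - set(known))
    n_extra = max(0, min(n_positives - len(known), len(pool)))
    borrowed = rng.sample(pool, n_extra)
    return known, borrowed


# Illustrative input only; these gene lists are assumptions, not data from the paper.
rng = random.Random(0)
target = ["TP53", "BRCA1"]
others = ["KRAS", "EGFR", "MYC", "PTEN", "TP53"]

known, borrowed = rebalance_positives(target, others, n_positives=5, rng=rng)
positives = known + borrowed
print(len(known), "known positives,", len(borrowed), "borrowed from other cancers")
```

Repeating the draw with different random seeds gives several rebalanced training sets, so that borrowed genes that turn out to be unrelated to the target cancer can be filtered out in a later step.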
Pages: 309-315
Number of pages: 7
Related papers
30 in total
  • [1] Bishop C.M., 2006, PATTERN RECOGN, V4, P738, DOI 10.1117/1.2819119
  • [2] Bolin Chen, 2014, 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), P197, DOI 10.1109/BIBM.2014.6999153
  • [3] Boyd S., 2004, Convex optimization, DOI 10.1017/CBO9780511804441
  • [4] Chen BL, 2015, IEEE INT C BIOINFORM, P195, DOI 10.1109/BIBM.2015.7359680
  • [5] A fast and high performance multiple data integration algorithm for identifying human disease genes
    Chen, Bolin
    Li, Min
    Wang, Jianxin
    Shang, Xuequn
    Wu, Fang-Xiang
    [J]. BMC MEDICAL GENOMICS, 2015, 8
  • [6] Chen BL, 2013, IEEE INT C BIOINFORM
  • [7] Disease gene identification by using graph kernels and Markov random fields
    Chen BoLin
    Li Min
    Wang JianXin
    Wu FangXiang
    [J]. SCIENCE CHINA-LIFE SCIENCES, 2014, 57 (11) : 1054 - 1063
  • [8] Identifying disease genes by integrating multiple data sources
    Chen, Bolin
    Wang, Jianxin
    Li, Min
    Wu, Fang-Xiang
    [J]. BMC MEDICAL GENOMICS, 2014, 7
  • [9] Chen YT, 2011, PLOS ONE, V6, DOI [10.1371/journal.pone.0023237, 10.1371/journal.pone.0017876]
  • [10] The human disease network
    Goh, Kwang-Il
    Cusick, Michael E.
    Valle, David
    Childs, Barton
    Vidal, Marc
    Barabasi, Albert-Laszlo
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2007, 104 (21) : 8685 - 8690