Identifying Individual-Cancer-Related Genes by Rebalancing the Training Samples

Cited by: 15

Authors:

Chen, Bolin [1]

Shang, Xuequn [1]

Li, Min [2]

Wang, Jianxin [2]

Wu, Fang-Xiang [3,4]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China

[2] Cent South Univ, Sch Informat Sci & Engn, Changsha, Hunan, Peoples R China

[3] Nankai Univ, Sch Math Sci, Tianjin 300071, Peoples R China

[4] Univ Saskatchewan, Coll Engn, Saskatoon, SK S7N 5A9, Canada

Source:

IEEE TRANSACTIONS ON NANOBIOSCIENCE | 2016, Vol. 15, No. 4

Funding:

National Natural Science Foundation of China; Natural Sciences and Engineering Research Council of Canada;

Keywords:

Cancer-related gene; imbalanced classification; logistic regression; resampling method; DISEASE GENES; IDENTIFICATION; INTERACTOME; KNOWLEDGE; INFERENCE; NETWORK;

DOI:

10.1109/TNB.2016.2553119

Chinese Library Classification:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Identifying individual-cancer-related genes is typically an imbalanced classification problem. The number of known cancer-related genes is far smaller than the number of unknown genes, which makes it very hard to obtain novel predictions from such imbalanced training samples. A regular machine learning method can either detect only genes related to all cancers or must add clinical knowledge to circumvent this issue. In this study, we introduce a training sample rebalancing strategy that overcomes this issue by using a two-step logistic regression and a random resampling method. The two-step logistic regression selects a set of genes that are related to all cancers, while the random resampling method further classifies those genes by their association with individual cancers. The imbalance is circumvented by first randomly adding positive instances related to other cancers, and then excluding unrelated predictions according to the overall performance in the following step. Numerical experiments show that the proposed resampling method is able to identify cancer-related genes even when the number of known genes related to a given cancer is small. The final predictions for all individual cancers achieve AUC values around 0.93 under leave-one-out cross-validation, which is very promising compared with existing methods.
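
The rebalancing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors are synthetic, the helper names (`rebalance_positives`, `fit_logistic`, `predict_proba`) and all parameter values are hypothetical, and only the first step (randomly adding positives related to other cancers before fitting a logistic regression) is shown. The later exclusion step based on overall performance is omitted.

```python
import math
import random


def rebalance_positives(target_pos, other_pos, n_extra, rng):
    """Return the target positives plus n_extra randomly drawn other-cancer positives."""
    extra = rng.sample(other_pos, min(n_extra, len(other_pos)))
    return list(target_pos) + extra


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    """Predicted probability that feature vector x belongs to the positive class."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


rng = random.Random(0)
# Synthetic 2-D features (an assumption; real features would come from gene data).
target_pos = [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(3)]     # few known genes
other_pos = [[rng.gauss(1.5, 0.5), rng.gauss(1.5, 0.5)] for _ in range(40)]  # other cancers
unlabeled = [[rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)] for _ in range(30)]    # treated as negatives

# Rebalance: 3 target positives + 27 borrowed positives against 30 negatives.
positives = rebalance_positives(target_pos, other_pos, n_extra=27, rng=rng)
X = positives + unlabeled
y = [1] * len(positives) + [0] * len(unlabeled)
model = fit_logistic(X, y)
```

Borrowing positives from other cancers gives the classifier enough positive examples to fit a decision boundary; in the approach described above, predictions unrelated to the target cancer would then be filtered out in a subsequent step.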


Pages: 309-315

Number of pages: 7
