Network-based features enable prediction of essential genes across diverse organisms

Cited by: 28

Authors:

Azhagesan, Karthik ^{[1,3,4]}

Ravindran, Balaraman ^{[2,3,4]}

Raman, Karthik ^{[1,3,4]}

Affiliations:

[1] Indian Inst Technol IIT Madras, Bhupat & Jyoti Mehta Sch Biosci, Dept Biotechnol, Madras 600036, Tamil Nadu, India

[2] IIT Madras, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India

[3] IIT Madras, IBSE, Madras 600036, Tamil Nadu, India

[4] IIT Madras, RBCDSAI, Madras 600036, Tamil Nadu, India

Source:

PLOS ONE | 2018 / Vol. 13 / Issue 12

Keywords:

CENTRALITY; DATABASE; UPDATE;

DOI:

10.1371/journal.pone.0208722

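Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract: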

Machine learning approaches to predict essential genes have gained a lot of traction in recent years. These approaches predominantly make use of sequence and network-based features to predict essential genes. However, the scope of network-based features used by the existing approaches is very narrow. Further, many of these studies focus on predicting essential genes within the same organism, which cannot be readily used to predict essential genes across organisms. Therefore, there is clearly a need for a method that is able to predict essential genes across organisms, by leveraging network-based features. In this study, we extract several sets of network-based features from protein-protein association networks available from the STRING database. Our network features include some common measures of centrality, and also some novel recursive measures recently proposed in social network literature. We extract hundreds of network-based features from networks of 27 diverse organisms to predict the essentiality of 87000+ genes. Our results show that network-based features are statistically significantly better at classifying essential genes across diverse bacterial species, compared to the current state-of-the-art methods, which use mostly sequence and a few 'conventional' network-based features. Our diverse set of network properties gave an AUROC of 0.847 and a precision of 0.320 across 27 organisms. When we augmented the complete set of network features with sequence-derived features, we achieved an improved AUROC of 0.857 and a precision of 0.335. We also constructed a reduced set of 100 sequence and network features, which gave a comparable performance. Further, we show that our features are useful for predicting essential genes in new organisms by using leave-one-species-out validation. 
Our network features capture the local, global and neighbourhood properties of the network and are hence effective for prediction of essential genes across diverse organisms, even in the absence of other complex biological knowledge. Our approach can be readily exploited to predict essentiality for organisms in interactome databases such as STRING, where both network and sequence data are readily available. All code is available at https://github.com/RamanLab/nbfpeg.
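The abstract names two ideas that a small example can make concrete: per-gene features computed from a protein association network, and leave-one-species-out validation. The sketch below is illustrative only and is not the authors' implementation (their code is at the repository linked above). It assumes an undirected, unweighted edge list, computes just two conventional features (degree centrality and the local clustering coefficient), and uses hypothetical function names and toy identifiers throughout.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    if n < 2:
        return {node: 0.0 for node in adj}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Fraction of each node's neighbour pairs that are themselves connected."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs[node] = 2 * links / (k * (k - 1))
    return coeffs


def leave_one_species_out(species_labels):
    """Yield (held_out, train_indices, test_indices), holding out one species per fold."""
    for held_out in sorted(set(species_labels)):
        train = [i for i, s in enumerate(species_labels) if s != held_out]
        test = [i for i, s in enumerate(species_labels) if s == held_out]
        yield held_out, train, test


# Toy network with hypothetical gene identifiers.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_adj = build_adjacency(toy_edges)
toy_degree = degree_centrality(toy_adj)
toy_clustering = clustering_coefficient(toy_adj)

# Toy species labels, one per gene row in a hypothetical feature table.
toy_folds = list(leave_one_species_out(["eco", "eco", "bsu"]))
```

In a fuller pipeline, each fold would train a classifier on the genes of all remaining species and score it on the held-out species, which is what makes the evaluation a test of transfer to a new organism rather than of within-organism prediction.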

Pages: 13
