Classification of Chromosomal DNA Sequences Using Hybrid Deep Learning Architectures

Cited by: 10
Authors
Du, Zhihua [1]
Xiao, Xiangdong [1]
Uversky, Vladimir N. [2, 3, 4]
Affiliations
[1] Shenzhen Univ, Guangdong Lab Artificial Intelligence & Digital E, Shenzhen, Peoples R China
[2] Univ S Florida, Morsani Coll Med, Dept Mol Med, 12901 Bruce B Downs Blvd MDC07, Tampa, FL 33620 USA
[3] Univ S Florida, Morsani Coll Med, USF Hlth Byrd Alzheimer Res Inst, 12901 Bruce B Downs Blvd MDC07, Tampa, FL 33620 USA
[4] Russian Acad Sci, Inst Biol Instrumentat, Lab New Methods Biol, Inst Skaya Str 7, Pushchino 142290, Moscow Region, Russia
Keywords
Convolutional neural network (CNN); long short-term memory network (LSTM); DNA sequence classification; eukaryotes; chromosomes; hybrid; PROTEASE CLEAVAGE SITES; PREDICTION; GENOME
DOI
10.2174/1574893615666200224095531
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Chromosomal DNA contains most of the genetic information of eukaryotes and plays an important role in the growth, development, and reproduction of living organisms. Most chromosomal DNA sequences are known to wrap around histones, and distinguishing these sequences from ordinary DNA sequences is important for understanding the genetic code of life. The main difficulty in this problem is feature selection. DNA sequences have no explicit features, and common representation methods, such as one-hot encoding, have the major drawback of high dimensionality. Recently, deep learning models have been shown to extract useful features from input patterns automatically.
Objective: We aim to investigate which deep learning networks can achieve notable improvements in DNA sequence classification using only sequence information.
Methods: We present four deep learning architectures based on convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) for chromosomal DNA sequence classification. The natural language model Word2vec was used to generate word embeddings of the sequences, from which the deep learning models then learned features.
Results: The four architectures were compared on 10 chromosomal DNA datasets. The results show that the architecture combining convolutional neural networks with long short-term memory networks is superior to the other methods with regard to the accuracy of chromosomal DNA prediction.
Conclusion: In this study, four deep learning models were compared for automatic classification of chromosomal DNA sequences without any sequence preprocessing steps. In particular, we treated DNA sequences as natural language and extracted word embeddings with Word2vec to represent them. The results show the superiority of the CNN+LSTM model in the ten classification tasks. The reason for this success is that the CNN module captures the regulatory motifs, while the subsequent LSTM layer captures the long-term dependencies between them.
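The idea of treating a DNA sequence as natural language can be illustrated with a minimal sketch. The abstract does not say how sequences are split into "words", so the overlapping k-mer tokenization below, the value k = 3, and the embedding dimension of 50 are all assumptions chosen for illustration, and the helper names (`kmer_tokens`, `kmer_vocabulary`, `one_hot_size`, `embedded_size`) are hypothetical. The sketch also compares the size of a one-hot representation over k-mer tokens with that of a dense embedding, which is the dimensionality concern the Background raises.

```python
from itertools import product


def kmer_tokens(seq, k=3, stride=1):
    """Split a DNA sequence into k-mer 'words' (overlapping when stride < k).

    Assumption: the abstract does not specify k or the stride.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_vocabulary(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): idx for idx, p in enumerate(product(alphabet, repeat=k))}


def one_hot_size(seq_len, k=3, alphabet_size=4):
    """Entries in a one-hot matrix with one row per k-mer token (stride 1)."""
    n_tokens = max(0, seq_len - k + 1)
    return n_tokens * alphabet_size ** k


def embedded_size(seq_len, k=3, dim=50):
    """Entries in a dense embedding matrix with one row per k-mer token.

    Assumption: dim=50 is illustrative, not a value taken from the paper.
    """
    n_tokens = max(0, seq_len - k + 1)
    return n_tokens * dim


# Tokenize a short example sequence and convert the tokens to integer ids,
# which is the form an embedding layer (or a Word2vec lookup) would consume.
tokens = kmer_tokens("ATGCGTAC", k=3)
vocab = kmer_vocabulary(k=3)
token_ids = [vocab[t] for t in tokens]
print(tokens)
print(token_ids)
```

The one-hot width grows as 4^k with the k-mer length, while the embedding width stays fixed at the chosen dimension, so the gap widens quickly for larger k. In a full pipeline, the token lists would be passed to a Word2vec implementation to learn the embeddings, and the resulting vectors would feed the CNN and LSTM layers; neither step is shown here.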
Pages: 1130-1136
Number of pages: 7