An introduction to deep learning on biological sequence data: examples and solutions

Cited by: 107

Authors:

Jurtz, Vanessa Isabell [1]

Johansen, Alexander Rosenberg [2]

Nielsen, Morten [1,3]

Armenteros, Jose Juan Almagro [1]

Nielsen, Henrik [1]

Sonderby, Casper Kaae [4]

Winther, Ole [2,4]

Sonderby, Soren Kaae [4]

Affiliations:

[1] Tech Univ Denmark, Dept Bio & Hlth Informat, Lyngby, Denmark

[2] Tech Univ Denmark, Dept Appl Math & Comp Sci, Lyngby, Denmark

[3] Univ Nacl San Martin, Inst Invest Biotecnol, Buenos Aires, DF, Argentina

[4] Univ Copenhagen, Dept Biol, Copenhagen, Denmark

Source:

Bioinformatics | 2017 / Vol. 33 / Issue 22

Funding:

US National Institutes of Health (NIH);

Keywords:

PROTEIN SECONDARY STRUCTURE; PREDICTION; SEGMENTATION;

DOI:

10.1093/bioinformatics/btx531

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular as machine learning tools in recent years. This development is driven by the availability of greater computational resources, more data, new algorithms for training deep models, and easy-to-use libraries for implementing and training neural networks. Deep learning has been especially successful in image recognition, and the development of tools, applications and code examples is in most cases centered within this field rather than within biology. Here, we aim to further the development of deep learning methods within biology by providing application examples and code templates that are ready to apply and adapt. Given such examples, we illustrate how architectures consisting of convolutional and long short-term memory neural networks can relatively easily be designed and trained to state-of-the-art performance on three biological sequence problems: prediction of subcellular localization, protein secondary structure and the binding of peptides to MHC Class II molecules.

Availability and implementation: All implementations and datasets are available online to the scientific community at https://github.com/vanessajurtz/lasagne4bio.

Contact: skaaesonderby@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
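As a rough illustration of the kind of input that convolutional and long short-term memory networks take for protein sequence problems, the sketch below one-hot encodes an amino-acid sequence and pads it to a fixed length with a mask. This is not taken from the authors' repository: the abstract does not say how inputs are encoded, so one-hot encoding is an assumption here, and the names `AMINO_ACIDS` and `one_hot_encode` are hypothetical.

```python
# Hypothetical helper, not the authors' code: one-hot encoding is assumed
# here as a common input representation for protein sequences.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode one protein sequence as a max_len x 20 matrix plus a mask.

    Sequences longer than max_len are truncated. Unknown residues
    (e.g. 'X') become all-zero rows. The mask holds 1 for real
    positions and 0 for padding, so a recurrent or convolutional
    layer can ignore the padded tail.
    """
    sequence = sequence.upper()[:max_len]
    encoding = []
    mask = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            row[index] = 1.0
        encoding.append(row)
        mask.append(1)
    # Pad with all-zero rows up to the fixed length.
    padding = max_len - len(sequence)
    encoding.extend([0.0] * len(AMINO_ACIDS) for _ in range(padding))
    mask.extend([0] * padding)
    return encoding, mask


if __name__ == "__main__":
    matrix, mask = one_hot_encode("MKV", max_len=5)
    print(len(matrix), len(matrix[0]), mask)
```

A batch of such matrices, stacked together with their masks, is the typical shape of input that a sequence model consumes; the fixed `max_len` is a design choice that trades memory for the ability to batch sequences of different lengths.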

Pages: 3685-3690

Page count: 6
