GenomeTools: A Comprehensive Software Library for Efficient Processing of Structured Genome Annotations

Cited by: 299
Authors
Gremme, Gordon [1]
Steinbiss, Sascha [1]
Kurtz, Stefan [1]
Affiliations
[1] Univ Hamburg, Ctr Bioinformat, D-20146 Hamburg, Germany
Keywords
Scientific computing; biology and genetics; software engineering; reusable libraries; programming environments; TOOL; BIOINFORMATICS; FORMAT
DOI
10.1109/TCBB.2013.68
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Genome annotations are often published as plain text files describing genomic features and their subcomponents by an implicit annotation graph. In this paper, we present the GenomeTools, a convenient and efficient software library and associated software tools for developing bioinformatics software intended to create, process or convert annotation graphs. The GenomeTools strictly follow the annotation graph approach, offering a unified graph-based representation. This gives the developer intuitive and immediate access to genomic features and tools for their manipulation. To process large annotation sets with low memory overhead, we have designed and implemented an efficient pull-based approach for sequential processing of annotations. This makes it possible to handle even the largest annotation sets, such as a complete catalogue of human variations. Our object-oriented C-based software library enables developers to conveniently implement their own functionality on annotation graphs and to integrate it into larger workflows, simultaneously accessing compressed sequence data if required. The careful C implementation of the GenomeTools not only ensures a lightweight memory footprint while allowing full sequential as well as random access to the annotation graph, but also facilitates the creation of bindings to a variety of scripting languages (such as Python and Ruby) sharing the same interface.
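The pull-based sequential processing mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the GenomeTools API: the names `Feature`, `read_features`, `only_type`, `at_least` and `long_exons`, as well as the sample GFF3 text, are hypothetical and chosen only for illustration. The sketch assumes tab-separated GFF3 input and uses chained generators, so each stage requests one feature at a time from the stage before it and the full annotation set never has to be held in memory.

```python
import io
from typing import Dict, Iterable, Iterator, List, NamedTuple, TextIO


class Feature(NamedTuple):
    """Hypothetical, simplified feature record (not a GenomeTools type)."""
    seqid: str
    type: str
    start: int
    end: int
    attributes: Dict[str, str]


def read_features(handle: TextIO) -> Iterator[Feature]:
    """Source stage: parse one GFF3 line only when a consumer asks for it."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        cols = line.split("\t")
        attributes = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        yield Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), attributes)


def only_type(source: Iterable[Feature], wanted: str) -> Iterator[Feature]:
    """Filter stage: pull from the upstream stage, pass on matching types."""
    for feature in source:
        if feature.type == wanted:
            yield feature


def at_least(source: Iterable[Feature], min_length: int) -> Iterator[Feature]:
    """Filter stage: keep features spanning at least min_length bases."""
    for feature in source:
        # GFF3 coordinates are 1-based and inclusive
        if feature.end - feature.start + 1 >= min_length:
            yield feature


# Made-up sample data; Parent attributes encode the implicit annotation graph.
SAMPLE_GFF3 = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "chr1\t.\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "chr1\t.\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1\n"
    "chr1\t.\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=mrna1\n"
)


def long_exons(text: str, min_length: int = 250) -> List[Feature]:
    """Chain the stages; the final consumer drives the whole pipeline."""
    stream = at_least(
        only_type(read_features(io.StringIO(text)), "exon"), min_length
    )
    return list(stream)
```

Calling `long_exons(SAMPLE_GFF3)` drains the pipeline and returns the exons of at least 250 bases. In a real workflow the final consumer would typically write each feature out as it arrives rather than collect a list, which is what keeps memory use low.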
Pages: 645-656
Page count: 12