A fast and automated solution for accurately resolving protein domain architectures

Cited by: 26
Authors
Yeats, Corin [1 ]
Redfern, Oliver C. [1 ]
Orengo, Christine [1 ]
Affiliations
[1] UCL, Dept Struct & Mol Biol, London WC1E 6BT, England
Funding
US National Institutes of Health;
Keywords
DATABASE; RECOGNITION; SEQUENCE; FAMILIES; GENOMES;
DOI
10.1093/bioinformatics/btq034
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Accurate prediction of the domain content and arrangement in multi-domain proteins (which make up >65% of the large-scale protein databases) provides a valuable tool for function prediction, comparative genomics and studies of molecular evolution. However, scanning a multi-domain protein against a database of domain sequence profiles often produces conflicting and overlapping matches. We have developed a novel method that employs heaviest weighted clique-finding (HCF), which we show significantly outperforms standard published approaches based on successively assigning the best non-overlapping match (Best Match Cascade, BMC). Results: We created a benchmark data set of structural domain assignments in the CATH database and a corresponding set of Hidden Markov Model-based domain predictions. Using these, we demonstrate that by considering all possible combinations of matches using the HCF approach, we achieve much higher prediction accuracy than the standard BMC method. We also show that it is essential to allow overlapping domain matches to a query in order to identify correct domain assignments. Furthermore, we introduce a straightforward and effective protocol for resolving any overlapping assignments and producing a single set of non-overlapping predicted domains.
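The contrast between the two strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each match is a hypothetical `(start, end, score)` tuple, that two matches are compatible when they share at most `max_overlap` residues (a parameter invented here for illustration), and that a clique's weight is the sum of its match scores. The clique search is a brute-force enumeration, suitable only for a handful of matches.

```python
from itertools import combinations

def overlap_len(a, b):
    """Number of residues shared by two (start, end, score) matches."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def compatible(a, b, max_overlap=0):
    """Assumed compatibility rule: matches may share at most max_overlap residues."""
    return overlap_len(a, b) <= max_overlap

def best_match_cascade(matches, max_overlap=0):
    """Greedy BMC-style pass: accept matches in descending score order
    whenever they are compatible with everything already accepted."""
    chosen = []
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        if all(compatible(m, c, max_overlap) for c in chosen):
            chosen.append(m)
    return sorted(chosen)

def heaviest_clique(matches, max_overlap=0):
    """Brute-force HCF-style search: among all sets of mutually compatible
    matches (cliques in the compatibility graph), return the one with the
    highest summed score. Exponential in the number of matches."""
    best, best_score = [], 0.0
    for r in range(1, len(matches) + 1):
        for subset in combinations(matches, r):
            if all(compatible(a, b, max_overlap)
                   for a, b in combinations(subset, 2)):
                score = sum(m[2] for m in subset)
                if score > best_score:
                    best, best_score = list(subset), score
    return sorted(best)

# Toy input: one long high-scoring match that conflicts with two shorter ones.
matches = [(1, 100, 50.0), (1, 60, 40.0), (61, 120, 45.0)]
print(best_match_cascade(matches))  # [(1, 100, 50.0)]
print(heaviest_clique(matches))     # [(1, 60, 40.0), (61, 120, 45.0)]
```

In this toy case the greedy pass commits to the single best match and then rejects both remaining matches, while the exhaustive search finds the two-domain combination with the higher total score. Raising `max_overlap` above zero loosely mirrors the abstract's point that some overlap between matches has to be tolerated.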
Pages: 745-751
Number of pages: 7