Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

Cited by: 7

Authors:

Ma, QC [1]

Chirn, GW [1]

Cai, R [1]

Szustakowski, JD [1]

Nirmala, NR [1]

Affiliations:

[1] Novartis Inst Biomed Res Inc, Genome & Proteome Sci, Biomed Comp, Cambridge, MA 02139 USA

Source:

BMC Bioinformatics | 2005 / Volume 6 / Issue 1

Keywords:

DOI:

10.1186/1471-2105-6-242

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: The sequencing of the human genome has given us access to a comprehensive list of genes (both experimentally verified and predicted) for further analysis. While the majority of the approximately 30,000 known and predicted human coding genes have been characterized and assigned at least one function, about 12,000 genes remain unannotated. The recent sequencing of other genomes has provided a large amount of auxiliary sequence data that could help characterize the human genes. Clustering these sequences into families is one of the first steps in comparative studies across several genomes.

Results: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster the sequences of experimentally verified and predicted proteins from all sequenced genomes. It uses a novel distance metric: a neural network score computed for a pair of protein sequences from their pairwise sequence similarity score and the similarity between their domain structures. The metric is the probability that the two sequences belong to the same InterPro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. Hierarchical average clustering is applied with the new distance metric.

Conclusion: Benchmarking studies of our algorithm against those reported in the literature show that it produces clustering results with lower false positive and false negative rates. The algorithm has been applied to cluster several eukaryotic genomes and several dozen prokaryotic genomes.
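The clustering step described in the abstract can be illustrated with a minimal sketch of average-linkage hierarchical clustering over pairwise same-family probabilities. This is not the CLUGEN implementation: the probabilities below are invented for illustration rather than produced by a neural network, and the conversion `distance = 1 - probability`, the `cutoff` parameter, and the function name `average_linkage` are all assumptions.

```python
from itertools import combinations

def average_linkage(names, prob, cutoff=0.5):
    """Agglomerative clustering with average linkage.

    prob maps frozenset({a, b}) to an assumed probability that a and b
    belong to the same family; missing pairs are treated as probability 0.
    Merging stops once the closest pair of clusters is farther than cutoff.
    """
    def dist(a, b):
        # Assumption: distance is taken as 1 - probability.
        return 1.0 - prob.get(frozenset((a, b)), 0.0)

    clusters = [[n] for n in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            # Average linkage: mean distance over all cross-cluster pairs.
            d = sum(dist(a, b) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical scores: A-B and B-C are strong, A-C is weak, D is unrelated.
names = ["A", "B", "C", "D"]
prob = {
    frozenset(("A", "B")): 0.95,
    frozenset(("B", "C")): 0.90,
    frozenset(("A", "C")): 0.30,
    frozenset(("C", "D")): 0.05,
    frozenset(("A", "D")): 0.02,
    frozenset(("B", "D")): 0.03,
}
result = average_linkage(names, prob, cutoff=0.5)
```

In this toy input, A and C score poorly against each other but both score well against B, so they end up in the same cluster while D stays separate. That is the kind of transitive grouping the abstract refers to, shown here only under the assumptions stated above.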

Pages: 13

