Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence

Cited by: 41
Authors
Gerstein, M [1]
Affiliations
[1] Yale Univ, Dept Mol Biophys & Biochem, New Haven, CT 06520 USA
Keywords
DOI
10.1093/bioinformatics/14.8.707
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Transitive sequence matching expands the scope of sequence comparison by re-running the results of a given query against the databank as new queries. This sometimes results in the initial query sequence (Q) being related to a final match (M) indirectly, through a third, 'intermediate' sequence (Q --> I --> M). This approach has often been suggested as providing greater sensitivity in sequence comparison; however, it has not yet been possible to gauge its improvement precisely.
Results: Here, this improvement is comprehensively measured by seeing what fraction of the known structural relationships transitive sequence matching can uncover beyond that found by normal pairwise comparison (i.e. direct linkage). The structural relationships are taken from a well-characterized test set, the scop classification of protein structure. Specifically, 2055 known structural similarities (called 'pairs') between distantly related proteins constitute the basic test set. To measure transitive matching properly, special data sets, called 'baseline sets', are derived from this. They consist of pairs of sequences that have a clear structural relationship that cannot be found by normal sequence comparison (i.e. they cannot be directly linked). Specifically, using standard sequence comparison protocols (FASTA with an e-value cut-off of 0.001), it is found that the baseline set consists of 1742 pairs. A third intermediate sequence can link 86 of these indirectly (5%), where this third sequence is drawn from the entire, current universe of protein sequences. The number of false positives is minimal. Furthermore, when one considers only the relationships within the test set that correspond to a close structural alignment, the coverage increases considerably. In particular, 862 of the baseline set pairs fit to better than 2.6 Å RMS, and transitive matching can find 62 of these (9%).
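The Q --> I --> M linkage described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names (`direct_links`, `transitive_links`, `coverage`), the toy hit list, and the choice to treat a hit as a link when its e-value is at or below the cut-off are all assumptions made for illustration. Only the 0.001 cut-off and the counts 86 and 1742 come from the abstract.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # cut-off quoted in the abstract


def direct_links(hits, cutoff=E_VALUE_CUTOFF):
    """Build an undirected adjacency map from (seq_a, seq_b, e_value) hits.

    Assumption: a hit counts as a direct link when e_value <= cutoff.
    """
    links = defaultdict(set)
    for a, b, e_value in hits:
        if a != b and e_value <= cutoff:
            links[a].add(b)
            links[b].add(a)
    return links


def transitive_links(pairs, links):
    """Map each pair (q, m) that is not directly linked to its intermediates.

    An intermediate is any third sequence directly linked to both q and m.
    Pairs that are directly linked, or have no intermediate, are omitted.
    """
    found = {}
    for q, m in pairs:
        q_hits = links.get(q, set())
        m_hits = links.get(m, set())
        if m in q_hits:
            continue  # direct link: such a pair is outside the baseline set
        intermediates = (q_hits & m_hits) - {q, m}
        if intermediates:
            found[(q, m)] = sorted(intermediates)
    return found


def coverage(n_found, n_total):
    """Fraction of pairs recovered."""
    return n_found / n_total


# Hypothetical toy data: sequence identifiers and e-values are invented.
hits = [
    ("Q1", "I1", 1e-5),
    ("I1", "M1", 1e-4),
    ("Q2", "M2", 1e-10),  # direct link
    ("Q3", "I2", 0.5),    # too weak to count as a link
    ("I2", "M3", 1e-6),
]
pairs = [("Q1", "M1"), ("Q2", "M2"), ("Q3", "M3")]

found = transitive_links(pairs, direct_links(hits))
print(found)
print(f"{coverage(86, 1742):.1%}")  # counts quoted in the abstract
```

In this toy set only the pair (Q1, M1) is recovered through an intermediate; (Q2, M2) is already directly linked and (Q3, M3) has no sequence that links to both members below the cut-off.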
Pages: 707-714
Page count: 8