Performance comparison of gene family clustering methods with expert curated gene family data set in Arabidopsis thaliana
Cited by: 4
Authors:
Yang, Kuan [2,3]
Zhang, Liqing [1,3]
Affiliations:
[1] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
[2] Virginia Tech, Virginia Bioinformat Inst, Blacksburg, VA 24061 USA
[3] Virginia Tech, Program Genet Bioinformat & Computat Biol, Blacksburg, VA 24061 USA
With the exponential growth of genomics data, the demand for reliable clustering methods keeps increasing. Despite the wide use of many clustering algorithms, their accuracy has been evaluated mostly on simulated data sets and seldom on real biological data for which a "correct answer" is available. To address this issue, we use the manually curated, high-quality Arabidopsis thaliana gene family database as a "gold standard" to conduct a comprehensive comparison of the accuracies of four widely used clustering methods: K-means, TribeMCL, single-linkage clustering and complete-linkage clustering. We compare the results of running each clustering method on two matrices: the E-value matrix, computed from BLAST E-values, and the k-tuple distance matrix, computed from the difference in tuple frequencies. TribeMCL with the E-value matrix performed best, with the Inflation parameter (=1.15) tuned considerably lower than what has been suggested previously (=2). The single-linkage clustering method with the E-value matrix was second best. Single-linkage clustering, K-means clustering, complete-linkage clustering, and TribeMCL with a k-tuple distance matrix performed reasonably well. Complete-linkage clustering with the k-tuple distance matrix performed the worst.
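To make the k-tuple distance matrix and single-linkage clustering more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract only says the distance is based on the difference in tuple frequencies, so the sum of squared frequency differences used below is an assumed form, and the function names, the toy sequences, `k=2`, and the `0.05` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def ktuple_freqs(seq, k=2):
    """Relative frequency of each overlapping k-tuple in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {t: c / total for t, c in counts.items()}


def ktuple_distance(a, b, k=2):
    """Sum of squared differences in k-tuple frequencies (assumed form)."""
    fa, fb = ktuple_freqs(a, k), ktuple_freqs(b, k)
    return sum((fa.get(t, 0.0) - fb.get(t, 0.0)) ** 2 for t in set(fa) | set(fb))


def single_linkage(names, dist, threshold):
    """Cut a single-linkage hierarchy at `threshold`.

    This equals the connected components of the graph that links every
    pair whose distance is at most `threshold`. `dist` must be keyed by
    the pairs that combinations(names, 2) produces.
    """
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if dist[a, b] <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy sequences, made up for illustration (not from the paper's data set).
seqs = {
    "g1": "MKVLAAGIVGLMKVLAAG",
    "g2": "MKVLAAGIVGLMKVLSAG",
    "g3": "PPQRSTPPQRSTPPQRST",
}
names = sorted(seqs)
dist = {(a, b): ktuple_distance(seqs[a], seqs[b]) for a, b in combinations(names, 2)}
families = single_linkage(names, dist, threshold=0.05)
```

In this toy run, `g1` and `g2` differ by one residue and fall into one cluster, while `g3` shares no 2-tuples with them and stays alone. Complete-linkage clustering would instead require every pair inside a cluster to be within the threshold, which is one reason the two linkage rules can behave very differently on the same matrix.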