CD-HIT: accelerated for clustering the next-generation sequencing data

Cited by: 6677
Authors
Fu, Limin [1]
Niu, Beifang [1]
Zhu, Zhengwei [1]
Wu, Sitao [1]
Li, Weizhong [1]
Affiliations
[1] Univ Calif San Diego, Ctr Res Biol Syst, La Jolla, CA 92093 USA
Keywords
PROTEIN; IDENTIFICATION;
DOI
10.1093/bioinformatics/bts565
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.
Pages: 3150-3152
Page count: 3