Effect of k-tuple length on sample-comparison with high-throughput sequencing data

Times cited: 7
Authors
Wang, Ying [1]
Lei, Xiaoye [1]
Wang, Shun [1]
Wang, Zicheng [2,3]
Song, Nianfeng [1]
Zeng, Feng [1]
Chen, Ting [2,4,5]
Affiliations
[1] Xiamen Univ, Dept Automat, Xiamen 361005, Fujian, Peoples R China
[2] Tsinghua Univ, Bioinformat Div, TNLIST, Beijing 100084, Peoples R China
[3] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China
[4] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[5] Univ So Calif, Program Computat Biol & Bioinformat, Los Angeles, CA 90089 USA
Funding
National Natural Science Foundation of China;
Keywords
Metagenomics; High throughput sequencing; Alignment-free; Long k-tuple; Clustering; Text mining; ALIGNMENT; PHYLOGENY;
DOI
10.1016/j.bbrc.2015.11.094
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
High-throughput metagenomic sequencing offers a powerful technique for comparing microbial communities. Without requiring extra reference sequences, alignment-free models with short k-tuples (k = 2-10 bp) have yielded promising results. Short k-tuples describe the overall statistical distribution but struggle to capture the specific characteristics within a single microbial community. Longer k-tuples carry more information. However, because the frequency vector of long k-tuples (k >= 30 bp) is sparse, the statistical measures designed for short k-tuples are not applicable. In our study, we treated each tuple as a meaningful word and each sequencing dataset as a document composed of those words. The comparison between two sequencing datasets is therefore handled as "topic analysis of documents" in text mining. We designed a pipeline with long k-tuple features that compares metagenomic samples by combining algorithms from text mining and pattern recognition. The pipeline is available at http://culotuple.codeplex.com/. Experiments show that our pipeline with long k-tuple features: (1) separates genomes with high similarity; (2) outperforms short k-tuple models in all experiments: when k >= 12, the short k-tuple measures are no longer applicable, and when k is between 20 and 40, the long k-tuple pipeline obtains much better grouping results; (3) is free from the effect of sequencing platforms/protocols; and (4) yields meaningful, supported biological results on the 40-tuples selected for comparison. © 2015 Elsevier Inc. All rights reserved.
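The "tuples as words, samples as documents" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the specific text-mining algorithms used, so TF-IDF weighting and cosine similarity are assumed here as a common stand-in, and the function names (`ktuple_counts`, `tfidf_vectors`, `cosine`) are hypothetical. Vectors are stored as dictionaries because long k-tuple frequency vectors are sparse.

```python
import math
from collections import Counter


def ktuple_counts(reads, k):
    """Count every overlapping k-tuple (the "words") across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def tfidf_vectors(samples, k):
    """Turn each sample (a "document") into a sparse TF-IDF vector (a dict).

    TF-IDF is an assumed weighting scheme, not one confirmed by the source.
    """
    counts = [ktuple_counts(reads, k) for reads in samples]
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # number of samples containing each tuple
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            w: (n / total) * math.log((1 + n_docs) / (1 + doc_freq[w]))
            for w, n in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy usage: three tiny "samples", each a list of reads; k is kept small
# only so the example fits on a few lines.
samples = [["ACGTACGT"], ["ACGTACGT"], ["TTTTGGGG"]]
vectors = tfidf_vectors(samples, k=4)
similarity_same = cosine(vectors[0], vectors[1])
similarity_diff = cosine(vectors[0], vectors[2])
```

With real data, k would be set in the 20-40 range discussed in the abstract, and the resulting pairwise similarities could feed any standard clustering step.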
Pages: 1021-1027
Number of pages: 7