A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

Cited by: 19
Authors
Liao, Ruiqi [1, 2]
Zhang, Ruichang [1, 2]
Guan, Jihong [3]
Zhou, Shuigeng [1, 2]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
[3] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Metagenomics; binning; N-grams; feature weighting; algorithms; PHYLOGENETIC CLASSIFICATION; DNA-SEQUENCES; GENOMES; ALGORITHM;
DOI
10.1109/TCBB.2013.137
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
The rapid development of high-throughput technologies enables researchers to sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these sequence reads to species or taxonomic classes, referred to as binning of metagenomic data, is a crucial step in metagenomic analysis. Most traditional binning methods rely on known reference genomes to assign reads accurately and therefore cannot classify reads from unknown species that lack close references. To overcome this drawback, approaches based on unsupervised learning have been proposed, which do not require reference genomes of known species. In this paper, we introduce a novel unsupervised method called MCluster for binning metagenomic sequences. This method uses N-grams to extract sequence features and utilizes automatic feature weighting to improve the performance of the basic K-means clustering algorithm. We evaluate MCluster on a variety of simulated data sets and a real data set, and compare it with three recent binning methods: AbundanceBin, MetaCluster 3.0, and MetaCluster 5.0. Experimental results show that MCluster achieves clearly better overall performance (F-measure) than AbundanceBin and MetaCluster 3.0 on long metagenomic reads (>= 800 bp); compared with MetaCluster 5.0, MCluster obtains higher sensitivity and a comparable yet more stable F-measure on short metagenomic reads (< 300 bp). This suggests that MCluster can serve as a promising tool for effectively binning metagenomic sequences.
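The two ingredients named in the abstract, N-gram features and K-means with automatic feature weighting, can be illustrated with a minimal Python sketch. The abstract does not specify MCluster's actual weighting rule or initialization, so this sketch assumes a W-k-means-style update (weights inversely related to each feature's within-cluster dispersion) and a deterministic farthest-first start. The names `ngram_profile`, `sqdist`, and `weighted_kmeans`, the exponent `beta`, and the toy reads are all hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product


def ngram_profile(seq, n=2):
    """Normalized frequencies of all 4**n DNA n-grams, in lexicographic order."""
    keys = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[key] for key in keys)  # n-grams with other symbols are ignored
    return [counts[key] / total if total else 0.0 for key in keys]


def sqdist(a, b, w):
    """Squared Euclidean distance with per-feature multipliers w."""
    return sum(wj * (aj - bj) ** 2 for wj, aj, bj in zip(w, a, b))


def weighted_kmeans(X, k, beta=2.0, iters=20):
    """K-means with feature weights; the weighting rule is an assumption (beta > 1)."""
    d = len(X[0])
    # Deterministic farthest-first initialization (an illustrative choice).
    centers = [list(X[0])]
    while len(centers) < k:
        far = max(X, key=lambda x: min(sqdist(x, c, [1.0] * d) for c in centers))
        centers.append(list(far))
    weights = [1.0 / d] * d
    labels = [0] * len(X)
    for _ in range(iters):
        wb = [w ** beta for w in weights]
        # Assignment step: nearest center under the weighted distance.
        labels = [min(range(k), key=lambda c: sqdist(x, centers[c], wb)) for x in X]
        # Center update: per-feature mean of each non-empty cluster.
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Weight update: features with lower within-cluster dispersion get more
        # weight; features with zero dispersion get weight 0.
        disp = [sum((x[j] - centers[lab][j]) ** 2 for x, lab in zip(X, labels))
                for j in range(d)]
        inv = [(1.0 / D) ** (1.0 / (beta - 1.0)) if D > 0 else 0.0 for D in disp]
        if sum(inv) > 0:
            weights = [v / sum(inv) for v in inv]
    return labels, weights


# Toy reads (made up for illustration): two AT-rich and two GC-rich sequences.
reads = ["ATATATATATATATAT", "TATATATATATATATA",
         "GCGCGCGCGCGCGCGC", "CGCGCGCGCGCGCGCG"]
X = [ngram_profile(r, n=2) for r in reads]
labels, weights = weighted_kmeans(X, k=2)
print(labels, weights)
```

In this sketch each read becomes a 16-dimensional dinucleotide profile, and the weight update lets clustering concentrate on the N-grams that vary least within a bin. Real reads would call for a larger `n` and far more data than this toy input.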
Pages: 42-54
Number of pages: 13