A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting

Cited by: 19

Authors:

Liao, Ruiqi [1,2]

Zhang, Ruichang [1,2]

Guan, Jihong [3]

Zhou, Shuigeng [1,2]

Affiliations:

[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China

[2] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China

[3] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS | 2014, Vol. 11, No. 1

Funding:

National Natural Science Foundation of China;

Keywords:

Metagenomics; binning; N-grams; feature weighting; algorithms; PHYLOGENETIC CLASSIFICATION; DNA-SEQUENCES; GENOMES; ALGORITHM;

DOI:

10.1109/TCBB.2013.137

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

The rapid development of high-throughput technologies enables researchers to sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these sequence reads to species or taxonomic classes, referred to as binning of metagenomic data, is a crucial step in metagenomic analysis. Most traditional binning methods rely on known reference genomes for accurate assignment of the sequence reads, and therefore cannot classify reads from unknown species without closely related references. To overcome this drawback, approaches based on unsupervised learning have been proposed, which do not require a reference genome from any known species. In this paper, we introduce a novel unsupervised method called MCluster for binning metagenomic sequences. This method uses N-grams to extract sequence features and utilizes automatic feature weighting to improve the performance of the basic K-means clustering algorithm. We evaluate MCluster on a variety of simulated data sets and a real data set, and compare it with three recent binning methods: AbundanceBin, MetaCluster 3.0, and MetaCluster 5.0. Experimental results show that MCluster achieves clearly better overall performance (F-measure) than AbundanceBin and MetaCluster 3.0 on long metagenomic reads (>= 800 bp). Compared with MetaCluster 5.0, MCluster obtains a higher sensitivity and a comparable yet more stable F-measure on short metagenomic reads (< 300 bp). This suggests that MCluster can serve as a promising tool for effectively binning metagenomic sequences.
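The pipeline outlined in the abstract (N-gram features, then K-means with automatically learned feature weights) might be sketched roughly as follows. This is an illustration only, not the MCluster implementation: the abstract does not state the weighting rule, so the sketch assumes a W-k-means-style update in which features with low within-cluster dispersion receive larger weights. The function names `ngram_profile` and `weighted_kmeans`, the exponent `beta`, and the toy reads are all hypothetical.

```python
from itertools import product
import random

ALPHABET = "ACGT"

def ngram_profile(seq, n=2):
    """Normalized frequency of every length-n DNA word (N-gram) in seq."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - n + 1):
        w = seq[i:i + n]
        if w in counts:  # skip words with ambiguous bases such as N
            counts[w] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in words]

def weighted_kmeans(X, k, beta=2.0, iters=20, seed=0):
    """K-means with per-feature weights (assumed W-k-means-style update)."""
    rng = random.Random(seed)
    d = len(X[0])
    centers = [list(x) for x in rng.sample(X, k)]
    weights = [1.0 / d] * d
    labels = [0] * len(X)
    for _ in range(iters):
        # 1) Assign each read to the nearest center under the weighted distance.
        for i, x in enumerate(X):
            labels[i] = min(range(k), key=lambda c: sum(
                (weights[j] ** beta) * (x[j] - centers[c][j]) ** 2
                for j in range(d)))
        # 2) Recompute centers; an empty cluster keeps its previous center.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            if members:
                centers[c] = [sum(m[j] for m in members) / len(members)
                              for j in range(d)]
        # 3) Update weights from the within-cluster dispersion of each feature.
        D = [sum((x[j] - centers[labels[i]][j]) ** 2 for i, x in enumerate(X))
             for j in range(d)]
        nonzero = [j for j in range(d) if D[j] > 0]
        if nonzero:  # zero-dispersion features get weight 0
            p = 1.0 / (beta - 1.0)
            weights = [0.0 if D[j] == 0 else
                       1.0 / sum((D[j] / D[t]) ** p for t in nonzero)
                       for j in range(d)]
    return labels, weights

# Toy reads (made up for illustration): two AT-rich and two GC-rich.
reads = ["ATATTAAT" * 10, "AATTATAT" * 10, "GCGCCGGC" * 10, "GGCCGCGC" * 10]
X = [ngram_profile(r, n=2) for r in reads]
labels, weights = weighted_kmeans(X, k=2)
```

The unweighted case corresponds to keeping `weights` fixed at `1/d`; the weighting step is what lets uninformative N-grams contribute less to the distance. Real reads would normally use a larger `n` and many more sequences than this toy input.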


Pages: 42-54

Number of pages: 13
