CSMR: A scalable algorithm for text clustering with cosine similarity and MapReduce

Cited by: 0

Authors:

Victor, Giannakouris-Salalidis [1]

Antonia, Plerou [1]

Spyros, Sioutas [1]

Affiliations:

[1] Ionian University, Department of Informatics, Greece

Source:

IFIP Advances in Information and Communication Technology | 2014 / Vol. 437

Keywords:

Cosine similarity - Cosine similarity metric - Hadoop - Map-reduce - Normalization methods - Scalable algorithms - Term frequency-inverse document frequencies - TF-IDF

DOI:

10.1007/978-3-662-44722-2_23

Abstract:

As the Internet grows rapidly, huge amounts of text need to be processed in a short time, which calls for fast, scalable text-processing methods. This paper proposes a method for pairwise text similarity on massive data sets, using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach is built on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. By applying the MapReduce model at each step of the proposed method, text processing speed and scalability are improved relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about its efficiency are to be reached once all project phases have been completed and reviewed. © IFIP International Federation for Information Processing 2014.
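
The abstract does not spell out the individual steps, so the following is only a minimal single-process sketch of how tf-idf weighting and pairwise cosine similarity can be phrased as map and reduce phases. It is not the authors' Hadoop implementation: the function names (`tfidf_vectors`, `map_phase`, `shuffle`, `reduce_phase`, `cosine_similarities`), the whitespace tokenizer, the `tf * log(N / df)` weighting variant, and the choice to group documents by shared term are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Naive whitespace tokenizer (an assumption of this sketch).
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    vectors = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        vectors[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def map_phase(vectors):
    """Map: emit (term, (doc_id, weight)) for every non-zero weight."""
    for doc_id, vector in vectors.items():
        for term, weight in vector.items():
            if weight != 0.0:
                yield term, (doc_id, weight)


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: accumulate the dot product of each document pair term by term."""
    dots = defaultdict(float)
    for postings in groups.values():
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
            dots[(doc_a, doc_b)] += w_a * w_b
    return dots


def cosine_similarities(docs):
    """Return {(doc_a, doc_b): cosine} for pairs sharing at least one weighted term."""
    vectors = tfidf_vectors(docs)
    norms = {d: math.sqrt(sum(w * w for w in v.values())) for d, v in vectors.items()}
    dots = reduce_phase(shuffle(map_phase(vectors)))
    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
        if norms[a] and norms[b]
    }


if __name__ == "__main__":
    sample = {
        "a": "hadoop mapreduce cluster",
        "b": "hadoop mapreduce text",
        "c": "cosine similarity text",
    }
    for pair, score in sorted(cosine_similarities(sample).items()):
        print(pair, round(score, 3))
```

Grouping by term means only documents that share a term are ever compared, so pairs with no overlap produce no output at all. In a real cluster each phase would run as a distributed job instead of an in-memory loop.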

Pages: 211 - 220

50 records in total

[31] Incomplete multi-view clustering with cosine similarity
Yin, Jun
Sun, Shiliang
PATTERN RECOGNITION, 2022, 123
[32] Hierarchical Document Clustering based on Cosine Similarity measure
Popat, Shraddha K.
Deshmukh, Pramod B.
Metre, Vishakha A.
2017 1ST INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND INFORMATION MANAGEMENT (ICISIM), 2017, : 153 - 159
[33] A MODIFIED ANT-BASED TEXT CLUSTERING ALGORITHM WITH SEMANTIC SIMILARITY MEASURE
Xia, Haoxiang
Wang, Shuguang
Yoshida, Taketoshi
JOURNAL OF SYSTEMS SCIENCE AND SYSTEMS ENGINEERING, 2006, 15 (04) : 474 - 492
[34] A modified ant-based text clustering algorithm with semantic similarity measure
Haoxiang Xia
Shuguang Wang
Taketoshi Yoshida
Journal of Systems Science and Systems Engineering, 2006, 15 : 474 - 492
[35] Similarity matrix-based K-means algorithm for text clustering
曹奇敏
郭巧
吴向华
Journal of Beijing Institute of Technology, 2015, 24 (04) : 566 - 572
[36] A MODIFIED ANT-BASED TEXT CLUSTERING ALGORITHM WITH SEMANTIC SIMILARITY MEASURE
Taketoshi YOSHIDA
Journal of Systems Science and Systems Engineering, 2006, (04) : 474 - 492
[37] Text clustering based on asymmetric similarity
School of Software, Tsinghua University, Beijing 100084, China
Qinghua Daxue Xuebao, 2006, 7 (1325-1328):
[38] A Similarity Measure for Text Classification and Clustering
Lin, Yung-Shen
Jiang, Jung-Yi
Lee, Shie-Jue
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (07) : 1575 - 1590
[39] A Big Graph Clustering Algorithm Based on MapReduce
Leng, Yonglin
Zhang, Qingchen
MODERN TECHNOLOGIES IN MATERIALS, MECHANICS AND INTELLIGENT SYSTEMS, 2014, 1049 : 1467 - +
[40] MassJoin: A MapReduce-based Method for Scalable String Similarity Joins
Deng, Dong
Li, Guoliang
Hao, Shuang
Wang, Jiannan
Feng, Jianhua
2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014, : 340 - 351
