CSMR: A scalable algorithm for text clustering with cosine similarity and MapReduce

Cited by: 0
Authors
Victor, Giannakouris-Salalidis [1]
Antonia, Plerou [1]
Spyros, Sioutas [1]
Affiliations
[1] Ionian University, Department of Informatics, Greece
Keywords
Cosine similarity; Cosine similarity metric; Hadoop; Map-reduce; Normalization methods; Scalable algorithms; Term frequency-inverse document frequencies; TF-IDF
DOI
10.1007/978-3-662-44722-2_23
Abstract
As the Internet develops rapidly, huge amounts of text need to be processed in a short time. This calls for fast, scalable methods for text processing. This paper proposes a method for computing pairwise text similarity on massive data sets, using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach centers on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. Applying the MapReduce model at each step of the proposed method improves text processing speed and scalability relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise and analytical conclusions concerning the efficiency of the proposed method are to be reached upon completion and review of all project phases. © IFIP International Federation for Information Processing 2014.
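The abstract names the ingredients (tf-idf weighting, cosine similarity, MapReduce) but does not spell out the individual steps, so the following is only a minimal single-machine sketch of how those pieces could fit together, not the authors' CSMR implementation. The function names `tf_idf_vectors` and `pairwise_cosine`, the whitespace tokenizer, and the particular tf-idf variant (relative term frequency times the natural log of N over document frequency) are all assumptions made for illustration. The "map" and "reduce" phases are simulated with in-memory dictionaries rather than Hadoop jobs.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tf_idf_vectors(docs):
    """Build one sparse tf-idf vector (term -> weight) per document.

    `docs` maps a document id to its raw text. Tokenization is a plain
    lowercase whitespace split (an assumption, not taken from the paper).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(
        term for tokens in tokenized.values() for term in set(tokens)
    )
    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def pairwise_cosine(vectors):
    """Cosine similarity for every document pair that shares a term."""
    # Simulated map phase: key each (doc_id, weight) posting by its term.
    postings = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            postings[term].append((doc_id, weight))

    # Simulated reduce phase: each term contributes a partial dot product
    # to every pair of documents that contain it.
    dots = defaultdict(float)
    for term_postings in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(term_postings), 2):
            dots[(a, b)] += wa * wb

    # Euclidean norm of each document vector.
    norms = {
        doc_id: math.sqrt(sum(w * w for w in vec.values()))
        for doc_id, vec in vectors.items()
    }
    # Skip pairs where a vector has zero norm to avoid dividing by zero.
    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
        if norms[a] > 0 and norms[b] > 0
    }
```

Grouping by term before forming pairs means that document pairs with no term in common never appear in the result, which is one common way to avoid comparing all pairs on large collections. Pair keys are ordered by document id, so ids need to be mutually comparable (for example, all strings).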
Pages: 211 - 220
Related papers
50 in total
  • [31] Incomplete multi-view clustering with cosine similarity
    Yin, Jun
    Sun, Shiliang
    PATTERN RECOGNITION, 2022, 123
  • [32] Hierarchical Document Clustering based on Cosine Similarity measure
    Popat, Shraddha K.
    Deshmukh, Pramod B.
    Metre, Vishakha A.
    2017 1ST INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND INFORMATION MANAGEMENT (ICISIM), 2017, : 153 - 159
  • [33] A MODIFIED ANT-BASED TEXT CLUSTERING ALGORITHM WITH SEMANTIC SIMILARITY MEASURE
    Xia, Haoxiang
    Wang, Shuguang
    Yoshida, Taketoshi
    JOURNAL OF SYSTEMS SCIENCE AND SYSTEMS ENGINEERING, 2006, 15 (04) : 474 - 492
  • [34] A modified ant-based text clustering algorithm with semantic similarity measure
    Haoxiang Xia
    Shuguang Wang
    Taketoshi Yoshida
    Journal of Systems Science and Systems Engineering, 2006, 15 : 474 - 492
  • [35] Similarity matrix-based K-means algorithm for text clustering
    曹奇敏
    郭巧
    吴向华
    Journal of Beijing Institute of Technology, 2015, 24 (04) : 566 - 572
  • [36] A MODIFIED ANT-BASED TEXT CLUSTERING ALGORITHM WITH SEMANTIC SIMILARITY MEASURE
    Taketoshi YOSHIDA
    Journal of Systems Science and Systems Engineering, 2006, (04) : 474 - 492
  • [37] Text clustering based on asymmetric similarity
    School of Software, Tsinghua University, Beijing 100084, China
    Qinghua Daxue Xuebao, 2006, 7 (1325-1328):
  • [38] A Similarity Measure for Text Classification and Clustering
    Lin, Yung-Shen
    Jiang, Jung-Yi
    Lee, Shie-Jue
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (07) : 1575 - 1590
  • [39] A Big Graph Clustering Algorithm Based on MapReduce
    Leng, Yonglin
    Zhang, Qingchen
    MODERN TECHNOLOGIES IN MATERIALS, MECHANICS AND INTELLIGENT SYSTEMS, 2014, 1049 : 1467 - +
  • [40] MassJoin: A MapReduce-based Method for Scalable String Similarity Joins
    Deng, Dong
    Li, Guoliang
    Hao, Shuang
    Wang, Jiannan
    Feng, Jianhua
    2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014, : 340 - 351