CSMR: A scalable algorithm for text clustering with cosine similarity and MapReduce

Cited by: 0
Authors
Giannakouris-Salalidis, Victor [1]
Plerou, Antonia [1]
Sioutas, Spyros [1]
Affiliation
[1] Ionian University, Department of Informatics, Greece
Keywords
Cosine similarity; Cosine similarity metric; Hadoop; Map-reduce; Normalization methods; Scalable algorithms; Term frequency-inverse document frequencies; TF-IDF
DOI
10.1007/978-3-662-44722-2_23
Abstract
As the Internet grows rapidly, huge amounts of text need to be processed in a short time, which calls for fast, scalable text-processing methods. This paper proposes a method for computing pairwise text similarity on massive data sets using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach centers on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. By applying the MapReduce model at each step of the proposed method, text processing speed and scalability are improved relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about the efficiency of the proposed method will be drawn once all project phases have been completed and reviewed. © IFIP International Federation for Information Processing 2014.
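The pipeline outlined in the abstract (tf-idf weighting followed by pairwise cosine similarity, each step expressed as map and reduce phases) can be illustrated with a small single-process sketch. This is not the authors' implementation: the function names, whitespace tokenization, the `tf * log(N / df)` weighting variant, and the inverted-index strategy for emitting document pairs are all assumptions chosen for illustration. A real deployment would run each phase as a Hadoop job rather than as in-memory Python loops.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(docs):
    """Job 1 (sketch): turn raw text into sparse tf-idf vectors.

    Assumes whitespace tokenization and the weight tf * log(N / df).
    """
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    n_docs = len(docs)
    vectors = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        vectors[doc_id] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return vectors


def map_partial_products(vectors):
    """Map (sketch): for every term, emit one partial dot product per
    pair of documents that share the term."""
    postings = defaultdict(list)  # term -> [(doc_id, weight), ...]
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            postings[term].append((doc_id, weight))
    for entries in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(entries), 2):
            yield (d1, d2), w1 * w2


def reduce_cosine(partials, vectors):
    """Reduce (sketch): sum partial products per pair, then divide by
    the vector norms to obtain cosine similarity."""
    dots = defaultdict(float)
    for pair, product in partials:
        dots[pair] += product
    norms = {d: math.sqrt(sum(w * w for w in v.values()))
             for d, v in vectors.items()}
    return {
        (d1, d2): dot / (norms[d1] * norms[d2])
        for (d1, d2), dot in dots.items()
        if norms[d1] > 0 and norms[d2] > 0  # skip all-zero vectors
    }


def pairwise_cosine(docs):
    """Chain the phases: tf-idf, map, reduce."""
    vectors = tfidf_vectors(docs)
    return reduce_cosine(map_partial_products(vectors), vectors)


if __name__ == "__main__":
    sample = {
        "a": "hadoop mapreduce cluster",
        "b": "hadoop mapreduce text",
        "c": "cooking pasta recipe",
    }
    for pair, sim in sorted(pairwise_cosine(sample).items()):
        print(pair, round(sim, 3))
```

Only document pairs that share at least one term are emitted, so pairs with no overlap never appear in the result. Grouping by term in the map step is one common way to avoid comparing every document with every other document directly; whether CSMR uses this exact layout is not stated in the abstract.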
Pages: 211-220
Related papers
50 in total
  • [1] Scalable spectral clustering with cosine similarity
    Chen, Guangliang
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018 : 314 - 319
  • [2] A Scalable Spectral Clustering Algorithm Based on Landmark-Embedding and Cosine Similarity
    Chen, Guangliang
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2018, 2018, 11004 : 52 - 62
  • [4] A Scalable Similarity Join Algorithm Based on MapReduce and LSH
    Rivault, Sebastien
    Bamha, Mostafa
    Limet, Sebastien
    Robert, Sophie
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2022, 50 (3-4) : 360 - 380
  • [5] A fast incremental spectral clustering algorithm with cosine similarity
    Li, Ran
    Chen, Guangliang
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023 : 80 - 88
  • [6] Hierarchical Clustering Algorithm for Binary Data Based on Cosine Similarity
    Gao, Xiaonan
    Wu, Sen
    2018 8TH INTERNATIONAL CONFERENCE ON LOGISTICS, INFORMATICS AND SERVICE SCIENCES (LISS), 2018,
  • [7] Text Document Clustering Approach by Improved Sine Cosine Algorithm
    Radomirovic, Branislav
    Jovanovic, Vuk
    Nikolic, Bosko
    Stojanovic, Sasa
    Venkatachalam, K.
    Zivkovic, Miodrag
    Njegus, Angelina
    Bacanin, Nebojsa
    Strumberger, Ivana
    INFORMATION TECHNOLOGY AND CONTROL, 2023, 52 (02) : 541 - 561
  • [8] Batch Text Similarity Search with MapReduce
    Li, Rui
    Ju, Li
    Peng, Zhuo
    Yu, Zhiwei
    Wang, Chaokun
    WEB TECHNOLOGIES AND APPLICATIONS, 2011, 6612 : 412 - +
  • [9] MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm
    Li, Yufeng
    Jiang, HaiTian
    Lu, Jiyong
    Li, Xiaozhong
    Sun, Zhiwei
    Li, Min
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 40 (03) : 5295 - 5305
  • [10] Fast and scalable vector similarity joins with MapReduce
    Yang, Byoungju
    Kim, Hyun Joon
    Shim, Junho
    Lee, Dongjoo
    Lee, Sang-goo
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2016, 46 (03) : 473 - 497