CSMR: A scalable algorithm for text clustering with cosine similarity and MapReduce

Authors
Giannakouris-Salalidis, Victor [1]
Plerou, Antonia [1]
Sioutas, Spyros [1]
Affiliations
[1] Ionian University, Department of Informatics, Greece
Keywords
Cosine similarity - Cosine similarity metric - Hadoop - Map-reduce - Normalization methods - Scalable algorithms - Term frequency-inverse document frequencies - TF-IDF
DOI
10.1007/978-3-662-44722-2_23
Abstract
As the Internet develops rapidly, huge amounts of text need to be processed in a short time. This calls for fast, scalable methods for text processing. In this paper, a method for computing pairwise text similarity on massive data sets is proposed, using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach focuses on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. By applying the MapReduce model at each step of the proposed method, text processing speed and scalability are improved relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about the efficiency of the proposed method are to be reached upon completion and review of all project phases. © IFIP International Federation for Information Processing 2014.
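The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of the general idea: tf-idf weighting followed by pairwise cosine similarity, arranged as map-style and reduce-style passes. The function names `tf_idf` and `pairwise_cosine`, the tokenized input format, and the specific tf and idf formulas are assumptions made for illustration, not the authors' CSMR implementation or a Hadoop job.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tf_idf(docs):
    """docs: dict of doc_id -> list of tokens. Returns doc_id -> {term: weight}.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    # Map-like pass: emit each distinct term once per document;
    # reduce-like pass: count how many documents contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: (c / len(tokens)) * math.log(n / df[term])
            for term, c in counts.items()
        }
    return vectors


def pairwise_cosine(vectors):
    """Returns {(doc_a, doc_b): cosine} for pairs sharing at least one weighted term."""
    # Map-like pass: invert the vectors into term -> [(doc_id, weight), ...].
    postings = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, w in vec.items():
            if w != 0.0:
                postings[term].append((doc_id, w))

    # Reduce-like pass: for every term, emit partial products per document
    # pair and sum them into the dot product of that pair.
    dots = defaultdict(float)
    for plist in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            dots[(a, b)] += wa * wb

    # Vector norms, then normalize each dot product.
    norms = {d: math.sqrt(sum(w * w for w in v.values())) for d, v in vectors.items()}
    return {(a, b): dot / (norms[a] * norms[b]) for (a, b), dot in dots.items()}


if __name__ == "__main__":
    corpus = {
        "d1": ["map", "reduce", "text"],
        "d2": ["map", "reduce", "cluster"],
        "d3": ["cosine", "similarity"],
    }
    for pair, sim in sorted(pairwise_cosine(tf_idf(corpus)).items()):
        print(pair, round(sim, 4))
```

Grouping partial products by term means only document pairs that share a term are ever compared, which is one common way such a computation is split across mappers and reducers; whether CSMR partitions the work this way is not stated in the abstract.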
Pages: 211-220
Related papers
  • [21] An Efficient Similarity Join Algorithm with Cosine Similarity Predicate
    Lee, Dongjoo
    Park, Jaehui
    Shim, Junho
    Lee, Sang-goo
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT 2, 2010, 6262 : 422 - +
  • [22] An Improved Cosine Similarity Algorithm Based on Document Similarity
    Lee, Ming
    Zhao, Heji
    INTERNATIONAL SYMPOSIUM ON FUZZY SYSTEMS, KNOWLEDGE DISCOVERY AND NATURAL COMPUTATION (FSKDNC 2014), 2014, : 196 - 204
  • [23] Clustering Algorithm for Privacy Preservation on MapReduce
    Zhao, Zheng
    Shang, Tao
    Liu, Jianwei
    Guan, Zhengyu
    CLOUD COMPUTING AND SECURITY, PT II, 2018, 11064 : 622 - 632
  • [24] MapReduce FCM clustering set algorithm
    Mesmin J Mbyamm Kiki
    Jianbiao Zhang
    Bonzou Adolphe Kouassi
    Cluster Computing, 2021, 24 : 489 - 500
  • [25] MapReduce FCM clustering set algorithm
    Kiki, Mesmin J. Mbyamm
    Zhang, Jianbiao
    Kouassi, Bonzou Adolphe
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2021, 24 (01): : 489 - 500
  • [26] A fast text similarity measure for large document collections using multireference cosine and genetic algorithm
    Mohammadi, Hamid
    Khasteh, Seyed Hossein
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2020, 28 (02) : 999 - 1013
  • [27] Distance Weighted Cosine Similarity Measure for Text Classification
    Li, Baoli
    Han, Liping
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2013, 2013, 8206 : 611 - 618
  • [28] Scalable Collaborative Filtering Recommendation Algorithm with MapReduce
    Shang, Yang
    Li, Zhiyang
    Qu, Wenyu
    Xu, Yujie
    Song, Zining
    Zhou, Xuefei
    2014 IEEE 12TH INTERNATIONAL CONFERENCE ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING (DASC)/2014 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED COMPUTING (EMBEDDEDCOM)/2014 IEEE 12TH INTERNATIONAL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING (PICOM), 2014, : 103 - 108
  • [29] A Methodology Combining Cosine Similarity with Classifier for Text Classification
    Park, Kwangil
    Hong, June Seok
    Kim, Wooju
    APPLIED ARTIFICIAL INTELLIGENCE, 2020, 34 (05) : 396 - 411
  • [30] Towards a Scalable Set Similarity Join Using MapReduce and LSH
    Rivault, Sebastien
    Bamha, Mostafa
    Limet, Sebastien
    Robert, Sophie
    COMPUTATIONAL SCIENCE - ICCS 2022, PT I, 2022, : 569 - 583