CSMR: A scalable algorithm for text clustering with cosine similarity and MapReduce

Cited by: 0
Authors
Giannakouris-Salalidis, Victor [1]
Plerou, Antonia [1]
Sioutas, Spyros [1]
Affiliations
[1] Ionian University, Department of Informatics, Greece
Keywords
Cosine similarity - Cosine similarity metric - Hadoop - Map-reduce - Normalization methods - Scalable algorithms - Term frequency-inverse document frequencies - TF-IDF
DOI
10.1007/978-3-662-44722-2_23
Abstract
As the Internet grows rapidly, huge amounts of text need to be processed in a short time, which calls for fast, scalable text-processing methods. This paper proposes a method for pairwise text similarity on massive data sets, using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach centers on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. Applying the MapReduce model at each step of the proposed method improves text-processing speed and scalability relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about its efficiency are to be reached once all project phases have been completed and reviewed. © IFIP International Federation for Information Processing 2014.
Pages: 211-220
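
The abstract names the ingredients (tf-idf weighting, pairwise cosine similarity, MapReduce at each step) but not the concrete jobs, so the following is only a minimal in-memory sketch of how such a pipeline might be arranged. It is not the authors' CSMR implementation and does not use Hadoop. The helper `map_reduce`, the functions `tfidf_vectors` and `pairwise_cosine`, whitespace tokenization, and the idf formula log(N / df) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group emitted values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def tfidf_vectors(docs):
    """docs: {doc_id: text} -> {doc_id: {term: tf-idf weight}}."""
    n_docs = len(docs)

    # Job 1: emit (term, (doc_id, tf)); the reducer keeps the posting list.
    def tf_mapper(item):
        doc_id, text = item
        counts = Counter(text.lower().split())  # assumed tokenizer
        total = sum(counts.values())
        for term, count in counts.items():
            yield term, (doc_id, count / total)

    postings = map_reduce(docs.items(), tf_mapper, lambda _term, vals: vals)

    # Weight each posting by idf = log(N / df); df is the posting-list length.
    vectors = defaultdict(dict)
    for term, plist in postings.items():
        idf = math.log(n_docs / len(plist))
        for doc_id, tf in plist:
            vectors[doc_id][term] = tf * idf
    return dict(vectors)


def pairwise_cosine(vectors):
    """Cosine similarity for every document pair sharing at least one term."""
    norms = {d: math.sqrt(sum(w * w for w in vec.values()))
             for d, vec in vectors.items()}

    # Job 2: invert the vectors into term -> [(doc_id, weight)].
    def invert_mapper(item):
        doc_id, vec = item
        for term, weight in vec.items():
            yield term, (doc_id, weight)

    index = map_reduce(vectors.items(), invert_mapper, lambda _t, vals: vals)

    # Job 3: per term, emit partial dot products; the reducer sums them.
    def pair_mapper(item):
        _term, plist = item
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            yield (a, b), wa * wb

    dots = map_reduce(index.items(), pair_mapper, lambda _pair, vals: sum(vals))

    # Skip documents whose vector is all zeros (only corpus-wide terms).
    return {(a, b): dot / (norms[a] * norms[b])
            for (a, b), dot in dots.items() if norms[a] and norms[b]}
```

Calling `pairwise_cosine(tfidf_vectors(docs))` on a dictionary of `{doc_id: text}` returns a dictionary keyed by ordered document-ID pairs. Pairs that share no term never appear, because the inverted index generates candidate pairs only from shared terms; this is one common way to avoid materializing all N² comparisons, and whether CSMR does the same is not stated in the abstract.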