A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm

Cited by: 0
Authors
Benlaehmi, Yassine [1]
El Yazidi, Abdelaziz [1]
Hasnaoui, Moulay Lahcen [1]
Affiliations
[1] ENSAM Moulay Ismail Univ, LMMI Lab, Meknes 50000, Morocco
Keywords
Big data; Hadoop; Spark; machine learning; Hadoop Distributed File System (HDFS); MapReduce; word count; BIG DATA; PLACEMENT STRATEGY; ANALYTICS; IMPACT
DOI
10.14569/IJACSA.2021.0120495
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
With the Big Data explosion brought about by the Information Technology (IT) revolution of the last few decades, processing and analyzing data at low cost and in minimum time has become immensely challenging. The field of Big Data analytics is driven by demand for Machine Learning (ML) data processing, real-time stream processing, and graphics processing. The most efficient solutions for Big Data analysis in a distributed environment are Hadoop and Spark, both administered by Apache. Both are open-source data management frameworks that allow large datasets to be distributed and processed across multiple clusters of computing nodes. This paper provides a comprehensive comparison between Apache Hadoop and Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes the primary components of the Hadoop and Spark frameworks in order to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications, whereas Hadoop is more viable for applications dealing with bigger datasets. The case study evaluates the performance of Hadoop components, such as MapReduce and the Hadoop Distributed File System (HDFS), by applying them to the well-known Word Count algorithm to ascertain their efficacy in terms of storage and computational time. It also analyzes how Spark's in-memory processing could reduce the computational time of the Word Count algorithm.
Pages: 778-788
Page count: 11
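
The Word Count computation named in the abstract can be illustrated with a minimal single-process sketch of the map, shuffle, and reduce phases. This is not the authors' implementation and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show the shape of the algorithm.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted counts by word (the step a framework runs between map and reduce)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data big clusters", "Spark and Hadoop process big data"]
    print(word_count(sample))
```

In a real deployment the map and reduce steps would run on different nodes, with the framework handling the shuffle. The sketch keeps everything in one process so that the data flow is easy to follow.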