A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm

Cited by: 0

Authors:

Benlaehmi, Yassine [1]

El Yazidi, Abdelaziz [1]

Hasnaoui, Moulay Lahcen [1]

Affiliations:

[1] ENSAM Moulay Ismail Univ, LMMI Lab, Meknes 50000, Morocco

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2021 / Vol. 12 / No. 04

Keywords:

Big data; Hadoop; Spark; machine learning; Hadoop Distributed File System (HDFS); MapReduce; word count; placement strategy; analytics; impact

DOI:

10.14569/IJACSA.2021.0120495

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

The Big Data explosion brought on by the Information Technology (IT) revolution of the last few decades has made processing and analyzing data at low cost and in minimum time immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing workloads. The most efficient solutions for Big Data analysis in a distributed environment are Hadoop and Spark, both administered by Apache. Both are open-source data management frameworks that distribute and process large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison of Apache Hadoop and Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes the primary components of the Hadoop and Spark frameworks in order to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications, whereas Hadoop is more viable for applications dealing with bigger datasets. The case study evaluates the performance of Hadoop components such as MapReduce and the Hadoop Distributed File System (HDFS) by applying them to the well-known Word Count algorithm to ascertain their efficacy in terms of storage and computational time. It then analyzes how Spark's in-memory processing could reduce the computational time of the Word Count algorithm.
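
For readers unfamiliar with the benchmark, the Word Count task named in the abstract can be sketched as a map, shuffle, and reduce pipeline. The following is an illustrative single-process Python sketch only, not the authors' Hadoop or Spark code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the lowercase, whitespace-based tokenization are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group the emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = ["big data big clusters", "Spark and Hadoop process big data"]
    print(word_count(sample))
```

In a distributed framework the map and reduce steps would run on different nodes, with the shuffle moving data between them; this sketch keeps everything in one process to show only the logic.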

Pages: 778-788

Page count: 11
