A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm

Cited by: 0

Authors:

Benlaehmi, Yassine [1]

El Yazidi, Abdelaziz [1]

Hasnaoui, Moulay Lahcen [1]

Affiliation:

[1] ENSAM Moulay Ismail Univ, LMMI Lab, Meknes 50000, Morocco

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2021 / Vol. 12 / No. 04

Keywords:

Big data; Hadoop; Spark; machine learning; Hadoop Distributed File System (HDFS); MapReduce; word count; placement strategy; analytics; impact

DOI:

10.14569/IJACSA.2021.0120495

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

The explosion of Big Data brought about by the Information Technology (IT) revolution of the last few decades has made processing and analyzing data at low cost and in minimal time immensely challenging. The field of Big Data analytics is driven by the demand for processing Machine Learning (ML) data and real-time streaming data, and for graphics processing. The most efficient solutions for Big Data analysis in a distributed environment are Hadoop and Spark, both administered by Apache. Both are open-source data management frameworks that distribute and compute large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison of Apache Hadoop and Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes the primary components of the Hadoop and Spark frameworks in order to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications, whereas Hadoop is more viable for applications dealing with bigger datasets. The case study evaluates the performance of Hadoop components such as MapReduce and the Hadoop Distributed File System (HDFS) by applying them to the well-known Word Count algorithm to ascertain their efficacy in terms of storage and computational time. It then analyzes how Spark's in-memory processing could reduce the computational time of the Word Count algorithm.
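For readers unfamiliar with the benchmark, the Word Count algorithm named in the abstract can be sketched in plain Python as three phases: map, shuffle, and reduce. This is only an illustrative, single-process sketch and not the authors' implementation; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the tokenization rule are assumptions made for this example, and no Hadoop or Spark API is used.

```python
import re
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def shuffle_phase(pairs):
    """Shuffle step: group the emitted counts by word (the key)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    sample = ["Hadoop and Spark", "Spark counts words", "hadoop stores words"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```

In this sketch every intermediate result stays in local memory. In a distributed setting, the handling of the intermediate (word, 1) pairs between phases is where the two frameworks differ: Hadoop MapReduce persists them to disk, while Spark can keep them in memory, which is the effect on computational time that the abstract refers to.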

Pages: 778-788

Page count: 11
