A Comparative Analysis of Hadoop and Spark Frameworks using Word Count Algorithm

Authors
Benlaehmi, Yassine [1]
El Yazidi, Abdelaziz [1]
Hasnaoui, Moulay Lahcen [1]
Affiliations
[1] ENSAM Moulay Ismail Univ, LMMI Lab, Meknes 50000, Morocco
Keywords
Big data; Hadoop; Spark; machine learning; Hadoop Distributed File System (HDFS); MapReduce; word count; BIG DATA; PLACEMENT STRATEGY; ANALYTICS; IMPACT
DOI
10.14569/IJACSA.2021.0120495
Chinese Library Classification number
TP301 [Theory and Methods]
Discipline classification code
081202
Abstract
With the Big Data explosion brought about by the Information Technology (IT) revolution of the last few decades, processing and analyzing data at low cost and in minimum time has become immensely challenging. The field of Big Data analytics is driven by the demand to process Machine Learning (ML) data, real-time streaming data, and graphics processing. The most efficient solutions for Big Data analysis in a distributed environment are Hadoop and Spark, both administered by Apache. Both are open-source data management frameworks that distribute and compute large datasets across multiple clusters of computing nodes. This paper provides a comprehensive comparison between Apache Hadoop and Apache Spark in terms of efficiency, scalability, security, cost-effectiveness, and other parameters. It describes the primary components of the Hadoop and Spark frameworks in order to compare their performance. The major conclusion is that Spark is better in terms of scalability and speed for real-time streaming applications, whereas Hadoop is more viable for applications dealing with bigger datasets. This case study evaluates the performance of various components of Hadoop, such as MapReduce and the Hadoop Distributed File System (HDFS), by applying them to the well-known Word Count algorithm to ascertain their efficacy in terms of storage and computational time. Subsequently, it also analyzes how Spark's in-memory processing could reduce the computational time of the Word Count algorithm.
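For readers unfamiliar with the Word Count algorithm named in the abstract, the following is a minimal single-process sketch of its map, shuffle, and reduce phases. It is an illustration only, not the authors' implementation: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, tokenization by whitespace with lowercasing is an assumption, and no Hadoop or Spark API is used.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated token.

    Lowercasing is an assumption of this sketch, not taken from the paper.
    """
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted counts by their word key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    sample = ["big data big", "Data spark"]
    print(word_count(sample))
```

In a distributed framework the map and reduce steps would run in parallel on different nodes, and the shuffle would move data between them; this sketch only shows the logical data flow on one machine.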
Pages: 778-788
Number of pages: 11