Efficient Cache Update for In-Memory Cluster Computing with Spark

Cited by: 3
Authors
Ho, Li-Yung [1]
Wu, Jan-Jan [2]
Liu, Pangfeng [3]
Shih, Chia-Chun [4]
Huang, Chi-Chang [4]
Huang, Chao-Wen [4]
Affiliations
[1] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei, Taiwan
[2] Acad Sinica, Inst Informat Sci, Res Ctr Informat Technol Innovat, Taipei, Taiwan
[3] Natl Taiwan Univ, Grad Inst Networking & Multimedia, Dept Comp Sci & Informat Engn, Taipei, Taiwan
[4] Chunghwa Telecom Labs, Taipei, Taiwan
Source
2017 17TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID) | 2017
Keywords
big data computing; cache update; Spark; resilient distributed dataset; telecom billing system;
DOI
10.1109/CCGRID.2017.21
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
This paper proposes a scalable and efficient cache update technique to improve the performance of in-memory cluster computing in Spark, a popular open-source system for big data computing. Although the memory cache speeds up data processing in Spark, its data immutability constraint requires reloading the whole RDD when part of its data is updated. This constraint makes RDD updates inefficient. To address this problem, we divide an RDD into partitions and propose the partial-update RDD (PRDD) method, which enables users to replace individual partitions of an RDD. We devise two solutions to the RDD partition problem: a dynamic programming algorithm and a nonlinear programming method. Experimental results suggest that PRDD achieves a 4.32x speedup compared with the original RDD in Spark. We apply PRDD to a billing system for Chunghwa Telecom (CHT), the largest telecommunication company in Taiwan. Our results show that the PRDD-based billing system outperforms the original CHT billing system by a factor of 24 in throughput. We also evaluate PRDD using the TPC-H benchmark, which also yields promising results.
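The partial-update idea described in the abstract can be illustrated with a minimal plain-Python sketch. This is not the authors' PRDD implementation and it uses no Spark API; `PartitionedCache`, `replace_partition`, and the month-style keys are hypothetical names chosen only for illustration. The sketch also does not address how partition boundaries are chosen, which is what the paper's dynamic programming and nonlinear programming methods are for.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PartitionedCache:
    """Immutable cache whose records are grouped into keyed partitions."""

    partitions: Dict[str, Tuple[int, ...]]

    def replace_partition(self, key: str, records: Tuple[int, ...]) -> "PartitionedCache":
        # Return a new cache object. Every untouched partition is shared
        # with the old cache; only the partition stored under `key` is new.
        updated = dict(self.partitions)  # shallow copy: tuples are reused
        updated[key] = records
        return PartitionedCache(updated)

    def total(self) -> int:
        # Example query that reads every partition.
        return sum(sum(part) for part in self.partitions.values())


# Hypothetical data: three partitions, one of which receives an update.
old = PartitionedCache({"2017-01": (10, 20), "2017-02": (5, 5), "2017-03": (7,)})
new = old.replace_partition("2017-02", (6, 6))
```

The old cache is left unchanged, which respects immutability, while the cost of the update is limited to the replaced partition rather than the whole data set.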
Pages: 21-30
Page count: 10