DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

Cited by: 0

Authors:

Shang, Pengju [1]

Xiao, Qiangju [1]

Wang, Jun [1]

Affiliations:

[1] Univ Cent Florida, Orlando, FL 32816 USA

Source:

2012 DIGEST ASIA-PACIFIC MAGNETIC RECORDING CONFERENCE (APMRC) | 2012

Funding:

U.S. National Science Foundation;

Keywords:

MapReduce; Hadoop; Data-intensive; Data layout;

DOI:

Not available

Chinese Library Classification:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Recent years have seen an increasing number of scientists employ data-parallel computing frameworks such as MapReduce and Hadoop to run data-intensive applications and conduct analysis. In these co-located compute and storage frameworks, a wise data placement scheme can significantly improve performance. Existing data-parallel frameworks, e.g., Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data-intensive applications exhibit interest locality: they sweep only part of a big data set. Data are often accessed together because of their grouping semantics. Without taking data grouping into consideration, random placement does not perform well and falls well below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses from system log files. It extracts optimal data groupings and reorganizes data layouts to achieve the maximum parallelism per group subject to load balance. By experimenting with two real-world MapReduce applications under different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves overall performance by 36.4%, in comparison with Hadoop's default random placement.


Pages: 8
