DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

Cited by: 0
Authors
Shang, Pengju [1]
Xiao, Qiangju [1]
Wang, Jun [1]
Affiliation
[1] Univ Cent Florida, Orlando, FL 32816 USA
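Source
2012 DIGEST ASIA-PACIFIC MAGNETIC RECORDING CONFERENCE (APMRC) | 2012
Funding
National Science Foundation (USA)
Keywords
MapReduce; Hadoop; Data-intensive; Data layout
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract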
Recent years have seen a growing number of scientists employ data-parallel computing frameworks such as MapReduce and Hadoop to run data-intensive applications and conduct analyses. In these frameworks, where compute and storage are co-located, a well-chosen data placement scheme can significantly improve performance. Existing data-parallel frameworks, e.g., Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data-intensive applications exhibit interest locality: they sweep only part of a big data set, and the data that are often accessed together are determined by their grouping semantics. Because it does not take data grouping into consideration, random placement performs well below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses recorded in system log files. It extracts optimal data groupings and reorganizes data layouts to achieve the maximum parallelism per group subject to load balance. By running two real-world MapReduce applications with different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves the overall performance by 36.4%, in comparison with Hadoop's default random placement.
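The abstract names three steps (mine access logs, extract groupings, lay out each group for parallelism under load balance) without giving the algorithms. The toy Python sketch below illustrates the general idea only and is not the authors' method: the log format, the co-access `threshold`, the union-find grouping, and the greedy least-loaded placement are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_access_counts(log):
    """Count how often each pair of blocks is read by the same task.

    `log` is an assumed format: one list of block IDs per task.
    """
    counts = defaultdict(int)
    for blocks in log:
        for a, b in combinations(sorted(set(blocks)), 2):
            counts[(a, b)] += 1
    return counts


def group_blocks(log, threshold=2):
    """Merge blocks that are co-accessed at least `threshold` times (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for blocks in log:
        for b in blocks:
            find(b)  # register every block, even ungrouped ones
    for (a, b), n in co_access_counts(log).items():
        if n >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for b in sorted(parent):
        groups[find(b)].append(b)
    return sorted(groups.values(), key=len, reverse=True)


def place(groups, num_nodes):
    """Spread each group over distinct nodes, preferring the least-loaded node."""
    load = [0] * num_nodes
    placement = {}
    for group in groups:
        used = set()  # nodes already holding a block of this group in this round
        for block in group:
            if len(used) == num_nodes:
                used.clear()  # group is larger than the cluster: start a new round
            candidates = [n for n in range(num_nodes) if n not in used]
            node = min(candidates, key=lambda n: (load[n], n))
            placement[block] = node
            load[node] += 1
            used.add(node)
    return placement, load


if __name__ == "__main__":
    # Hypothetical access log: two tasks sweep blocks a-d, two sweep e-f.
    log = [["a", "b", "c", "d"], ["a", "b", "c", "d"],
           ["e", "f"], ["e", "f"], ["a", "e"]]
    groups = group_blocks(log)
    placement, load = place(groups, num_nodes=4)
    for group in groups:
        nodes = {placement[b] for b in group}
        print(group, "spans", len(nodes), "nodes")
    print("per-node load:", load)
```

In this sketch a group's parallelism is simply the number of distinct nodes that hold its blocks, so a job sweeping one group can run that many local map tasks at once. Random placement gives no such guarantee for any particular group.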
Pages: 8
Related papers
50 in total
  • [1] DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications With Interest Locality
    Wang, Jun
    Xiao, Qiangju
    Yin, Jiangling
    Shang, Pengju
    IEEE TRANSACTIONS ON MAGNETICS, 2013, 49 (06) : 2514 - 2520
  • [2] A new data-grouping-aware dynamic data placement method that take into account jobs execute frequency for Hadoop
    Wu, Jia-xuan
    Zhang, Chang-sheng
    Zhang, Bin
    Wang, Peng
    MICROPROCESSORS AND MICROSYSTEMS, 2016, 47 : 161 - 169
  • [3] DPPACS: A Novel Data Partitioning and Placement Aware Computation Scheduling Scheme for Data-Intensive Cloud Applications
    Reddy, K. Hemant Kumar
    Roy, Diptendu Sinha
    COMPUTER JOURNAL, 2016, 59 (01) : 64 - 82
  • [4] CLUST - Grouping Aware Data Placement for Improving the Performance of Large-Scale Data Management System
    Vengadeswaran, Shanmugasundaram
    Balasundaram, Sadhu Ramakrishnan
    PROCEEDINGS OF THE 7TH ACM IKDD CODS AND 25TH COMAD (CODS-COMAD 2020), 2020, : 1 - 9
  • [5] A data placement strategy for data-intensive applications in cloud
    Zheng P.
    Cui L.-Z.
    Wang H.-Y.
    Xu M.
    Jisuanji Xuebao/Chinese Journal of Computers, 2010, 33 (08) : 1472 - 1480
  • [6] BRPS: A Big Data Placement Strategy for Data Intensive Applications
    Liu, Lihui
    Song, Junping
    Wang, Haibo
    Lv, Pin
    2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2016, : 813 - 820
  • [7] Awan: Locality-aware Resource Manager for Geo-distributed Data-intensive Applications
    Jonathan, Albert
    Chandra, Abhishek
    Weissman, Jon
    PROCEEDINGS 2016 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING (IC2E), 2016, : 32 - 41
  • [8] EnLoc: Data Locality-aware Energy-efficient Scheduling Scheme for Cloud Data Centers
    Kaur, Kuljeet
    Kumar, Neeraj
    Garg, Sahil
    Rodrigues, Joel J. P. C.
    2018 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2018
  • [9] A novel approach for improving data locality of MapReduce applications in cloud environment through intelligent data placement
    Shabeera, T. P.
    Kumar, S. D. Madhu
    INTERNATIONAL JOURNAL OF SERVICES TECHNOLOGY AND MANAGEMENT, 2020, 26 (04) : 323 - 340
  • [10] A new paradigm in data intensive computing: Stork and the data-aware schedulers
    Kosar, Tevfik
    Challenges of Large Applications in Distributed Environments, Proceedings, 2006, : 5 - 12