RHJoin: A Fast and Space-efficient Join Method for Log Processing in MapReduce

Authors
Tang, Dixin [1]
Liu, Taoying [1]
Liu, Hong [1]
Li, Wei [1]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
Source
2014 20TH IEEE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS) | 2014
Keywords
MapReduce; Join; Log Processing; Big data; MAP-REDUCE; SYSTEM
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Equi-join is heavily used in MapReduce-based log processing. With the rapid growth of dataset sizes, join methods on MapReduce have been extensively studied in recent years. We find that existing join methods usually cannot achieve high query performance and affordable storage consumption at the same time when faced with a huge amount of log data. They either optimize only one aspect while significantly sacrificing the other, or have limited applicability. In this paper, after analyzing the characteristics of the workloads and of the underlying MapReduce framework, we present RHJoin (Repartition Hash Join), a join method with specific optimizations for log processing, and its implementation on Hadoop. In RHJoin, reference tables are partitioned in the pre-processing step, the log table is partitioned on the map side, and the hash join is executed on the reduce side. The shuffle procedure of MapReduce is also optimized by removing the sort step and overlapping the execution of mappers and reducers. Comprehensive experiments show that RHJoin achieves high query performance with only a small extra storage cost, and that it applies to a wide range of log processing scenarios.
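The three stages named in the abstract (pre-partitioned reference table, map-side partitioning of the log table, reduce-side hash join) can be illustrated with a minimal single-process sketch. This is not the authors' Hadoop implementation: the function names, the dictionary row format, the `user_id` key used in the usage note, and the use of Python's built-in `hash` as the partition function are all assumptions made for illustration, and the shuffle optimizations (no sort step, overlapped mappers and reducers) are not modeled.

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Bucket rows by a hash of the join key (hypothetical partition function)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    return buckets


def hash_join_partition(ref_rows, log_rows, key):
    """Join one partition: build a hash table on the reference rows, probe with log rows."""
    table = defaultdict(list)
    for ref in ref_rows:
        table[ref[key]].append(ref)
    for log in log_rows:
        # Log rows without a matching reference row are dropped (inner equi-join).
        for ref in table.get(log[key], ()):
            yield {**ref, **log}


def rhjoin_sketch(reference, log, key, num_partitions=4):
    """Toy model of the pipeline described in the abstract (assumed structure)."""
    # Pre-processing step: the reference table is partitioned once, ahead of queries.
    ref_parts = partition(reference, key, num_partitions)
    # Map side: the log table is partitioned with the same function at query time.
    log_parts = partition(log, key, num_partitions)
    # Reduce side: each partition is joined independently, with no sorting involved.
    joined = []
    for p in range(num_partitions):
        joined.extend(
            hash_join_partition(ref_parts.get(p, []), log_parts.get(p, []), key)
        )
    return joined
```

For example, joining a small reference table of users with a log of page visits on `user_id` returns one merged row per matching log entry; log entries whose key is absent from the reference table produce no output. Because both tables use the same partition function, matching keys always land in the same partition, which is what allows each reducer to work on its partition in isolation.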
Pages: 975-980
Page count: 6