G-Hadoop: MapReduce across distributed data centers for data-intensive computing

Cited by: 243
Authors
Wang, Lizhe [1, 2]
Tao, Jie [3 ]
Ranjan, Rajiv [4 ]
Marten, Holger [3 ]
Streit, Achim [3, 6]
Chen, Jingying [5 ]
Chen, Dan [1 ]
Affiliations
[1] China Univ Geosci, Sch Comp, Wuhan 430074, Peoples R China
[2] Chinese Acad Sci, Ctr Earth Observat & Digital Earth, Beijing 100864, Peoples R China
[3] Karlsruhe Inst Technol, Steinbuch Ctr Comp, D-76021 Karlsruhe, Germany
[4] CSIRO, ICT Ctr, Informat Engn Lab, Canberra, ACT, Australia
[5] Cent China Normal Univ, Natl Engn Ctr E Learning, Beijing, Peoples R China
[6] Karlsruhe Inst Technol, Inst Telemat, Dept Informat, D-76021 Karlsruhe, Germany
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2013 / Vol. 29 / Issue 03
Funding
National Natural Science Foundation of China;
Keywords
Cloud computing; Massive data processing; Data-intensive computing; Hadoop; MapReduce; CLOUD;
DOI
10.1016/j.future.2012.09.001
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The computational requirements for large-scale, data-intensive analysis of scientific data have grown significantly in recent years. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed at more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly successful programming model for large-scale data-intensive computing applications. However, current MapReduce implementations are developed to operate in single-cluster environments and cannot be leveraged for large-scale distributed data processing across multiple clusters. Workflow systems, on the other hand, are used for distributed data processing across data centers, but the workflow paradigm has been reported to have limitations for distributed data processing, such as in reliability and efficiency. In this paper, we present the design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters. © 2012 Elsevier B.V. All rights reserved.
Pages: 739-750
Page count: 12