Visualization and Adaptive Subsetting of Earth Science Data in HDFS: A Novel Data Analysis Strategy with Hadoop and Spark

Cited by: 3
Authors
Yang, Xi [1]
Liu, Si [1]
Feng, Kun [1]
Zhou, Shujia [2]
Sun, Xian-He [1]
Affiliations
[1] IIT, Dept Comp Sci, Chicago, IL 60616 USA
[2] Northrop Grumman Informat Technol, McLean, VA USA
Source
PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCES ON BIG DATA AND CLOUD COMPUTING (BDCLOUD 2016) SOCIAL COMPUTING AND NETWORKING (SOCIALCOM 2016) SUSTAINABLE COMPUTING AND COMMUNICATIONS (SUSTAINCOM 2016) (BDCLOUD-SOCIALCOM-SUSTAINCOM 2016) | 2016
Keywords
Visualization; R; MapReduce; Hadoop; Spark
DOI
10.1109/BDCloud-SocialCom-SustainCom.2016.24
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Data analytics is becoming increasingly important in big data applications. Adaptively subsetting large amounts of data to extract interesting events, such as the centers of hurricanes or thunderstorms, and then statistically analyzing and visualizing the subset data, is an effective way to analyze ever-growing data. This is particularly crucial for analyzing Earth Science data, such as extreme weather. The Hadoop ecosystem (e.g., HDFS, MapReduce, Hive) provides a cost-efficient big data management environment and is being explored for analyzing big Earth Science data. Our study investigates the potential of a MapReduce-like paradigm to perform statistical calculations, and uses the calculated results to subset and visualize data in a scalable and efficient way. RHadoop and SparkR are deployed to enable R to access and process data in parallel with Hadoop and Spark, respectively. Regular R libraries and tools are used to create and manipulate images. Statistical calculations, such as maximum and average variable values, are carried out with R or SQL. We have developed a strategy to conduct query and visualization within one phase, and thus significantly improve the overall performance in a scalable way. The technical challenges and limitations of both Hadoop and Spark platforms for R are also discussed.
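The idea of computing statistics in a MapReduce-like pass and then using the result to subset the data can be illustrated with a small sketch. This is not the authors' implementation (the paper uses R with RHadoop and SparkR); it is plain Python with no cluster, and the records, the `(lat, lon, wind_speed)` layout, the function names, and the 1-degree box are all hypothetical choices made for illustration.

```python
from functools import reduce

# Hypothetical toy records: (lat, lon, wind_speed). Not data from the paper.
# Each inner list stands in for one data partition (e.g. one HDFS block).
partitions = [
    [(25.0, -80.0, 12.0), (25.5, -79.5, 30.0)],
    [(26.0, -79.0, 55.0), (26.5, -78.5, 41.0)],
    [(40.0, -100.0, 5.0)],
]


def map_partition(records):
    """Map step: summarize one partition as (max_record, sum, count)."""
    max_rec = max(records, key=lambda r: r[2])
    total = sum(r[2] for r in records)
    return max_rec, total, len(records)


def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    max_rec = a[0] if a[0][2] >= b[0][2] else b[0]
    return max_rec, a[1] + b[1], a[2] + b[2]


# Global maximum and mean, obtained by combining per-partition summaries.
max_rec, total, count = reduce(combine, map(map_partition, partitions))
mean_speed = total / count


def subset_near(parts, center, half_width):
    """Keep records inside a lat/lon box centered on the given record."""
    clat, clon, _ = center
    return [
        r
        for part in parts
        for r in part
        if abs(r[0] - clat) <= half_width and abs(r[1] - clon) <= half_width
    ]


# Use the computed statistic (location of the maximum) to drive the subset.
subset = subset_near(partitions, max_rec, half_width=1.0)
print(max_rec, mean_speed, len(subset))
```

Carrying `(sum, count)` rather than a per-partition mean keeps the reduce step associative, so partial summaries can be merged in any order. In a real deployment the subset would then be handed to a plotting routine, which is the step the abstract describes as performed with regular R libraries.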
Pages: 89-96
Page count: 8