Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data

Cited by: 14
Authors
Hu, Fei [1, 2]
Xu, Mengchao [1, 2]
Yang, Jingchao [1, 2]
Liang, Yanshou [1, 2]
Cui, Kejin [1, 2]
Little, Michael M. [3]
Lynnes, Christopher S. [3]
Duffy, Daniel Q. [3]
Yang, Chaowei [1, 2]
Affiliations
[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA
[2] George Mason Univ, Dept Geog & GeoInformat Sci, Fairfax, VA 22030 USA
[3] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
Source
ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION | 2018 / Vol. 7 / Issue 04
Funding
National Science Foundation (USA);
Keywords
big data; data container; geospatial raster data management; GIS; SYSTEM; PERFORMANCE;
DOI
10.3390/ijgi7040144
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Big geospatial raster data pose a grand challenge to data management technologies for effective big data query and processing. To address this challenge, various big data container solutions have been developed or enhanced to facilitate data storage, retrieval, and analysis. Data containers have also been developed or enhanced to handle geospatial data. For example, Rasdaman was developed to handle raster data, and GeoSpark/SpatialHadoop were enhanced from Spark/Hadoop to handle vector data. However, few studies have systematically compared and evaluated the features and performance of these popular data containers. This paper provides a comprehensive evaluation of six popular data containers (i.e., Rasdaman, SciDB, Spark, ClimateSpark, Hive, and MongoDB) for handling multi-dimensional, array-based geospatial raster datasets. Their architectures, technologies, capabilities, and performance are compared and evaluated from two perspectives: (a) system design and architecture (distributed architecture, logical data model, physical data model, and data operations); and (b) practical use experience and performance (data preprocessing, data uploading, query speed, and resource consumption). Four major conclusions are offered: (1) no data container except ClimateSpark has good support for the HDF data format used in this paper, so the others require time- and resource-consuming data preprocessing to load data; (2) SciDB, Rasdaman, and MongoDB handle queries over small to medium data volumes well, whereas Spark and ClimateSpark can handle large volumes of data with stable resource consumption; (3) SciDB and Rasdaman provide mature array-based data operations and analytical functions, while the others lack these functions; and (4) SciDB, Spark, and Hive have better support for user-defined functions (UDFs) to extend system capability.
Pages: 22
  • [10] Baumann P, 2014, LECT NOTES COMPUT SC, V8163, P94, DOI 10.1007/978-3-642-53974-9_9