The implementation of data storage and analytics platform for big data lake of electricity usage with spark

Cited by: 14
Authors
Yang, Chao-Tung [1, 2, 3]
Chen, Tzu-Yang [1]
Kristiani, Endah [4, 5]
Wu, Shyhtsun Felix [6]
Affiliations
[1] Tunghai Univ, Dept Comp Sci, Taichung 407224, Taiwan
[2] Tunghai Univ, Res Ctr Smart Sustainable Circular Econ, 1727,Sec 4,Taiwan Blvd, Taichung 407224, Taiwan
[3] Tunghai Univ, Res Ctr Nanotechnol, 1727,Sec 4,Taiwan Blvd, Taichung 407224, Taiwan
[4] Tunghai Univ, Dept Ind Engn & Enterprise Informat, Taichung 407224, Taiwan
[5] Krida Wacana Christian Univ, Dept Informat, Jakarta 11470, Indonesia
[6] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
Keywords
Big data; Data lake; Data storage; Data visualization; Electricity data; SYSTEM
DOI
10.1007/s11227-020-03505-6
CLC classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Smart meters generate a large number of electricity usage records every day. Traditional architectures may not properly handle this increasingly dynamic data, which requires flexibility. Effective storage and analytics need an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop is used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka is used as the input source for Spark, which streams the data to Apache HBase to ensure the integrity of the streaming data. To integrate the data, Hive and HBase are used following the data lake principle, with Apache Impala and Apache Phoenix serving as the search engines for Hive and HBase, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
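The abstract names the Holt-Winters algorithm for usage prediction but does not say which variant or parameters the authors used. As a rough illustration only, the sketch below implements the additive form of triple exponential smoothing in plain Python. The function name, the initialization scheme, the smoothing values, and the sample readings are all assumptions made for this example, not details from the paper.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` future values with additive Holt-Winters smoothing.

    Hypothetical helper for illustration; not the paper's implementation.
    alpha, beta, gamma are the level, trend, and seasonal smoothing factors.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]

    # Initial level: mean of the first season.
    level = sum(first) / season_len
    # Initial trend: average per-step change between the first two seasons.
    trend = (sum(second) - sum(first)) / (season_len ** 2)
    # Initial seasonal offsets: deviation of each first-season point from the level.
    seasonal = [y - level for y in first]

    for t, y in enumerate(series):
        s = seasonal[t % season_len]
        prev_level = level
        # Update level from the deseasonalized observation.
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        # Update trend from the change in level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Update the seasonal offset for this position in the cycle.
        seasonal[t % season_len] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    # h-step-ahead forecast: extrapolated level plus the matching seasonal offset.
    return [
        level + h * trend + seasonal[(n + h - 1) % season_len]
        for h in range(1, horizon + 1)
    ]


# Made-up readings with a repeating four-step cycle (e.g. four periods per day).
readings = [10.0, 20.0, 30.0, 20.0] * 3
forecast = holt_winters_additive(
    readings, season_len=4, alpha=0.5, beta=0.5, gamma=0.5, horizon=4
)
print(forecast)
```

On a perfectly periodic series like this one, the forecast simply repeats the cycle. Real meter data would call for tuning the three smoothing factors, for instance by minimizing error on a held-out window.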
Pages: 5934-5959
Page count: 26