The implementation of data storage and analytics platform for big data lake of electricity usage with spark

Cited by: 14

Authors:

Yang, Chao-Tung [1,2,3]

Chen, Tzu-Yang [1]

Kristiani, Endah [4,5]

Wu, Shyhtsun Felix [6]

Affiliations:

[1] Tunghai Univ, Dept Comp Sci, Taichung 407224, Taiwan

[2] Tunghai Univ, Res Ctr Smart Sustainable Circular Econ, 1727,Sec 4,Taiwan Blvd, Taichung 407224, Taiwan

[3] Tunghai Univ, Res Ctr Nanotechnol, 1727,Sec 4,Taiwan Blvd, Taichung 407224, Taiwan

[4] Tunghai Univ, Dept Ind Engn & Enterprise Informat, Taichung 407224, Taiwan

[5] Krida Wacana Christian Univ, Dept Informat, Jakarta 11470, Indonesia

[6] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA

Source:

JOURNAL OF SUPERCOMPUTING | 2021 / Vol. 77 / Issue 06

Keywords:

Big data; Data lake; Data storage; Data visualization; Electricity data; SYSTEM;

DOI:

10.1007/s11227-020-03505-6

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Smart meters generate a large number of electricity usage records every day. A traditional architecture may not properly handle such increasingly dynamic data, which calls for flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop is used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka is used as the input source for Spark, which streams data to Apache HBase to ensure the integrity of the streaming data. To integrate the data, we apply the Data Lake principle to Hive and HBase, with Apache Impala and Apache Phoenix used separately as the search engines for Hive and HBase, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
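The abstract names the Holt-Winters algorithm for usage prediction but this record does not show how the paper applies it. As a rough illustration only, the following is a minimal sketch of the textbook additive form of Holt-Winters (triple exponential smoothing) in plain Python. The function name `holt_winters_additive`, the initialization scheme, and all parameter values are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence


def holt_winters_additive(
    series: Sequence[float],
    season_length: int,
    alpha: float,
    beta: float,
    gamma: float,
    horizon: int,
) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initial level: mean of the first season.
    level = sum(series[:m]) / m
    # Initial trend: average per-step change between the first two seasons.
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    # Initial seasonal offsets: deviation of each first-season point from the level.
    seasonal = [series[i] - level for i in range(m)]

    # Update level, trend, and seasonal components for every observation.
    for t, value in enumerate(series):
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (value - level) + (1 - gamma) * s

    # Forecast: extrapolate the trend and reuse the matching seasonal offset.
    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % m] for h in range(horizon)]
```

With a list of meter readings, a call such as `holt_winters_additive(readings, season_length=24, alpha=0.5, beta=0.1, gamma=0.3, horizon=24)` would return the next 24 forecast values; the smoothing parameters here are placeholders and would normally be tuned against held-out data.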

Citation

Pages: 5934-5959

Page count: 26

35 references in total

[1] Storage Management in AsterixDB
Alsubaiee, Sattam
Behm, Alexander
Borkar, Vinayak
Heilbron, Zachary
Kim, Young-Seok
Carey, Michael J.
Dreseler, Markus
Li, Chen
[J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 7 (10): 841 - 852
[2] [Anonymous], 2011, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
[3] [Anonymous], 2018, J AMB INTEL HUM COMP, DOI [10.1007/s12652-018-0852-x, DOI 10.1007/S12652-018-0852-X]
[4] [Anonymous], 2018, FUTURE GENER COMPUT
[5] CoreKG: a Knowledge Lake Service
Beheshti, Amin
Benatallah, Boualem
Nouri, Reza
Tabebordbar, Alireza
[J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (12): 1942 - 1945
[6] CoreDB: a Data Lake Service
Beheshti, Amin
Benatallah, Boualem
Nouri, Reza
Van Munin Chhieng
Xiong, HuangTao
Zhao, Xu
[J]. CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017: 2451 - 2454
[7] SCARFF: A scalable framework for streaming credit card fraud detection with spark
Carcillo, Fabrizio
Dal Pozzolo, Andrea
Le Borgne, Yann-Ael
Caelen, Olivier
Mazzer, Yannis
Bontempi, Gianluca
[J]. INFORMATION FUSION, 2018, 41 : 182 - 194
[8] Chen HC, 2012, MIS QUART, V36, P1165
[9] Chen Liu, 2015, [KIPS Transactions on Software and Data Engineering, 정보처리학회논문지. 소프트웨어 및 데이터 공학], V4, P77, DOI 10.3745/KTSDE.2015.4.2.77
[10] Chen TY, 2018, INT C FRONT COMP, P99
