Cleaning Big Data Streams: A Systematic Literature Review

Cited by: 7
Authors
Alotaibi, Obaid [1,2]
Pardede, Eric [2]
Tomy, Sarath [3]
Bagui, Sikha
Iacono, Mauro
Affiliations
[1] Shaqra Univ, Coll Arts & Sci, Dept Comp Sci, Sajir Campus, Sajir City 11951, Saudi Arabia
[2] La Trobe Univ, Sch Engn & Math Sci, Dept Comp Sci & Informat Technol, Melbourne Campus, Melbourne, Vic 3086, Australia
[3] La Trobe Univ, Sch Engn & Math Sci, Dept Comp Sci & Informat Technol, Bendigo Campus, Flora Hill, Vic 3552, Australia
Keywords
clean; big data; stream; machine learning; deep learning; artificial intelligence; missing value; outliers; duplicate data; irrelevant data; OUTLIER DETECTION; ANOMALY DETECTION; FRAMEWORK
DOI
10.3390/technologies11040101
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
In today's big data era, cleaning big data streams has become a challenging task because of the variety of data formats and the massive volume of data being generated. Many studies have proposed techniques to overcome these challenges, such as cleaning big data in real time. This systematic literature review presents recently developed techniques used for the cleaning process and for each data cleaning issue. Following the PRISMA framework, we search four databases, namely IEEE Xplore, ACM Library, Scopus, and ScienceDirect, to select relevant studies. We then identify the techniques that have been used to clean big data streams and the evaluation methods used to examine their efficiency. We also define the cleaning issues that may appear during the cleaning process, namely missing values, duplicated data, outliers, and irrelevant data. Based on our study, we identify future directions for cleaning big data streams.
Pages: 24