Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis

Cited by: 7
Authors
Elouataoui, Widad [1]
Elmendili, Saida [1]
Gahi, Youssef [1]
Affiliations
[1] Ibn Tofail Univ, Natl Sch Appl Sci, Lab Engn Sci, Kenitra 14000, Morocco
Keywords
Data integrity; Big Data; Anomaly detection; Organizations; Reliability; Measurement; Data models; big data; big data quality; data quality dimensions; quality anomaly score
DOI
10.1109/ACCESS.2023.3317354
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The increasing reliance on Big Data analytics has highlighted the critical role of data quality in ensuring accurate and reliable results. Consequently, organizations aiming to leverage Big Data treat data quality as an integral component of their analytics. One notable type of data quality anomaly observed in big datasets is the presence of outlier values. Detecting and addressing these outliers has become a subject of interest across diverse domains, leading to the development of numerous anomaly detection approaches. Although anomaly detection practices have proliferated in recent years, a significant gap remains in addressing anomalies related to other aspects of data quality. Indeed, while most approaches focus on identifying values that deviate from expected patterns, they do not consider irregularities in data quality, such as missing, incorrect, or inconsistent data. Moreover, most approaches are domain-specific and cannot detect anomalies in a generic manner. This paper aims to address this gap and provide a holistic and effective solution for Big Data quality anomaly detection. To achieve this, we propose a novel approach that comprehensively detects Big Data quality anomalies related to six quality dimensions: Accuracy, Consistency, Completeness, Conformity, Uniqueness, and Readability. The framework also detects generic data quality anomalies through an intelligent anomaly detection model that is not tied to any specific field. Furthermore, we introduce and measure a new metric called "Quality Anomaly Score," which refers to the degree of anomalousness of the quality anomalies of each quality dimension and of the entire dataset. In our implementation and evaluation, the framework achieved an accuracy of up to 99.91% and an F1-score of 98.07%.
Pages: 103306-103318
Number of pages: 13