Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Cited by: 0
Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team,ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Source
BIG DATA, CLOUD AND APPLICATIONS, BDCA 2018 | 2018 / Vol. 872
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT;
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, so that they serve their purpose in a given context. Data quality is crucial for making sound decisions and producing reliable reports in every organization. However, the huge volume of data that organizations produce, together with the integration of redundant and heterogeneous data, makes manual data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them in the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated in the context of the quality of taxpayers' financial data, using scikit-learn, the machine learning library for the Python programming language.
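As a rough illustration of the name-matching idea in the abstract, the sketch below computes Levenshtein distances between names and turns them into a similarity matrix of the kind a clustering step could consume. It is not the authors' implementation: the function names `levenshtein` and `similarity_matrix`, the choice of negated distance as similarity, and the sample names are all assumptions made for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(names: List[str]) -> List[List[int]]:
    """Negated pairwise distances: values closer to 0 mean more similar names."""
    return [[-levenshtein(x, y) for y in names] for x in names]


# Hypothetical taxpayer names, invented for illustration only.
names = ["SOCIETE ATLAS", "SOCIETE ATLASS", "STE ATLAS", "BANQUE RABAT"]
sim = similarity_matrix(names)
```

A matrix like `sim` could then be passed to scikit-learn's `AffinityPropagation` with `affinity="precomputed"`, which accepts a precomputed similarity matrix in `fit`. Whether the paper builds its similarities exactly this way is not stated in the abstract.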
Pages: 197-209
Page count: 13
Related papers
50 in total
  • [21] Amalur: The Convergence of Data Integration and Machine Learning
    Li, Ziyu
    Sun, Wenbo
    Zhan, Danning
    Kang, Yan
    Chen, Lydia
    Bozzon, Alessandro
    Hai, Rihan
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (12) : 7353 - 7367
  • [22] Data Integration and Machine Learning: A Natural Synergy
    Dong, Xin Luna
    Rekatsinas, Theodoros
    SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2018, : 1645 - 1650
  • [23] A DaQL to Monitor Data Quality in Machine Learning Applications
    Ehrlinger, Lisa
    Haunschmid, Verena
    Palazzini, Davide
    Lettner, Christian
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT I, 2019, 11706 : 227 - 237
  • [24] Overview and Importance of Data Quality for Machine Learning Tasks
    Jain, Abhinav
    Patel, Hima
    Nagalapatti, Lokesh
    Gupta, Nitin
    Mehta, Sameep
    Guttula, Shanmukha
    Mujumdar, Shashank
    Afzal, Shazia
    Mittal, Ruhi Sharma
    Munigala, Vitobha
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 3561 - 3562
  • [25] A Survey on Data Quality Dimensions and Tools for Machine Learning
    Zhou, Yuhan
    Tu, Fengjiao
    Sha, Kewei
    Ding, Junhua
    Chen, Haihua
    2024 IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE TESTING, AITEST, 2024, : 120 - 131
  • [26] Improving the Quality of Art Market Data Using Linked Open Data and Machine Learning
    Filipiak, Dominik
    Filipowska, Agata
    BUSINESS INFORMATION SYSTEMS WORKSHOPS, BIS 2016, 2017, 263 : 418 - 428
  • [27] Effective Outlier Detection for Ensuring Data Quality in Flotation Data Modelling Using Machine Learning (ML) Algorithms
    Lartey, Clement
    Liu, Jixue
    Asamoah, Richmond K.
    Greet, Christopher
    Zanin, Massimiliano
    Skinner, William
    MINERALS, 2024, 14 (09)
  • [28] Application of machine learning in ocean data
    Ranran Lou
    Zhihan Lv
    Shuping Dang
    Tianyun Su
    Xinfang Li
    Multimedia Systems, 2023, 29 : 1815 - 1824
  • [29] Application of Machine Learning for Cytometry Data
    Hu, Zicheng
    Bhattacharya, Sanchita
    Butte, Atul J.
    FRONTIERS IN IMMUNOLOGY, 2022, 12
  • [30] Data Evaluation and Enhancement for Quality Improvement of Machine Learning
    Chen, Haihua
    Chen, Jiangping
    Ding, Junhua
    IEEE TRANSACTIONS ON RELIABILITY, 2021, 70 (02) : 831 - 847