Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team, ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Source
BIG DATA, CLOUD AND APPLICATIONS, BDCA 2018 | 2018 / Vol. 872
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data quality means that data are correct, reliable, accurate, and valid for their intended use in a given context. It is crucial for sound decisions and reporting in every organization. However, the huge volume of data that organizations produce, and the integration of redundant and heterogeneous data, make manual data quality control difficult. Intelligent technologies such as machine learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster so that the data can be corrected, thereby ensuring data quality. The approach is validated on the financial data of taxpayers using scikit-learn, the machine learning library for the Python programming language.
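The kind of name matching summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes, based only on the keywords of this record, that Levenshtein distance provides the similarity between names and that affinity propagation groups them. The helper names `levenshtein`, `similarity_matrix`, and `cluster_names` are hypothetical, and any input names would be the caller's own data.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(names):
    """Pairwise similarity as negative edit distance (0 means identical)."""
    return [[-levenshtein(x, y) for y in names] for x in names]


def cluster_names(names, random_state=0):
    """Group similar names with affinity propagation (hypothetical helper).

    NumPy and scikit-learn are imported here so that the distance helpers
    above remain usable without them.
    """
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    sim = np.array(similarity_matrix(names), dtype=float)
    model = AffinityPropagation(affinity="precomputed",
                                random_state=random_state)
    labels = model.fit_predict(sim)

    clusters = defaultdict(list)
    for name, label in zip(names, labels):
        # A label of -1 means the algorithm did not converge.
        clusters[int(label)].append(name)
    return dict(clusters)
```

Calling `cluster_names` on a list of name strings returns a dictionary from cluster label to the names assigned to it. Spelling variants of one name would typically share a label, although the exact grouping depends on the data and on the algorithm's preference and damping settings.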
Pages: 197-209
Number of pages: 13