Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Cited by: 0
Authors
Necba, Hanae [1]
Rhanoui, Maryem [1, 2]
El Asri, Bouchra [1 ]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team,ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Source
BIG DATA, CLOUD AND APPLICATIONS, BDCA 2018 | 2018 / Vol. 872
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT;
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, so that they serve their purpose in a given context. Data quality is crucial for making the right decisions and producing reliable reports in every organization. However, the huge volume of data produced by organizations, together with the integration of redundant and heterogeneous data, makes manual data quality control difficult; intelligent technologies such as Machine Learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thereby ensure data quality. Our approach is validated in the context of the quality of taxpayers' financial data, using scikit-learn, the machine learning library for the Python programming language.
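The abstract and keywords name Levenshtein distance, affinity propagation, and scikit-learn, but this record does not give the authors' actual pipeline. The following is only a minimal sketch of how those pieces could fit together, assuming negated edit distance as the similarity measure. The function names (`levenshtein`, `similarity_matrix`, `cluster_names`) and the `damping` and `random_state` defaults are hypothetical choices made for illustration, not taken from the paper.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(names: Sequence[str]) -> List[List[float]]:
    """Negated edit distances, so larger values mean more similar names."""
    return [[float(-levenshtein(x, y)) for y in names] for x in names]


def cluster_names(names: Sequence[str],
                  damping: float = 0.5,
                  random_state: int = 0) -> List[int]:
    """Assign a cluster label to each name (assumed setup, not the paper's)."""
    # Third-party imports are local so the helpers above need only the stdlib.
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    similarities = np.array(similarity_matrix(names))
    model = AffinityPropagation(affinity="precomputed",
                                damping=damping,
                                random_state=random_state)
    model.fit(similarities)
    # A label of -1 can appear if the algorithm fails to converge.
    return model.labels_.tolist()
```

Calling `cluster_names` on a list of taxpayer name variants returns one integer label per name; names sharing a label are candidates for correction to a single canonical spelling. How many clusters emerge depends on the affinity propagation `preference` setting, which is left at the library default here.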
Pages: 197-209
Number of pages: 13
Related papers
50 in total
  • [41] Continuous Data Quality Management for Machine Learning based Data-as-a-Service Architectures
    Azimi, Shelernaz
    Pahl, Claus
    CLOSER: PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2021, : 328 - 335
  • [42] Linking Human And Machine Behavior: A New Approach to Evaluate Training Data Quality for Beneficial Machine Learning
    Hagendorff, Thilo
    MINDS AND MACHINES, 2021, 31 (04) : 563 - 593
  • [43] Estimating Soil Quality Indicators Using Remote Sensing Data: An Application of Machine Learning Regression Models
    Diaz-Gonzalez, Freddy A.
    Vallejo, Victoria E.
    Vuelvas, Jose
    Patino, Diego
    2023 IEEE 6TH COLOMBIAN CONFERENCE ON AUTOMATIC CONTROL, CCAC, 2023, : 38 - 43
  • [44] A Machine Learning Solution for Data Center Thermal Characteristics Analysis
    Grishina, Anastasiia
    Chinnici, Marta
    Kor, Ah-Lian
    Rondeau, Eric
    Georges, Jean-Philippe
    ENERGIES, 2020, 13 (17)
  • [45] Unsupervised machine learning methods for polymer nanocomposites data via molecular dynamics simulation
    Chen, Zhudan
    Li, Dazi
    Wan, Haixiao
    Liu, Minghui
    Liu, Jun
    MOLECULAR SIMULATION, 2020, 46 (18) : 1509 - 1521
  • [46] An Integration of Extreme Learning Machine for Classification of Big Data
    Zhou, Guanwu
    Zhao, Yulong
    Xu, Wenju
    PROCEEDINGS OF 2013 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND COMPUTER APPLICATIONS (ICSA 2013), 2013, 92 : 81 - 86
  • [47] Machine learning for data integration in human gut microbiome
    Peishun Li
    Hao Luo
    Boyang Ji
    Jens Nielsen
    Microbial Cell Factories, 21
  • [48] Data Integration Challenges for Machine Learning in Precision Medicine
    Martinez-Garcia, Mireya
    Hernandez-Lemus, Enrique
    FRONTIERS IN MEDICINE, 2022, 8
  • [49] Machine learning for data integration in human gut microbiome
    Li, Peishun
    Luo, Hao
    Ji, Boyang
    Nielsen, Jens
    MICROBIAL CELL FACTORIES, 2022, 21 (01)
  • [50] A study on quality control using delta data with machine learning technique
    Liang, Yufang
    Wang, Zhe
    Huang, Dawei
    Wang, Wei
    Feng, Xiang
    Han, Zewen
    Song, Biao
    Wang, Qingtao
    Zhou, Rui
    HELIYON, 2022, 8 (08)