Using Unsupervised Machine Learning for Data Quality. Application to Financial Governmental Data Integration

Cited by: 0
Authors
Necba, Hanae [1]
Rhanoui, Maryem [1,2]
El Asri, Bouchra [1 ]
Affiliations
[1] Mohammed V Univ, ENSIAS, Rabat IT Ctr, IMS Team,ADMIR Lab, Rabat, Morocco
[2] Sch Informat Sci, LYRICA Lab, Meridian Team, Rabat, Morocco
Source
BIG DATA, CLOUD AND APPLICATIONS, BDCA 2018 | 2018, Vol. 872
Keywords
Machine Learning; Data quality; Name matching; Affinity propagation; Levenshtein distance; Clustering; Unsupervised learning; Scikit learn; Data integration problems; BIG DATA; MANAGEMENT;
DOI
10.1007/978-3-319-96292-4_16
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data quality means that data are correct, reliable, accurate, and valid for use, and that they serve their purpose in a given context. Data quality is crucial for making sound decisions and producing reliable reports in any organization. However, the huge volume of data that organizations produce, and the integration of redundant and heterogeneous data, make manual data quality control difficult. Intelligent technologies such as Machine Learning are therefore essential to ensure data quality across the organization. In this paper, we present an unsupervised learning approach that matches similar names and groups them into the same cluster in order to correct the data and thus ensure data quality. Our approach is validated in the context of the financial data quality of taxpayers, using scikit-learn, the machine learning library for the Python programming language.
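The abstract and keywords name Levenshtein distance, affinity propagation, and scikit-learn, but this record does not give the paper's actual pipeline. The following is therefore only a minimal sketch of how such name matching might look, under stated assumptions: the function names (`levenshtein`, `similarity_matrix`, `cluster_names`), the use of negative edit distance as the similarity, and the `damping` value are all illustrative choices, not details taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_matrix(names):
    """Pairwise similarity, assumed here to be the negative edit distance."""
    return [[-levenshtein(x, y) for y in names] for x in names]


def cluster_names(names, damping=0.5):
    """Group similar names with affinity propagation (requires numpy and scikit-learn).

    Returns a dict mapping each exemplar name to the names in its cluster.
    The damping value is an assumption, not a setting from the paper.
    """
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    similarity = np.asarray(similarity_matrix(names), dtype=float)
    model = AffinityPropagation(affinity="precomputed", damping=damping,
                                random_state=0)
    model.fit(similarity)
    if len(model.cluster_centers_indices_) == 0:
        return {}  # the algorithm did not converge to any exemplar
    clusters = {}
    for name, label in zip(names, model.labels_):
        exemplar = names[model.cluster_centers_indices_[label]]
        clusters.setdefault(exemplar, []).append(name)
    return clusters
```

Calling `cluster_names` on a list of spelling variants (for example, hypothetical entries such as "Mohamed Alami" and "Mohammed Alami") would return groups keyed by an exemplar name, which could then serve as the corrected form. How many clusters result depends on the data and on the default preference, so the grouping should be reviewed rather than trusted blindly.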
Pages: 197-209
Number of pages: 13
Related papers
50 in total
  • [31] Application of machine learning in ocean data
    Lou, Ranran
    Lv, Zhihan
    Dang, Shuping
    Su, Tianyun
    Li, Xinfang
    MULTIMEDIA SYSTEMS, 2023, 29 (03) : 1815 - 1824
  • [32] Topological data analysis via unsupervised machine learning for recognizing atmospheric river patterns on flood detection
    Ohanuba, F. O.
    Ismail, M. T.
    Ali, M. K. Majahar
    SCIENTIFIC AFRICAN, 2021, 13
  • [33] Data mining application in prosecution committee for unsupervised learning
    Liu, P
    Zhu, JX
    Liu, LJ
    Li, YH
    Zhang, XF
    2005 INTERNATIONAL CONFERENCE ON SERVICES SYSTEMS AND SERVICES MANAGEMENT, VOLS 1 AND 2, PROCEEDINGS, 2005, : 1061 - 1064
  • [34] A Two Step Unsupervised Learning Approach to Diagnose Machine Fault Using Big Data
    Sharmila, V. J.
    Florinabel, D. Jemi
    INFORMATION TECHNOLOGY AND CONTROL, 2022, 51 (01) : 78 - 85
  • [35] Quality assurance strategies for machine learning applications in big data analytics: an overview
    Ogrizovic, Mihajlo
    Draskovic, Drazen
    Bojic, Dragan
    JOURNAL OF BIG DATA, 2024, 11 (01)
  • [36] A Layered Quality Framework for Machine Learning-driven Data and Information Models
    Azimi, Shelernaz
    Pahl, Claus
    PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS (ICEIS), VOL 1, 2020, : 579 - 587
  • [37] Predicting Future Earnings Changes Using Machine Learning and Detailed Financial Data
    Chen, Xi
    Cho, Yang Ha
    Dou, Yiwei
    Lev, Baruch
    JOURNAL OF ACCOUNTING RESEARCH, 2022, 60 (02) : 467 - 515
  • [38] The effects of data quality on machine learning performance on tabular data
    Mohammed, Sedir
    Budach, Lukas
    Feuerpfeil, Moritz
    Ihde, Nina
    Nathansen, Andrea
    Noack, Nele
    Patzlaff, Hendrik
    Naumann, Felix
    Harmouch, Hazar
    INFORMATION SYSTEMS, 2025, 132
  • [39] Comparison of supervised and unsupervised machine learning techniques for UXO classification using EMI data
    Bijamov, Alex
    Shubitidze, Fridon
    Fernandez, Juan Pablo
    Shamatava, Irma
    Barrowes, Benjamin E.
    O'Neill, Kevin
    DETECTION AND SENSING OF MINES, EXPLOSIVE OBJECTS, AND OBSCURED TARGETS XVI, 2011, 8017
  • [40] Machine Learning Enhanced Framework for Big Data Modeling with Application in Industry 4.0
    Kazbekova, Gulnur
    Ismagulova, Zhuldyz
    Zhussipbek, Botagoz
    Abdrazakh, Yntymak
    Iskendirova, Gulzipa
    Toilybayeva, Nurgul
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (03) : 308 - 318