Efficient detection of data entry errors in large-scale public health surveys: an unsupervised machine learning approach

Cited by: 0
Authors
Sau, Arkaprabha [1, 2]
Phadikar, Santanu [1]
Bhakta, Ishita [3]
Affiliations
[1] Maulana Abul Kalam Azad Univ Technol, Dept Comp Sci & Engn, BF-142,Sect 1, Kolkata 700064, W Bengal, India
[2] Minist Labour & Employment, Reg Labour Inst, DGFASLI, Govt India, Kanpur, Uttar Pradesh, India
[3] Techno Main Salt Lake, Dept Informat Technol, Kolkata, W Bengal, India
Keywords
Anomaly; Data entry errors; DBSCAN; Machine learning; Public health; Unsupervised learning; Anomaly detection
DOI
10.1186/s12982-024-00245-3
Chinese Library Classification number
R1 [Preventive medicine, hygiene]
Discipline classification code
1004; 120402
Abstract
Data entry errors in large-scale public health surveys can undermine the effectiveness of data-driven interventions, so identifying them is crucial for public health experts. In surveys of this size, it is nearly impossible for domain experts to verify every data point manually. This study evaluates unsupervised machine learning algorithms for detecting these errors, focusing on the 'weight' parameter in the Annual Health Survey (AHS) dataset. The AHS, conducted by the Ministry of Health and Family Welfare, Government of India, in collaboration with the Registrar General of India, is a large-scale, stratified, household-level survey targeting maternal and child health across nine states in India. The dataset is freely available on the Open Government Data (OGD) Platform of India for public health research. Five algorithms were applied to detect erroneous data entries: DBSCAN, K-Means, Gaussian Mixture Model (GMM), Isolation Forest (IF), and One-Class SVM (1C-SVM). The evaluation involved comprehensive preprocessing and feature engineering to optimize detection capabilities. Each algorithm was assessed on precision, recall, accuracy, false anomaly rate, and missed anomaly rate. DBSCAN performed best, achieving a recall of 94.7% and a precision of 81.9%, which makes it highly effective for this task. The findings underscore the potential of unsupervised machine learning to automate the detection of data entry errors and thereby improve the integrity of public health data. This research contributes to the advancement of precision public health, supporting more accurate and reliable evidence-based decision-making and policy formulation.
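To make the idea in the abstract concrete, the following is a minimal sketch of density-based error flagging on a single 'weight' column, followed by the five metrics the abstract names. It is not the authors' pipeline: the abstract does not describe their preprocessing, features, or parameters, so the toy weights, the `eps` and `min_samples` values, the ground-truth labels, and the function names `dbscan_1d` and `evaluate` are all hypothetical. The definitions used here for the false anomaly rate (FP / (FP + TN)) and the missed anomaly rate (FN / (FN + TP)) are likewise assumptions.

```python
from typing import Dict, List, Optional

NOISE = -1  # label for points that belong to no dense region


def dbscan_1d(values: List[float], eps: float, min_samples: int) -> List[int]:
    """Plain DBSCAN on one-dimensional data; returns a label per value (-1 = noise)."""
    n = len(values)
    labels: List[Optional[int]] = [None] * n  # None = not yet visited

    def neighbours(i: int) -> List[int]:
        # Indices of all points within eps of values[i], including i itself
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # j is a core point, keep expanding
                queue.extend(reachable)
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


def evaluate(flagged: List[bool], truth: List[bool]) -> Dict[str, float]:
    """Compare flagged entries with known errors (metric definitions are assumptions)."""
    tp = sum(f and t for f, t in zip(flagged, truth))
    fp = sum(f and not t for f, t in zip(flagged, truth))
    fn = sum(t and not f for f, t in zip(flagged, truth))
    tn = sum(not f and not t for f, t in zip(flagged, truth))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(truth),
        "false_anomaly_rate": fp / (fp + tn) if fp + tn else 0.0,
        "missed_anomaly_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical weights in kg; 31.0 and 0.3 imitate a misplaced decimal point.
weights = [3.1, 3.2, 2.9, 3.0, 3.3, 2.8, 3.4, 31.0, 3.05, 0.3]
# Hypothetical ground truth: index 8 is a wrong but plausible-looking entry.
truth = [False, False, False, False, False, False, False, True, True, True]

labels = dbscan_1d(weights, eps=0.3, min_samples=3)
flagged = [label == NOISE for label in labels]
metrics = evaluate(flagged, truth)
print(flagged)
print(metrics)
```

In this toy setup the two extreme values fall outside every dense region and are flagged, while the plausible-looking wrong entry is missed. That is the kind of trade-off the precision, recall, and missed anomaly figures are meant to capture.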
Pages: 21