Efficient detection of data entry errors in large-scale public health surveys: an unsupervised machine learning approach

Authors
Sau, Arkaprabha [1, 2]
Phadikar, Santanu [1 ]
Bhakta, Ishita [3 ]
Affiliations
[1] Maulana Abul Kalam Azad Univ Technol, Dept Comp Sci & Engn, BF-142,Sect 1, Kolkata 700064, W Bengal, India
[2] Minist Labour & Employment, Reg Labour Inst, DGFASLI, Govt India, Kanpur, Uttar Pradesh, India
[3] Techno Main Salt Lake, Dept Informat Technol, Kolkata, W Bengal, India
Keywords
Anomaly; Data entry errors; DBSCAN; Machine learning; Public health; Unsupervised learning; Anomaly detection
DOI
10.1186/s12982-024-00245-3
Chinese Library Classification (CLC) number
R1 [Preventive medicine, hygiene]
Discipline classification code
1004; 120402
Abstract
Data entry errors in large-scale public health surveys can undermine the effectiveness of data-driven interventions, so identifying them is crucial for public health experts. At this scale, it is nearly impossible for domain experts to verify the accuracy of every data point manually. This study evaluates unsupervised machine learning algorithms for detecting these errors, focusing on the 'weight' parameter in the Annual Health Survey (AHS) dataset. The AHS, conducted by the Ministry of Health and Family Welfare, Government of India, in collaboration with the Registrar General of India, is a large-scale, stratified, household-level survey targeting maternal and child health across nine states in India. The dataset is freely available on the Open Government Data (OGD) Platform of India for public health research. Five algorithms were applied to detect erroneous data entries: DBSCAN, K-Means, Gaussian Mixture Model (GMM), Isolation Forest (IF), and One-Class SVM (1C-SVM). The evaluation involved comprehensive preprocessing and feature engineering to optimize detection capabilities. Each algorithm was assessed on precision, recall, accuracy, false anomaly rate, and missed anomaly rate. DBSCAN performed best, achieving a recall of 94.7% and a precision of 81.9%, which makes it highly effective for this task. The findings underscore the potential of unsupervised machine learning to automate the detection of data entry errors and thereby improve the integrity of public health data. This research contributes to the advancement of precision public health, supporting more accurate and reliable evidence-based decision-making and policy formulation.
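The abstract does not give the features, parameters, or preprocessing used in the study, so the following is only a minimal sketch of the general idea: run DBSCAN on a weight column and treat the points it leaves unclustered (noise) as suspected data entry errors. The weights, the ground-truth error flags, and the `eps` and `min_samples` values are all hypothetical, and the small pure-Python, one-dimensional DBSCAN stands in for whatever implementation the authors used.

```python
from bisect import bisect_left, bisect_right

NOISE = -1


def dbscan_1d(values, eps, min_samples):
    """Cluster 1-D values with DBSCAN; unclustered points get the label NOISE."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_vals = [values[i] for i in order]

    def neighbours(pos):
        # Sorted positions within eps of sorted_vals[pos], the point itself included.
        lo = bisect_left(sorted_vals, sorted_vals[pos] - eps)
        hi = bisect_right(sorted_vals, sorted_vals[pos] + eps)
        return range(lo, hi)

    labels = [None] * len(values)  # indexed by sorted position
    cluster = 0
    for pos in range(len(sorted_vals)):
        if labels[pos] is not None:
            continue
        seeds = neighbours(pos)
        if len(seeds) < min_samples:
            labels[pos] = NOISE  # may later be relabelled as a border point
            continue
        labels[pos] = cluster
        queue = list(seeds)
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point: joins the cluster, not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = neighbours(q)
            if len(q_neighbours) >= min_samples:  # core point: keep expanding
                queue.extend(q_neighbours)
        cluster += 1

    # Map labels back to the original input order.
    result = [NOISE] * len(values)
    for pos, i in enumerate(order):
        result[i] = labels[pos]
    return result


def precision_recall(flagged, is_error):
    """Precision and recall of the flagged anomalies against known errors."""
    tp = sum(f and e for f, e in zip(flagged, is_error))
    fp = sum(f and not e for f, e in zip(flagged, is_error))
    fn = sum(e and not f for f, e in zip(flagged, is_error))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical weights in kg. 520.0 and 6.1 mimic decimal-point slips;
# 75.0 is a genuine but unusual value.
weights = [52.0, 53.5, 55.0, 56.2, 57.8, 58.0, 59.4, 60.1, 61.3, 62.0,
           63.7, 65.0, 520.0, 6.1, 58.5, 75.0]
is_error = [False] * 12 + [True, True, False, False]

labels = dbscan_1d(weights, eps=3.0, min_samples=3)  # assumed parameters
flagged = [label == NOISE for label in labels]
precision, recall = precision_recall(flagged, is_error)
print([w for w, f in zip(weights, flagged) if f], precision, recall)
```

In this toy data both simulated errors are flagged, and so is the genuine 75.0 kg record, which lowers precision while recall stays at 1. That is the kind of trade-off the precision and recall figures in the abstract summarise, although the numbers here say nothing about the study's actual results.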
Pages: 21