Efficient detection of data entry errors in large-scale public health surveys: an unsupervised machine learning approach

Citations: 0
Authors
Sau, Arkaprabha [1, 2]
Phadikar, Santanu [1]
Bhakta, Ishita [3]
Affiliations
[1] Maulana Abul Kalam Azad Univ Technol, Dept Comp Sci & Engn, BF-142,Sect 1, Kolkata 700064, W Bengal, India
[2] Minist Labour & Employment, Reg Labour Inst, DGFASLI, Govt India, Kanpur, Uttar Pradesh, India
[3] Techno Main Salt Lake, Dept Informat Technol, Kolkata, W Bengal, India
Keywords
Anomaly; Data entry errors; DBSCAN; Machine learning; Public health; Unsupervised learning; Anomaly detection
DOI
10.1186/s12982-024-00245-3
Chinese Library Classification (CLC) number
R1 [Preventive medicine, hygiene]
Discipline classification code
1004; 120402
Abstract
Data entry errors in large-scale public health surveys can undermine the effectiveness of data-driven interventions, so identifying them is crucial for public health experts. At this scale, it is nearly impossible for domain experts to verify every data point manually. This study evaluates unsupervised machine learning algorithms for detecting these errors, focusing on the 'weight' parameter in the Annual Health Survey (AHS) dataset. The AHS, conducted by the Ministry of Health and Family Welfare, Government of India, in collaboration with the Registrar General of India, is a large-scale, stratified, household-level survey targeting maternal and child health across nine states in India. The dataset is freely available on the Open Government Data (OGD) Platform of India for public health research. Five algorithms were applied to detect erroneous data entries: DBSCAN, K-Means, Gaussian Mixture Model (GMM), Isolation Forest (IF), and One-Class SVM (1C-SVM). The evaluation involved comprehensive preprocessing and feature engineering to optimize detection capabilities. Each algorithm was assessed using precision, recall, accuracy, false anomaly rate, and missed anomaly rate. DBSCAN performed best, achieving a recall of 94.7% and a precision of 81.9%, which makes it highly effective for this task. The findings underscore the potential of unsupervised machine learning to automate the detection of data entry errors and thereby improve the integrity of public health data. This research contributes to the advancement of precision public health, supporting more accurate and reliable evidence-based decision-making and policy formulation.
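To make the approach in the abstract concrete, the following is a minimal sketch of density-based error flagging on a single 'weight' column, followed by the evaluation metrics the abstract names. It is not the authors' pipeline: the weights, the `eps` and `min_samples` values, and the helper names `dbscan_noise_mask` and `detection_metrics` are all hypothetical, the paper's preprocessing and feature engineering are omitted, and the formulas for false anomaly rate (FP / (FP + TN)) and missed anomaly rate (FN / (FN + TP)) are assumed definitions that the abstract does not state. Only DBSCAN's noise rule is implemented, in plain Python, rather than a full clustering.

```python
def dbscan_noise_mask(values, eps, min_samples):
    """Return True for points DBSCAN would label as noise (1-D, O(n^2) sketch)."""
    n = len(values)
    # Each neighbourhood includes the point itself, as in the usual DBSCAN definition.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbours]
    # A point is noise when it is neither a core point nor within eps of a core point.
    return [
        not is_core[i] and not any(is_core[j] for j in neighbours[i])
        for i in range(n)
    ]


def detection_metrics(flagged, is_error):
    """Compare flagged anomalies against known errors (assumed metric definitions)."""
    pairs = list(zip(flagged, is_error))
    tp = sum(f and e for f, e in pairs)          # errors correctly flagged
    fp = sum(f and not e for f, e in pairs)      # valid entries wrongly flagged
    fn = sum(not f and e for f, e in pairs)      # errors that were missed
    tn = sum(not f and not e for f, e in pairs)  # valid entries left alone
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Assumed definitions; the abstract does not spell these out.
        "false_anomaly_rate": fp / (fp + tn) if fp + tn else 0.0,
        "missed_anomaly_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical adult weights in kg: nine plausible values, one unusual but
# genuine value (95.0), and two simulated entry errors (6.5 and 650.0).
weights = [52.0, 55.5, 58.0, 60.0, 61.5, 63.0, 65.0, 67.5, 70.0, 95.0, 6.5, 650.0]
is_error = [False] * 10 + [True, True]

flagged = dbscan_noise_mask(weights, eps=5.0, min_samples=3)
metrics = detection_metrics(flagged, is_error)
print(flagged)
print(metrics)
```

In this toy data the genuine but unusual value 95.0 is flagged along with the two simulated errors, which illustrates why precision can sit below recall: a density-based detector marks anything isolated, whether or not it is a mistake.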
Pages: 21