Efficient detection of data entry errors in large-scale public health surveys: an unsupervised machine learning approach

Cited by: 0

Authors:

Sau, Arkaprabha [1,2]

Phadikar, Santanu [1]

Bhakta, Ishita [3]

Affiliations:

[1] Maulana Abul Kalam Azad Univ Technol, Dept Comp Sci & Engn, BF-142,Sect 1, Kolkata 700064, W Bengal, India

[2] Minist Labour & Employment, Reg Labour Inst, DGFASLI, Govt India, Kanpur, Uttar Pradesh, India

[3] Techno Main Salt Lake, Dept Informat Technol, Kolkata, W Bengal, India

Source:

DISCOVER PUBLIC HEALTH | 2024 / Vol. 21 / Issue 01

Keywords:

Anomaly; Data entry errors; DBSCAN; Machine learning; Public health; Unsupervised learning; Anomaly detection

DOI:

10.1186/s12982-024-00245-3

Chinese Library Classification (CLC) number:

R1 [Preventive medicine, hygiene]

Discipline classification codes:

1004; 120402

Abstract:

Data entry errors in large-scale public health surveys can undermine the effectiveness of data-driven interventions. Therefore, identifying these data entry errors is crucial for public health experts. In large-scale public health surveys, manually verifying the accuracy of every data point by domain experts is nearly impossible. This study evaluates unsupervised machine learning algorithms for detecting these errors, focusing on the 'weight' parameter in the Annual Health Survey (AHS) dataset. The AHS, conducted by the Ministry of Health and Family Welfare, Government of India, in collaboration with the Registrar General of India, is a large-scale, stratified, household-level survey targeting maternal and child health across nine states in India. The dataset is freely available on the Open Government Data (OGD) Platform of India for public health research. In this study, five algorithms were applied to detect erroneous data entries: DBSCAN, K-Means, Gaussian Mixture Model (GMM), Isolation Forest (IF), and One-Class SVM (1C-SVM). The evaluation process involved comprehensive preprocessing and feature engineering to optimize detection capabilities. Performance metrics such as precision, recall, accuracy, false anomaly rate, and missed anomaly rate were used to assess each algorithm. Among these, DBSCAN demonstrated superior performance, achieving a recall of 94.7% and a precision of 81.9%, making it highly effective for this task. The findings underscore the potential of unsupervised machine learning in automating the detection of data entry errors, thereby improving the integrity of public health data. This research contributes to the advancement of precision public health, supporting more accurate and reliable evidence-based decision-making and policy formulation.
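The density-based detection and the evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it uses a small pure-Python DBSCAN on one-dimensional values, and the toy weights, the `eps` and `min_samples` settings, and the function names are all invented for the example. The formulas for false anomaly rate (FP / (FP + TN)) and missed anomaly rate (FN / (FN + TP)) are assumptions, since the abstract does not define them.

```python
from collections import deque


def dbscan_1d(values, eps, min_samples):
    """Return a cluster id per value; -1 marks noise (a candidate entry error)."""
    n = len(values)
    # Brute-force eps-neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


def detection_metrics(is_error, flagged):
    """Confusion-matrix metrics; the two 'rate' definitions are assumptions."""
    pairs = list(zip(is_error, flagged))
    tp = sum(1 for e, f in pairs if e and f)
    fp = sum(1 for e, f in pairs if not e and f)
    fn = sum(1 for e, f in pairs if e and not f)
    tn = sum(1 for e, f in pairs if not e and not f)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
        "false_anomaly_rate": fp / (fp + tn) if fp + tn else 0.0,
        "missed_anomaly_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Invented toy weights in kg (not AHS data), paired with a ground-truth error flag.
records = [
    (52.0, False), (55.5, False), (58.0, False), (60.2, False), (61.0, False),
    (63.5, False), (65.0, False), (67.3, False), (70.0, False), (72.4, False),
    (95.0, False),   # genuine but rare value
    (6.5, True),     # e.g. a misplaced decimal point
    (580.0, True),   # e.g. an extra digit
    (702.0, True),
    (59.0, True),    # wrong but plausible-looking, so density cannot expose it
]
weights = [w for w, _ in records]
is_error = [e for _, e in records]

labels = dbscan_1d(weights, eps=5.0, min_samples=3)
flagged = [label == -1 for label in labels]  # noise points are the suspects
metrics = detection_metrics(is_error, flagged)

print(sorted(w for w, f in zip(weights, flagged) if f))
print(metrics)
```

The toy data is built to show both failure modes that the metrics capture: a rare but genuine value that gets flagged (a false anomaly) and an erroneous value that sits inside the dense range and is not flagged (a missed anomaly). On real survey data a library implementation and tuned parameters would be the natural choice; the brute-force neighbourhood search here is quadratic and only suitable for illustration.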


Pages: 21
