Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India

Cited by: 7
Authors
Shah, Neha [1]
Mohan, Diwakar [1]
Bashingwa, Jean Juste Harisson [2,3]
Ummer, Osama [4]
Chakraborty, Arpita [4]
LeFevre, Amnesty E. [1,5]
Affiliations
[1] Johns Hopkins Bloomberg Sch Publ Hlth, Dept Int Hlth, 615 N Wolfe St, Baltimore, MD 21205 USA
[2] Univ Cape Town, Dept Integrat Biomed Sci, Fac Hlth Sci, Cape Town, South Africa
[3] Univ Cape Town, Inst Infect Dis & Mol Med, Cape Town, South Africa
[4] Oxford Policy Management, New Delhi, India
[5] Univ Cape Town, Sch Publ Hlth & Family Med, Div Epidemiol & Biostat, Cape Town, South Africa
Source
JMIR RESEARCH PROTOCOLS | 2020 / Vol. 9 / Issue 08
Keywords
quality assurance; household survey data; machine learning; monitoring; real-time data; data analytics;
DOI
10.2196/17619
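Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Subject classification code
Abstract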
Background: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally used quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the increasing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in data collection that could compromise quality.
Objective: This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics.
Methods: In the Kilkari impact evaluation's end-line survey among postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning-based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time-stamps, "don't know" rates, and skip rates. We will also obtain labeled data from self-filled surveys and build models using k-fold cross-validation on a training data set using both supervised and unsupervised learning algorithms. Based on these models, results will be fed back to the field through various feedback loops.
Results: Data collection began in late October 2019 and will continue through March 2020. We expect to submit quality assurance results by August 2020.
Conclusions: Machine learning is underutilized as a tool to improve survey data quality in low-resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as in the routine collection of health information by health workers.
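Illustrative sketch (not from the protocol): the Methods mention creating features from time-stamps, "don't know" rates, and skip rates, and then running outlier analyses. The abstract does not specify the algorithms or the data layout, so everything below is an assumption made for illustration: the record fields (`start`, `end`, `answers`), the `DK` answer code, and the use of a simple median/MAD modified z-score rule as a stand-in for the unspecified unsupervised methods.

```python
from datetime import datetime
from statistics import median

DONT_KNOW = "DK"   # hypothetical code for a "don't know" answer
SKIPPED = None     # hypothetical marker for a skipped question


def interview_features(record):
    """Turn one interview record into numeric quality features."""
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    answers = record["answers"]
    n = len(answers)
    return {
        # Interview length in minutes, derived from the two time-stamps.
        "duration_min": (end - start).total_seconds() / 60.0,
        # Share of questions answered "don't know".
        "dont_know_rate": sum(a == DONT_KNOW for a in answers) / n,
        # Share of questions that were skipped.
        "skip_rate": sum(a is SKIPPED for a in answers) / n,
    }


def robust_outliers(values, threshold=3.5):
    """Flag values whose median/MAD-based modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be flagged by this rule.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Made-up interviews: five of plausible length and one suspiciously short.
minutes = [40, 42, 38, 41, 39, 5]
records = [
    {
        "start": "2019-11-01T10:00:00",
        "end": f"2019-11-01T10:{m:02d}:00",
        "answers": ["yes", "DK", None, "no"],
    }
    for m in minutes
]

features = [interview_features(r) for r in records]
durations = [f["duration_min"] for f in features]
flags = robust_outliers(durations)

for f, flagged in zip(features, flags):
    print(f, "<- review" if flagged else "")
```

In this made-up sample only the 5-minute interview is flagged for review. A real pipeline would likely compare interviews within enumerator or within survey section, and the supervised models described in the Methods would additionally need the labeled self-filled surveys.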
Pages: 11
Related papers
50 in total
  • [21] Data Quality for Machine Learning Tasks
    Gupta, Nitin
    Mujumdar, Shashank
    Patel, Hima
    Masuda, Satoshi
    Panwar, Naveen
    Bandyopadhyay, Sambaran
    Mehta, Sameep
    Guttula, Shanmukha
    Afzal, Shazia
    Mittal, Ruhi Sharma
    Munigala, Vitobha
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4040 - 4041
  • [22] A survey of machine learning for big data processing
    Junfei Qiu
    Qihui Wu
    Guoru Ding
    Yuhua Xu
    Shuo Feng
    EURASIP Journal on Advances in Signal Processing, 2016
  • [23] Survey on Data Management Technology for Machine Learning
    Cui J.-W.
    Zhao Z.
    Du X.-Y.
    Ruan Jian Xue Bao/Journal of Software, 2021, 32 (03): : 604 - 621
  • [24] A survey of machine learning for big data processing
    Qiu, Junfei
    Wu, Qihui
    Ding, Guoru
    Xu, Yuhua
    Feng, Shuo
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2016,
  • [25] A Survey of Synthetic Data Generation for Machine Learning
    Abufadda, Mohammad
    Mansour, Khalid
    2021 22ND INTERNATIONAL ARAB CONFERENCE ON INFORMATION TECHNOLOGY (ACIT), 2021, : 488 - 494
  • [26] A Survey of Machine Learning Methods for Big Data
    Ruiz, Zoila
    Salvador, Jaime
    Garcia-Rodriguez, Jose
    BIOMEDICAL APPLICATIONS BASED ON NATURAL AND ARTIFICIAL COMPUTING, PT II, 2017, 10338 : 259 - 267
  • [27] A study on quality control using delta data with machine learning technique
    Liang, Yufang
    Wang, Zhe
    Huang, Dawei
    Wang, Wei
    Feng, Xiang
    Han, Zewen
    Song, Biao
    Wang, Qingtao
    Zhou, Rui
    HELIYON, 2022, 8 (08)
  • [28] Machine learning for automated quality assurance in radiotherapy: A proof of principle using EPID data description
    El Naqa, Issam
    Irrer, Jim
    Ritter, Tim A.
    DeMarco, John
    Al-Hallaq, Hania
    Booth, Jeremy
    Kim, Grace
    Alkhatib, Ahmad
    Popple, Richard
    Perez, Mario
    Farrey, Karl
    Moran, Jean M.
    MEDICAL PHYSICS, 2019, 46 (04) : 1914 - 1921
  • [29] Using Machine Learning to Optimize Graph Execution on NUMA Machines
    Rocha, Hiago Mayk G. de A.
    Schwarzrock, Janaina
    Lorenzon, Arthur F.
    Beck, Antonio Carlos S.
    PROCEEDINGS OF THE 59TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC 2022, 2022, : 1027 - 1032
  • [30] Improving the Quality of Art Market Data Using Linked Open Data and Machine Learning
    Filipiak, Dominik
    Filipowska, Agata
    BUSINESS INFORMATION SYSTEMS WORKSHOPS, BIS 2016, 2017, 263 : 418 - 428